Date: Tue, 17 Mar 2020 09:43:49 +0100
From: Christian Brauner
To: Adrian Reber
Cc: Eric Biederman, Pavel Emelyanov, Oleg Nesterov, Dmitry Safonov <0x7f454c46@gmail.com>, Andrei Vagin, linux-kernel@vger.kernel.org, Mike Rapoport, Radostin Stoyanov, Michael Kerrisk, Arnd Bergmann, Cyrill Gorcunov, Thomas Gleixner
Subject: Re: clone3: allow creation of time namespace with offset
Message-ID: <20200317084349.fkmpj4tpdmsv6trj@wittgenstein>
References: <20200317083043.226593-1-areber@redhat.com> <20200317084154.m2u76jqj5f47mxqc@wittgenstein>
In-Reply-To: <20200317084154.m2u76jqj5f47mxqc@wittgenstein>
X-Mailing-List: linux-kernel@vger.kernel.org

On Tue, Mar 17, 2020 at 09:41:55AM +0100,
Christian Brauner wrote:
> On Tue, Mar 17, 2020 at 09:30:40AM +0100, Adrian Reber wrote:
> > This is an attempt to add time namespace support to clone3(). I am not
> > really sure which way clone3() should handle time namespaces. The time
> > namespace through /proc cannot be used with clone3() because the offsets
> > for the time namespace need to be written before a process has been
> > created in that time namespace. This means it is necessary to somehow
> > tell clone3() the offsets for the clocks.
> >
> > The time namespace offers the possibility to set offsets for
> > CLOCK_MONOTONIC and CLOCK_BOOTTIME. My first approach was to extend
> > 'struct clone_args' with '__aligned_u64 monotonic_offset' and
> > '__aligned_u64 boottime_offset'. The problem with this approach was that
> > it was not possible to set nanoseconds for the clocks in the time
> > namespace.
> >
> > One of the motivations for clone3() with CLONE_NEWTIME was to enable
> > CRIU to restore a process in a time namespace with the corresponding
> > offsets. And although the nanosecond value can probably never be
> > restored to the same value it had during checkpointing, because the
> > clock keeps on running between CRIU pausing all processes and CRIU
> > actually reading the value of the clocks, the nanosecond value is still
> > necessary for CRIU to not restore a process where the clock jumps back
> > due to CRIU restoring it with a nanosecond value that is too small.
> >
> > Requiring nanoseconds as well as seconds for two clocks during clone3()
> > means that it would require 4 additional members to 'struct clone_args':
> >
> >         __aligned_u64 tls;
> >         __aligned_u64 set_tid;
> >         __aligned_u64 set_tid_size;
> > +       __aligned_u64 boottime_offset_seconds;
> > +       __aligned_u64 boottime_offset_nanoseconds;
> > +       __aligned_u64 monotonic_offset_seconds;
> > +       __aligned_u64 monotonic_offset_nanoseconds;
> >  };
> >
> > To avoid four additional members to 'struct clone_args' this patchset
> > uses another approach:
> >
> >         __aligned_u64 tls;
> >         __aligned_u64 set_tid;
> >         __aligned_u64 set_tid_size;
> > +       __aligned_u64 timens_offset;
> > +       __aligned_u64 timens_offset_size;
>
> Hm, so for set_tid we did set_tid and set_tid_size which makes sense
> because set_tid wasn't actually a struct. But I'm not a fan of
> establishing a pattern whereby we always have to grow two members, the
> object and its size; at least when we're adding a struct.
> So at a first glance here are two possible ideas:
> - Don't add a size argument and assume that struct timens_offset won't
>   grow. I'm not sure how likely it is that it will grow.
> - Make the first member of struct timens_offset the size of the
>   struct. (See examples for this pattern in the sched syscalls.)

Oh, and I should point out right away that I consider this material for
the v5.8 merge window. It's too late in this cycle to land this with any
confidence in v5.7. Just so there's no disappointment. :)
The good news is that this leaves us with ample time to figure this out.

Christian
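
For illustration only, the second idea (size as the first member of the
struct) could look roughly like the userspace sketch below. Nothing here
comes from the patchset: the layout of struct timens_offset, its field
names, and the helper timens_offset_copy() are all hypothetical, and the
helper merely imitates the kind of check the kernel would do when copying
an extensible struct from userspace.

```c
#include <errno.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/*
 * Hypothetical layout; field names are assumptions, not taken from the
 * patchset. The size member comes first so the receiver can tell which
 * version of the struct the caller was built against.
 */
struct timens_offset {
	uint64_t size;            /* sizeof(struct timens_offset) as seen by the caller */
	uint64_t monotonic_sec;
	uint64_t monotonic_nsec;
	uint64_t boottime_sec;
	uint64_t boottime_nsec;
};

/*
 * Userspace stand-in for the receiving side (hypothetical helper).
 * - Older callers pass a smaller size: missing fields are zero-filled.
 * - Newer callers pass a larger size: accepted only if every byte the
 *   receiver does not know about is zero.
 * Unlike a real kernel copy, this trusts src to hold at least 'size' bytes.
 */
static int timens_offset_copy(struct timens_offset *dst, const void *src)
{
	const unsigned char *p = src;
	size_t ksize = sizeof(*dst);
	uint64_t usize;
	size_t i;

	memcpy(&usize, src, sizeof(usize));
	if (usize < sizeof(usize))
		return -EINVAL;

	/* Reject unknown trailing fields that carry non-zero data. */
	for (i = ksize; i < usize; i++)
		if (p[i] != 0)
			return -E2BIG;

	memset(dst, 0, ksize);
	memcpy(dst, src, usize < ksize ? (size_t)usize : ksize);
	dst->size = ksize;
	return 0;
}
```

With this shape 'struct clone_args' would only need a single pointer
member, and struct timens_offset could still grow later without a second
size member next to it.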