Received: by 2002:a25:31c3:0:0:0:0:0 with SMTP id x186csp6668422ybx; Mon, 11 Nov 2019 12:42:47 -0800 (PST) X-Google-Smtp-Source: APXvYqywxpRnCvGsW2LijI0tW2NgTgntT2dn/iy+nJz9/7iNH8VADnkkyRyVCca+jyRnc8FUQSBE X-Received: by 2002:a17:906:edd2:: with SMTP id sb18mr25142051ejb.112.1573504967460; Mon, 11 Nov 2019 12:42:47 -0800 (PST) ARC-Seal: i=1; a=rsa-sha256; t=1573504967; cv=none; d=google.com; s=arc-20160816; b=l8ZQMqxXBkHcYZ0emrl1WUk1oz8p/RhnbAJ8ICrzouQrWFrIqKqsw5aOnLOYZC8PZ/ 3BiXne9qWu+r5xInuj4YiVu1nyntngq7rXaRqvpGYsOrPKKyS13z+yuHKxObTwu4d2Zm DxMugQ2ayQ5bjHANTBVfKRneZmp1+z60B1GFrBm9+Oq51YpBLiYVIH9qR4ZvA7ue4Zgf z0IttrIJ235X3LqXcglfNPBR13nu4mAOOm7NEKGMowR/c2pwHJXUwcHNJg4ZA2RzomkV BL8sDfjUYo6vmMSrKWyNT5gXJb3gUiDrd2Fq7dwkElzKEQuhJ/XQtFGz5WnNPKg6rz4P Mvcw== ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=arc-20160816; h=list-id:precedence:sender:content-transfer-encoding :content-language:in-reply-to:mime-version:user-agent:date :message-id:from:references:cc:to:subject:dkim-signature; bh=8K8x18JCJt+y0F/d1syChnD1Af3sC+4itc7RyBI12Eg=; b=OEcs1R9nGiT1klBxa+eud1ng7+9Ci5uk8VrYcSy0czm8E31AZkRaW7hMDwgFb4ZVIj Q22Rf2VCKb//2ZziXnyF2PpC0z1exIdMkI6BxsjWOQZE53Y/MYltq3TLpARkG7k3sm0S pOwvZqUQM6wYd5aiikTwDXUzOGtvPQVk3NwscjiNdDQwR4Pam0gfuxgAW63sIwroz5pL 3JmMhq67Vx3jbnj0NqqiSCT2BJzfY43EKSVkS9Bf1AckRlzZlXomSyZ/G7IttDoBtWIo Vlla2DroyKr3d7D1rx9OCTm0vhtaKr07oQvqlXCmCVkMg/LqEuVz/bcrahZd8W/p2Xms lAnA== ARC-Authentication-Results: i=1; mx.google.com; dkim=pass header.i=@rasmusvillemoes.dk header.s=google header.b=ISCoYDFH; spf=pass (google.com: best guess record for domain of linux-kernel-owner@vger.kernel.org designates 209.132.180.67 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org Return-Path: Received: from vger.kernel.org (vger.kernel.org. [209.132.180.67]) by mx.google.com with ESMTP id t11si10016564eju.106.2019.11.11.12.42.22; Mon, 11 Nov 2019 12:42:47 -0800 (PST) Received-SPF: pass (google.com: best guess record for domain of linux-kernel-owner@vger.kernel.org designates 209.132.180.67 as permitted sender) client-ip=209.132.180.67; Authentication-Results: mx.google.com; dkim=pass header.i=@rasmusvillemoes.dk header.s=google header.b=ISCoYDFH; spf=pass (google.com: best guess record for domain of linux-kernel-owner@vger.kernel.org designates 209.132.180.67 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1727355AbfKKUln (ORCPT + 99 others); Mon, 11 Nov 2019 15:41:43 -0500 Received: from mail-wr1-f67.google.com ([209.85.221.67]:45029 "EHLO mail-wr1-f67.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1726877AbfKKUln (ORCPT ); Mon, 11 Nov 2019 15:41:43 -0500 Received: by mail-wr1-f67.google.com with SMTP id f2so16089735wrs.11 for ; Mon, 11 Nov 2019 12:41:41 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=rasmusvillemoes.dk; s=google; h=subject:to:cc:references:from:message-id:date:user-agent :mime-version:in-reply-to:content-language:content-transfer-encoding; bh=8K8x18JCJt+y0F/d1syChnD1Af3sC+4itc7RyBI12Eg=; b=ISCoYDFHCnsxDIGOLnL3ikYvobtu5KjID6cfMKw+qehnDoiXOXDShQJtf8hBXsbUQf TGtyrXdClPy+S0sId7BfnX+zW7BOUoewe4xYcDZ+EDW8XVO3bc1iVw9m/OqNtZfakd69 bU1WUG06oT8Dak6BW5N19UEQgoSdM99G6HLTs= X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20161025; h=x-gm-message-state:subject:to:cc:references:from:message-id:date :user-agent:mime-version:in-reply-to:content-language :content-transfer-encoding; bh=8K8x18JCJt+y0F/d1syChnD1Af3sC+4itc7RyBI12Eg=; b=uiqOs6JoTq3Nq673rjB1VbKPDit3GfppBbOhiT6QIMP6ImpTLY1i6WA226w7N+hP1O FBmczvrsp/lnNgFgjWWTSNHNoxDeP6Zwek2z9yLrGO6hYWwiCblRafbkFbY9ueiiLA0C DlW5flKaoU+/upTit/VpqF8IX1V0MjuyRn69/pxoJJ8YGmU3WTBdyyro7oMNO5LtUdV7 bWSKlDQQkY+12kj/Rn2xMCDf2tvUpmb95ZvnBaYv1HlSYEu8M3pKlZdIyOBWY5YG8e0k JqrHtobp/ochxbbVC+ckAlGpn9jydR7ZYlHWftl5YJO1jWuAtfdnqSfWDpXz4js1oldv PYaw== X-Gm-Message-State: APjAAAVBuqkFIwC+LUGfKYtWAzyVLZ1V+3d3vygbTKXnqMlly8hN7JUH 72YMIsOP5Cee6h6+m92KG3JczX75suGlNNBK X-Received: by 2002:a05:6000:103:: with SMTP id o3mr24351465wrx.80.1573504900980; Mon, 11 Nov 2019 12:41:40 -0800 (PST) Received: from [192.168.1.149] (ip-5-186-115-54.cgn.fibianet.dk. [5.186.115.54]) by smtp.gmail.com with ESMTPSA id o189sm1020173wmo.23.2019.11.11.12.41.39 (version=TLS1_2 cipher=ECDHE-RSA-AES128-GCM-SHA256 bits=128/128); Mon, 11 Nov 2019 12:41:40 -0800 (PST) Subject: Re: [PATCH v7 1/2] fork: extend clone3() to support setting a PID To: Adrian Reber , Christian Brauner , Eric Biederman , Pavel Emelyanov , Jann Horn , Oleg Nesterov , Dmitry Safonov <0x7f454c46@gmail.com> Cc: linux-kernel@vger.kernel.org, Andrei Vagin , Mike Rapoport , Radostin Stoyanov References: <20191111131704.656169-1-areber@redhat.com> From: Rasmus Villemoes Message-ID: Date: Mon, 11 Nov 2019 21:41:39 +0100 User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:60.0) Gecko/20100101 Thunderbird/60.9.0 MIME-Version: 1.0 In-Reply-To: <20191111131704.656169-1-areber@redhat.com> Content-Type: text/plain; charset=windows-1252 Content-Language: en-US Content-Transfer-Encoding: 7bit Sender: linux-kernel-owner@vger.kernel.org Precedence: bulk List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On 11/11/2019 14.17, Adrian Reber wrote: > The main motivation to add set_tid to clone3() is CRIU. > > To restore a process with the same PID/TID CRIU currently uses > /proc/sys/kernel/ns_last_pid. It writes the desired (PID - 1) to > ns_last_pid and then (quickly) does a clone(). This works most of the > time, but it is racy. It is also slow as it requires multiple syscalls. > > Extending clone3() to support *set_tid makes it possible restore a > process using CRIU without accessing /proc/sys/kernel/ns_last_pid and > race free (as long as the desired PID/TID is available). > > This clone3() extension places the same restrictions (CAP_SYS_ADMIN) > on clone3() with *set_tid as they are currently in place for ns_last_pid. > > The original version of this change was using a single value for > set_tid. At the 2019 LPC, after presenting set_tid, it was, however, > decided to change set_tid to an array to enable setting the PID of a > process in multiple PID namespaces at the same time. If a process is > created in a PID namespace it is possible to influence the PID inside > and outside of the PID namespace. Details also in the corresponding > selftest. > > /* > * Verify that higher 32bits of exit_signal are unset and that > * it is a valid signal > @@ -2556,8 +2561,17 @@ noinline static int copy_clone_args_from_user(struct kernel_clone_args *kargs, > .stack = args.stack, > .stack_size = args.stack_size, > .tls = args.tls, > + .set_tid = kargs->set_tid, > + .set_tid_size = args.set_tid_size, > }; This is a bit ugly. And is it even well-defined? I mean, it's a bit similar to the "i = i++;". So it would be best to avoid. > + for (i = 0; i < args.set_tid_size; i++) { > + if (copy_from_user(&kargs->set_tid[i], > + u64_to_user_ptr(args.set_tid + (i * sizeof(args.set_tid))), > + sizeof(pid_t))) > + return -EFAULT; > + } > + If I'm reading this (and your test case) right, you expect the user pointer to point at an array of u64, and here you're copying the first half of each u64 to the pid_t array. That only works on little-endian. It seems more obvious (since I don't think there's any disagreement anywhere on sizeof(pid_t)) to expect the user pointer to point at an array of pid_t and then simply copy_from_user() the whole thing in one go. > return 0; > } > > @@ -2631,6 +2645,10 @@ SYSCALL_DEFINE2(clone3, struct clone_args __user *, uargs, size_t, size) > int err; > > struct kernel_clone_args kargs; > + pid_t set_tid[MAX_PID_NS_LEVEL]; > + > + memset(set_tid, 0, sizeof(set_tid)); > + kargs.set_tid = set_tid; Hm, isn't it a bit much to add two cachelines (and dirtying them via the memset) to the stack footprint of clone3, considering that almost nobody (relatively speaking) will use this? So how about copy_clone_args_from_user() does if (args.set_tid) { set_tid = memdup_user(u64_to_user_ptr(), ...) if (IS_ERR(set_tid)) return PTR_ERR(set_tid); kargs.set_tid = set_tid; } Then somebody needs to free that, but this is probably not the last clone extension that might need extra data, so one could do s/long _do_fork/static long __do_fork/ and then create a _do_fork that always cleans up the passed-in kargs, i.e. long _do_fork(struct kargs *args) { long ret = __do_fork(args); kfree(args->set_tid); return ret; } Rasmus