Date: Fri, 2 Aug 2019 17:10:09 +0200
From: Adrian Reber
To: Christian Brauner
Cc: Oleg Nesterov, Eric Biederman, Pavel Emelianov, Jann Horn,
    Dmitry Safonov <0x7f454c46@gmail.com>, linux-kernel@vger.kernel.org,
    Andrei Vagin, Mike Rapoport, Radostin Stoyanov
Subject: Re: [PATCH v2 1/2] fork: extend clone3() to support CLONE_SET_TID
Message-ID: <20190802151009.GE18263@dcbz.redhat.com>
References: <20190731161223.2928-1-areber@redhat.com>
    <20190802131943.hkvcssv74j25xmmt@brauner.io>
    <20190802133001.GE20111@redhat.com>
    <20190802135050.fx3tbynztmxbmqik@brauner.io>
In-Reply-To: <20190802135050.fx3tbynztmxbmqik@brauner.io>
X-Mailing-List: linux-kernel@vger.kernel.org

On Fri, Aug 02, 2019 at 03:50:54PM +0200, Christian Brauner wrote:
> On Fri, Aug 02, 2019 at 03:30:01PM +0200, Oleg Nesterov wrote:
> > On 08/02, Christian Brauner wrote:
> > >
> > > On Wed, Jul 31, 2019 at 06:12:22PM +0200, Adrian Reber wrote:
> > > > The main motivation to add CLONE_SET_TID to clone3() is CRIU.
> > > >
> > > > To restore a process with the same PID/TID CRIU currently uses
> > > > /proc/sys/kernel/ns_last_pid. It writes the desired (PID - 1) to
> > > > ns_last_pid and then (quickly) does a clone(). This works most of the
> > > > time, but it is racy. It is also slow as it requires multiple syscalls.
> > >
> > > Can you elaborate how this is racy, please. Afaict, CRIU will always
> > > usually restore in a new pid namespace that it controls, right?
> >
> > Why? No. For example you can checkpoint (not sure this is correct word)
> > a single process in your namespace, then (try to restore) it.
> >
> > > What is
> > > the exact race?
> >
> > something else in the same namespace can fork() right after criu writes
> > the pid-for-restore into ns_last_pid.
>
> Ok, that makes sense. :)
> My CRIU userspace knowledge is sporadic, so I'm not sure how exactly it
> restores process trees in pid namespaces and what workloads this would
> especially help with.

Just what Oleg said. CRIU can restore processes in a new PID namespace or
in an existing one.
Restoring a process into an existing PID namespace carries the risk of a
PID collision, but if the PID is not yet in use there is no limitation
from CRIU's side.

Restoring into an existing PID namespace that is used by other processes
always leaves a window between the write to
/proc/sys/kernel/ns_last_pid and the clone() in which something else can
fork(), and therefore it is racy.

		Adrian
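
For illustration only, the ns_last_pid sequence discussed in this thread
can be sketched roughly as below. This is a minimal sketch, not CRIU's
actual code: the function name fork_with_pid_hint and its path parameter
are hypothetical, and os.fork() stands in for the clone() call. Writing
the real /proc/sys/kernel/ns_last_pid file needs privileges (typically
CAP_SYS_ADMIN), so the path is overridable for experimenting without
them.

```python
import os

NS_LAST_PID = "/proc/sys/kernel/ns_last_pid"


def fork_with_pid_hint(target_pid, path=NS_LAST_PID):
    """Racy sketch: write (target_pid - 1) to ns_last_pid, then fork().

    Returns (child_pid, matched). Any other fork() in the same PID
    namespace between the write and our fork() can take target_pid
    first, so matched may be False even if the write succeeded.
    """
    # Step 1: tell the kernel which PID was handed out "last".
    with open(path, "w") as f:
        f.write(str(target_pid - 1))

    # Race window: another task in this namespace may fork() here.

    # Step 2: fork and hope the allocator hands out target_pid next.
    child = os.fork()
    if child == 0:
        os._exit(0)  # the child does nothing in this sketch
    os.waitpid(child, 0)  # reap the child
    return child, child == target_pid
```

A caller has to check the second return value and retry or give up when
it is False; nothing in the two-step sequence itself can close the
window, which is the gap an explicit "set TID" request to clone3() is
meant to remove.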