Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1425381AbdD1JNn (ORCPT ); Fri, 28 Apr 2017 05:13:43 -0400 Received: from mail-eopbgr30109.outbound.protection.outlook.com ([40.107.3.109]:22342 "EHLO EUR03-AM5-obe.outbound.protection.outlook.com" rhost-flags-OK-OK-OK-FAIL) by vger.kernel.org with ESMTP id S1423593AbdD1JNX (ORCPT ); Fri, 28 Apr 2017 05:13:23 -0400 Authentication-Results: chromium.org; dkim=none (message not signed) header.d=none;chromium.org; dmarc=none action=none header.from=virtuozzo.com; Subject: Re: [PATCH v3] pid_ns: Introduce ioctl to set vector of ns_last_pid's on ns hierarhy To: "Eric W. Biederman" References: <149329634856.21195.14196911999722279118.stgit@localhost.localdomain> <87mvb16fv7.fsf@xmission.com> <12a73543-79ea-4bac-7e96-6ab237534af2@virtuozzo.com> <877f254yx0.fsf@xmission.com> CC: , , , , , , , , , , , , , , , From: Kirill Tkhai Message-ID: Date: Fri, 28 Apr 2017 12:13:14 +0300 User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:45.0) Gecko/20100101 Thunderbird/45.8.0 MIME-Version: 1.0 In-Reply-To: <877f254yx0.fsf@xmission.com> Content-Type: text/plain; charset="utf-8" Content-Transfer-Encoding: 7bit X-Originating-IP: [195.214.232.6] X-ClientProxiedBy: DB5PR04CA0025.eurprd04.prod.outlook.com (10.164.34.163) To DB6PR0802MB2279.eurprd08.prod.outlook.com (10.172.228.7) X-MS-Office365-Filtering-Correlation-Id: 9a13a192-a6ff-430a-5833-08d48e16ce76 X-Microsoft-Antispam: UriScan:;BCL:0;PCL:0;RULEID:(22001)(201703131423075)(201703031133081);SRVR:DB6PR0802MB2279; X-Microsoft-Exchange-Diagnostics: 1;DB6PR0802MB2279;3:G02cYXGcQgDJLptSqEyPok/5HJCWiC22vESDetiVLFW1XgCSxONNuxEN7KfBAtHuM+AIACD+OmfxbQDVtttKH24ccVb0EDPH5srbD9IQ4ez/evTuxSnwz8pcKVyZ8qWQQzVVHFxSmZV447EHOPBM0AQB8XdSzD/6ZVQKKLdpRNrN85mXk0vu6OKoNK8Zt4euRIi1O17omLJsCj993B+DVtiL6kQ1EQ3C19BeCBxcuJsKKaxOvjocRTCu2IyI7LwVc2tdHHID2yvInGUJs46K+tQzjTRcSq2Ke9nRqhd40yjHm0UqWhWNCRY4MGeVSbvspjyLN9ojlT3HKtZO8D99rg==;25:ji3FNsCJ+m3xKDhzBwbyTpamBM/xNUlK7lz7jso7/a04+04WokVTlm2rqflI4gJ0z/dmih2b9BV51DndnygDhh8pKJ+S9SYgCFE9ldui9zc98huDxCiCbCAtO4fga13KdwoIGEGzO9McLy9G8C/0cl/lJaC/VCtlsYcjh3HFDu08RFTyBzCrwP3qg9+ZRSp+QYOrhNO5yo8bfBq0ULr11q5oFHowMkjWpr66s7ayhIrTM8A/qj4877e72m1X4A6crrYwmFcs4eP4ZsBb9pu21HpHHXsoslhXEH6Yhhfi+ilKmu8oyUJMJzdFKP22g4UnukU9dHQhWEE7W5XvHBd/9gifYrvbRC5fBqs8F/NC7uDGypIEhwMn1pvhkTZXWMsKHGOKBa/gDayumBFmH0G8NM0WcXhkp0B6UC/vM7dOXz3ZTNAPFbB+lGicoSBkkWbM X-Microsoft-Exchange-Diagnostics: 1;DB6PR0802MB2279;31:BbmV/cGjH1tx1DTItHVN9++wjzxUfoXLYyxFhdliuavgZYSQvFinbyMGlyOg6nfMFrxgW1LMGdq4Cow95gEm5EEBWkcPp51y7uOmE9TKEknivQUtcvYdWcsDDaXOuazTuE93PHJe0I1lnLUG3kPTI6ceiHL3Sk8+j9lml1Bi27xmQHflg+Fy+XDCF4KBkX6dwCqL3+35nrAsvnZF9Li08IBLvmvc8qNq4nUhVDUH7D0=;20:drHp9GEE6a4w9eYCIPNH83CWnCKYUCDPybuYzrXwEMq5zqQkHtAlMjJZjuja6auYA4g8/rUtjcED2Hn75CosDBKFdG3FgiF50XC/JTDDBDKY9E2ii48swPQPkFAtJOWmyyXUUlgEDikexSfYT1a5/CpsCo3fB1VjebhPLv6G1rpvvlKzU8kWBri80M5xxm05GVi+bZV2b4sb2cUQjLG37qqUoRsnqVou8eXLcvJ4WC3Eee5F5+1WEczakgZNYC8UVj9NMhL+8NoX3y+H5pTwwROuegdbyo3NCIpILBG2hi2BnHXWnAcfpEHnO3rV/bliXe37AO1d0EdhdiszoeyFyum3BCZTG4/azPdnJc2vgyWDSUOSDQPo7i/FXI2+gzUk8zPC8CcBf7VERCXUDt6VGsTfB/289XmOZ86h+agKqvE= X-Microsoft-Antispam-PRVS: X-Exchange-Antispam-Report-Test: UriScan:; X-Exchange-Antispam-Report-CFA-Test: BCL:0;PCL:0;RULEID:(6040450)(601004)(2401047)(8121501046)(5005006)(10201501046)(93006095)(93001095)(3002001)(6041248)(20161123555025)(20161123560025)(20161123564025)(20161123562025)(201703131423075)(201702281528075)(201703061421075)(6072148);SRVR:DB6PR0802MB2279;BCL:0;PCL:0;RULEID:;SRVR:DB6PR0802MB2279; X-Microsoft-Exchange-Diagnostics: 1;DB6PR0802MB2279;4:+hgf2z9tTVCUhdFnh/OPUiFsUWpyuGf1cnOaoyWxNQ3t6hVj1Y59AJdK4NFuXqozFT3RtIXbmwoTLTxb79GkisV/YsSOnGrHrC//tDobdPKDPpYgUGZr3JJ8pXzNIO/hkN74b53otKPqqsAde1Pk7oI6g94KZHu6ImHXd3ryH+jxyZGgMhynk2Anv1dE2qgQoRzLYsm3z+Tu0d1ElPbQnELXoajNIRAicPwfIDrmaLavyqfEigjsrUtsPSYpFEv8g/ILc3VMAGNpV/Rt1rePStrRNoqF0MqvPTa/qYq8HoU3+qVQkmTLAOgjGyZEcBIozJxQOOwu1X/bIx8b/CcxDT6k9Ye3NLAc5MTnYAWQvtABnszSUdgiMUY5e1usqpXRrt51b6klTMQs+cZjAx+zxc5CzLbF5EInrkHbPcRr/P1e3krtQ62tC+7JkIcK9i8wDl+rgREBBqXBAAFSSlzITBjkK+K1gFaTr+Xlp/at2ahVQwE0Ok35Zx1gL9QBgNxrlfOi3Fzp8UN/uGCzIVfPF7ml37riCztabHsdFKETC0/3Yk4mosC/EDqCq6w6ASQUZOqEnaXUcUUbLh0jdbpxAiMaBZWY7P/3lBHcFgWcZzAXicWlyTRp1XBLCIKegC5bWylgnXbxY/tT6sAWuRNM/e89Ghci+9Lme/J557GumsES+eqbTkomEX3qwcj0TVHmOcivtYfhlThyegFc+qus/A== X-Forefront-PRVS: 029174C036 X-Forefront-Antispam-Report: SFV:NSPM;SFS:(10019020)(4630300001)(6049001)(6009001)(39410400002)(39400400002)(39450400003)(39840400002)(24454002)(6916009)(90366009)(36756003)(8676002)(229853002)(77096006)(50466002)(6486002)(2950100002)(65826007)(6246003)(83506001)(54906002)(5660300001)(81166006)(23676002)(31696002)(86362001)(575784001)(53936002)(189998001)(4001350100001)(7736002)(42186005)(93886004)(305945005)(4326008)(33646002)(230700001)(3846002)(53546009)(76176999)(50986999)(66066001)(54356999)(2906002)(65956001)(110136004)(38730400002)(6116002)(25786009)(47776003);DIR:OUT;SFP:1102;SCL:1;SRVR:DB6PR0802MB2279;H:[172.16.25.137];FPR:;SPF:None;MLV:sfv;LANG:en; X-Microsoft-Exchange-Diagnostics: =?utf-8?B?MTtEQjZQUjA4MDJNQjIyNzk7MjM6Vjdaa3NMSm9yOEkvMFpZejlVREMweGo0?= =?utf-8?B?NEFGeS9kM0JVbkszM0UwZmFleGxjekJLaVZCd2FLeDFTN0Z4ak5yT1JHcW1w?= =?utf-8?B?WTF2a0RUSFRMUGRVMGdmQnpKQVI3UG9hQlJOcDFKZ1ZiZ0J5TnZMUFk1dUVC?= =?utf-8?B?TXBLc0huSTk2d0xHd0dRbzhnRXB2bHpLSUlOaWphNVh5QUtuZ09VR1gyOTJU?= =?utf-8?B?ajgwUVRGUGRBdVo5cThSYm5vZURqUmZiaVBFZVI0bmhvb0J0cnRoTVhBbjFK?= =?utf-8?B?K0NvTmN1NU0yQkR1czd5UUJHR0kyOVo1c3NENTZRanAydVkrSTl0LytCa05U?= =?utf-8?B?Qko4T1VNQjdyQzl4MER5NDAvbE1jQzFENzZ2b0RWVkN6bjYzN1hSeFpZdnVT?= =?utf-8?B?N1ZmOUtENXgyZ2R5YWFWaWEzWUJ3YTU0ZTJGKzN0ZXZBc09oR0xZMGR6eHFJ?= =?utf-8?B?WlZod3ZEVmN2WDJ4SEVBYWZFbG1PUHZJeEN2SHZzMmwzRDlNeTk5ajZNSEoy?= =?utf-8?B?VWN0Y1FWMHcyOTM4MmpHV0lnNzNOYitVbitHbmFYbmhxeHNWYTRvUUc4MFJy?= =?utf-8?B?akIvVHdGVm41aGJhQ0JIeE1CUmU0Nm85SUFoT1gzVlVlN0pDN3UxZDBQRXEr?= =?utf-8?B?VDh4NVQ2V3F2MzBvS2VlWXVhVHE5WUx3WXNwYkxEK2NVUFBDejlhQmlJOU5X?= =?utf-8?B?Ni9HS0ROTnBvSzlqQTVjTTVOSnhPQmxNYXZ1c3FScldXZCtVdk5TdEJOMzhD?= =?utf-8?B?d3UzU0dMd3RBR29ZQWZlMlpNaS85ME9HaDdndW83NXNjdWh2bjJFcnE1eFlv?= =?utf-8?B?UWkzaGlNVjF2dmxyM3hhV01XWWV0SVBVTDJPaU5XcFVOOUZOSndGR0FLblNs?= =?utf-8?B?WHVaZ2ZtWmplMElNWmRjMEJyMTA2a1BmM0tTOElpcjFDSWRScjZyVDNOMzFx?= =?utf-8?B?Lzgxa0hjdUxtY1JVMmw4QTVjRGFaTUQ2L0phZ3ZXZjhXYmZFRmdKdTBTQXNo?= =?utf-8?B?RTFodDVpZE9iaDJLWklNa2RrQUVOOUR3SXVlNFN6VDRxYzhCWi9EbEx4Tm9O?= =?utf-8?B?TGhTOHVmRUtwMm9yZFQrRUxkZnVWQ2d4ZEVDczJzS0J5eEdvYTJBNm9qclN5?= =?utf-8?B?Smg5SlFSblloditFdEh1V2EzUWZuajZ0czZnWHZDcE42V1FzMlpaN25rckd2?= =?utf-8?B?S1JDRmlZQThCOFM2NnpvS3NqbVVKSXUyYkY1WERJY3k1anFZR0VqczdtbHp2?= =?utf-8?B?bHlhWlZtNFR6MGRzdnVML3FyVWJDbGRrR1h1TXNWdUpZa01qVklGV3FtZ0wz?= =?utf-8?B?NTNSaUx1WG1tR1dXTFpoZEowVDNTWllRbTRQWXV6cnZNSTZteHhRa05TL082?= =?utf-8?B?K2lRMlBmb3BKRFpjaVdEOTNYNll5dWQ2TmtLZ1JScTJBdS9qbmIwcW10VllS?= =?utf-8?B?R3NSZlYxWlJEN1VSMExwaUNwaXNqeVh1cGE4UE5PMEhSRENYcm5PQnJwKysv?= =?utf-8?B?ZTF2NW1vd0tvVmZWQXpwOTE5YWdNaTZrb0xKUENOTy9tUzNyL2xDYkgzRjZO?= =?utf-8?B?MjkweHNCblFMRGFpL2s4RndmL2lGL0RSb0tza2IxZUZqY2l2RnArT0lScTBm?= =?utf-8?B?cUtkbDc5VlZ2bnpaakhoREhmUDZWQlI0aHRZTENQaysvZ21MYVg0a1djUTMv?= =?utf-8?Q?quZtOUVLDIMf9wXYGayI=3D?= X-Microsoft-Exchange-Diagnostics: 1;DB6PR0802MB2279;6:QTBkPSKRWpbhKb2eT/ayaO81yxvKckNTnKXlxOLX736hLxyEoUIxeGoUHGq+IyLx3IduKPspeobx9y50Dm2Oz6rT3Mg3DEwrjW1Ynzv6XSNggadRA9t4VtL4+fuGKHmW/IYeUbOCCyRBWjyrAujLtFmU2f3hhfmGJdgutD3Euf80ykUK7Jq6GWKRi+u/zfAOn8w5Ai7VdqhYG36kQ6xOy1SncwfWpxrgpVFuRtJrjspwdPaVMnfMumQO9D7RqLPIwYFsqjQnK/At5UZJN6NJuIUwkB/fxaqHcFKOMz7iviByYsj7wOFXLFrNAf8Zytv4qZVzw0hKJw32ZiujZbrMlPCWIaX+aOhSOQtVVX/ILeZlJ/CYtl0uftqKpiN7qoOTxOth7AoYx2qGwWH3TRNfm3o9WmL7eB3rXC4pbS4rcYp2R29stVbEiTeLsBiBmV5po0LRWIRlGa54sEmWwcTX4rOBd+EbCF5IjKtUr+X6D4l7zZoPxNSACOdwYWkwCBm9jvzltRkw1ZWgivxvrg8ZiQ==;5:sARWii+r84nxhmgdkVifF04h7CjZttZj+fZHGl+71WfUO+1NZmlHHHdqRumZRbn/GK+E3CJVZ/XlD9TUzvns+vkDWpXlpk5mWW36+7sYD4DSxnFh23zet3QdkpSTRUjeCjf3d3Vy3T/TyJFs3DYthg==;24:x58zco3qXiHhIOc94kHRONhRAbtf0DGSwaexJ0oNmEdiBNT8W6sAih4gq8os/lOnXm/OdzRn9HJnSb6OB/gPqNMLuCCHccmnU4w1emq2chI= SpamDiagnosticOutput: 1:99 SpamDiagnosticMetadata: NSPM X-Microsoft-Exchange-Diagnostics: 1;DB6PR0802MB2279;7:qzYQlE+oS6xGVg8dGzl0cLM75KiD+iP0NgbUxezTaNSOnJE4Ra6NfWoInCdgeRwejuKgM7JX3JOh4tps1KHepgfOxUM3WLyTih0rJaITPGzM8Gq3Ed4yMjbBAjJnLYZ3Q2JH/UqJQqt0S285gA5xnTyJGxoUMilYrntyIb3dC8grykVpW2u6Qq0a+3/XtIMeAkQ2fl+KtqVWXpWAOtOjbo7EKKaudA1qUSp6xaAy66N4KSSvC13SEftrhaviSIkjUKxGRF4P9f60+ou2s3yYZbhkOw5JPwaTVGf9wSNan6Qnm4vtZsPT2kuVcHUhRY5e/k6tfRKcslfJwRqiZV3qWA== X-OriginatorOrg: virtuozzo.com X-MS-Exchange-CrossTenant-OriginalArrivalTime: 28 Apr 2017 09:13:15.7759 (UTC) X-MS-Exchange-CrossTenant-FromEntityHeader: Hosted X-MS-Exchange-Transport-CrossTenantHeadersStamped: DB6PR0802MB2279 Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org Content-Length: 10120 Lines: 280 On 27.04.2017 19:07, Eric W. Biederman wrote: > Kirill Tkhai writes: > >> On 27.04.2017 18:15, Eric W. Biederman wrote: >>> Kirill Tkhai writes: >>> >>>> On implementing of nested pid namespaces support in CRIU >>>> (checkpoint-restore in userspace tool) we run into >>>> the situation, that it's impossible to create a task with >>>> specific NSpid effectively. After commit 49f4d8b93ccf >>>> "pidns: Capture the user namespace and filter ns_last_pid" >>>> it is impossible to set ns_last_pid on any pid namespace, >>>> except task's active pid_ns (before the commit it was possible >>>> to write to pid_ns_for_children). Thus, if a restored task >>>> in a container has more than one pid_ns levels, the restorer >>>> code must have a task helper for every pid namespace >>>> of the task's pid_ns hierarhy. >>>> >>>> This is a big problem, because of communication with >>>> a helper for every pid_ns in the hierarchy is not cheap. >>>> It's not performance-good as it implies many helpers wakeups >>>> to create a single task (independently, how you communicate >>>> with the helpers). This patch tries to decide the problem. >>> >>> I see the problem and we definitely need to do something. >>> Your patch does appear better than what we have been doing. >>> So a tenative conceptual ack. >>> >>> At the same time it is legitimate to claim that the use of >>> task_active_pid_ns(current) rather than >>> current->nsproxy->pid_ns_for_children is a regression caused by the >>> above commit. So we can fix the original issue as well. >>> >>> I do have to ask when was this problem discovered, and why did it take >>> so long to discover? The regeression happened nearly 5 years ago. >>> >>> Was CRIU already using this? >> >> CRIU uses ns_last_pid, but we never had nested pid namespace hierarchy. >> When there is only one level of pid namespaces, then active pid namespace >> is the save as pid_ns_for_children, so we never faced with this >> problem. > > Ok. So not a regression then. > >> Now we're working on Docker support, and its recent versions create nested >> pid namespaces (I have no information, when they begun to do that). So, >> we met this problem. >> >>> It looks like the fix is a one line low danger change to >>> /proc/sys/kernel/ns_last_pid. With a low danger as pid_ns_for_children >>> rarely differs from task_active_pid_ns(). >>> >>> >>>> It introduces a new pid_ns ioctl(NS_SET_LAST_PID_VEC), >>>> which allows to write a vector of last pids on pid_ns hierarchy. >>>> The vector is passed as array of pids in struct ns_ioc_pid_vec, >>>> written in reverse order. The first number corresponds to >>>> the opened namespace ns_last_pid, the second is to its parent, etc. >>>> So, if you have the pid namespaces hierarchy like: >>>> >>>> pid_ns1 (grand father) >>>> | >>>> v >>>> pid_ns2 (father) >>>> | >>>> v >>>> pid_ns3 (child) >>>> >>>> and the pid_ns3 is open, then the corresponding vector will be >>>> {last_ns_pid3, last_ns_pid2, last_ns_pid1}. This vector may be >>>> short and it may contain less levels. For example, >>>> {last_ns_pid3, last_ns_pid2} or even {last_ns_pid3}, in dependence >>>> of which levels you want to populate. >>>> >>>> v3: Use __u32 in uapi instead of unsigned int. >>>> >>>> v2: Kill pid_ns->child_reaper check as it's impossible to have >>>> such a pid namespace file open. >>>> Use generic namespaces ioctl() number. >>>> Pass pids as array, not as a string. >>>> >>>> Signed-off-by: Kirill Tkhai >>>> --- >>>> fs/nsfs.c | 5 +++++ >>>> include/linux/pid_namespace.h | 12 ++++++++++++ >>>> include/uapi/linux/nsfs.h | 7 +++++++ >>>> kernel/pid_namespace.c | 35 +++++++++++++++++++++++++++++++++++ >>>> 4 files changed, 59 insertions(+) >>>> >>>> diff --git a/fs/nsfs.c b/fs/nsfs.c >>>> index 323f492e0822..f669a1552003 100644 >>>> --- a/fs/nsfs.c >>>> +++ b/fs/nsfs.c >>>> @@ -6,6 +6,7 @@ >>>> #include >>>> #include >>>> #include >>>> +#include >>>> #include >>>> #include >>>> >>>> @@ -186,6 +187,10 @@ static long ns_ioctl(struct file *filp, unsigned int ioctl, >>>> argp = (uid_t __user *) arg; >>>> uid = from_kuid_munged(current_user_ns(), user_ns->owner); >>>> return put_user(uid, argp); >>>> + case NS_SET_LAST_PID_VEC: >>>> + if (ns->ops->type != CLONE_NEWPID) >>>> + return -EINVAL; >>>> + return pidns_set_last_pid_vec(ns, (void *)arg); >>>> default: >>>> return -ENOTTY; >>>> } >>>> diff --git a/include/linux/pid_namespace.h b/include/linux/pid_namespace.h >>>> index c2a989dee876..c8dc4173a4e8 100644 >>>> --- a/include/linux/pid_namespace.h >>>> +++ b/include/linux/pid_namespace.h >>>> @@ -9,6 +9,7 @@ >>>> #include >>>> #include >>>> #include >>>> +#include >>> >>> No need for the extra include and slowing down the build. Just >>> declare the relevant structures. >> >> So, I'll write just: >> >> struct ns_ioc_pid_vec; >> >> instead of that. >> >>>> >>>> struct pidmap { >>>> atomic_t nr_free; >>>> @@ -103,6 +104,17 @@ static inline int reboot_pid_ns(struct pid_namespace *pid_ns, int cmd) >>>> } >>>> #endif /* CONFIG_PID_NS */ >>>> >>>> +#if defined(CONFIG_PID_NS) && defined(CONFIG_CHECKPOINT_RESTORE) >>>> +extern long pidns_set_last_pid_vec(struct ns_common *ns, >>>> + struct ns_ioc_pid_vec __user *vec); >>>> +#else /* CONFIG_PID_NS && CONFIG_CHECKPOINT_RESTORE */ >>>> +static inline long pidns_set_last_pid_vec(struct ns_common *ns, >>>> + struct ns_ioc_pid_vec __user *vec) >>>> +{ >>>> + return -ENOTTY; >>>> +} >>>> +#endif /* CONFIG_PID_NS && CONFIG_CHECKPOINT_RESTORE */ >>> >>> Just CONFIG_PID_NS please. Either this is good enough for everyone who >>> has pid namespace support enabled or it isn't. >> >> Great, if it's so. For me it looks better too. >> >>>> extern struct pid_namespace *task_active_pid_ns(struct task_struct *tsk); >>>> void pidhash_init(void); >>>> void pidmap_init(void); >>>> diff --git a/include/uapi/linux/nsfs.h b/include/uapi/linux/nsfs.h >>>> index 1a3ca79f466b..1254b02a47fa 100644 >>>> --- a/include/uapi/linux/nsfs.h >>>> +++ b/include/uapi/linux/nsfs.h >>>> @@ -14,5 +14,12 @@ >>>> #define NS_GET_NSTYPE _IO(NSIO, 0x3) >>>> /* Get owner UID (in the caller's user namespace) for a user namespace */ >>>> #define NS_GET_OWNER_UID _IO(NSIO, 0x4) >>>> +/* Set a vector of ns_last_pid for a pid namespace stack */ >>>> +#define NS_SET_LAST_PID_VEC _IO(NSIO, 0x5) >>>> + >>>> +struct ns_ioc_pid_vec { >>>> + __u32 nr; >>>> + pid_t pid[0]; >>>> +}; >>>> >>>> #endif /* __LINUX_NSFS_H */ >>>> diff --git a/kernel/pid_namespace.c b/kernel/pid_namespace.c >>>> index de461aa0bf9a..08b5fef23534 100644 >>>> --- a/kernel/pid_namespace.c >>>> +++ b/kernel/pid_namespace.c >>>> @@ -21,6 +21,7 @@ >>>> #include >>>> #include >>>> #include >>>> +#include >>>> >>>> struct pid_cache { >>>> int nr_ids; >>>> @@ -428,6 +429,40 @@ static struct ns_common *pidns_get_parent(struct ns_common *ns) >>>> return &get_pid_ns(pid_ns)->ns; >>>> } >>>> >>>> +#ifdef CONFIG_CHECKPOINT_RESTORE >>>> +long pidns_set_last_pid_vec(struct ns_common *ns, >>>> + struct ns_ioc_pid_vec __user *vec) >>>> +{ >>>> + struct pid_namespace *pid_ns = to_pid_ns(ns); >>>> + pid_t pid, __user *pid_ptr; >>>> + u32 nr; >>>> + >>>> + if (get_user(nr, &vec->nr)) >>>> + return -EFAULT; >>>> + if (nr > 32 || nr < 1) >>> >>> The maximum needs not to be hard coded. >> >> Ah, I've missed MAX_PID_NS_LEVEL, thanks. >> >>>> + return -EINVAL; >>>> + >>>> + pid_ptr = &vec->pid[0]; >>> >>> All of the rest of the vector needs to be read in, in one go. >> >> Hm, Oleg said we shouldn't allocate a memory for that. Should >> I create array of MAX_PID_NS_LEVEL pids on stack? > > *scratches head* > > The really important part is that we perform all of the permission > checks before we perform the rest of the work. > > I would like to be able to say that the permission checks and > the rest of it all happen atomically. Which requires copying the > data into kernel memory and sanitizing it (aka all checks) before > we apply the changes. This way, we better check the permissions on the top pid namespace of the passed vector, because every children's pid_ns->user_ns is the same as its parent's, or it's descendant. > I seem to remember Oleg's primary concern was using vmalloc > and not kmalloc. Using the stack is fine but we need a It's a bit not what he said, but I do not insist. > "BUILD_BUG_ON(sizeof(u32) * MAX_PID_NS_LEVEL < 64);" if we are What does this check mean? Why do we have to limit minimal MAX_PID_NS_LEVEL? > going to do that so we strictly bound the amount of local stack > we are going to use. > > The keep it simple part of me likes the loop copying each value at a > time. The rest of my cringes because that introduces time of use to > time of check issues. Especially if we ever want range checks on > the value of last_pid (which I think we do) we need to be able to > read all of the values in before applying them. > > Eric > > >>>> + do { >>>> + if (!ns_capable(pid_ns->user_ns, CAP_SYS_ADMIN)) >>>> + return -EPERM; >>> >>> Permission to change all of the namespaces should be checked before >>> writing to pid_ns->last_pid happens. >> >> Ok >> >>>> + >>>> + if (get_user(pid, pid_ptr)) >>>> + return -EFAULT; >>>> + if (pid < 0 || pid > pid_max) >>>> + return -EINVAL; >>>> + >>>> + /* Write directly: see the comment in pid_ns_ctl_handler() */ >>>> + pid_ns->last_pid = pid; >>>> + >>>> + pid_ns = pid_ns->parent; >>>> + pid_ptr++; >>>> + } while (--nr > 0 && pid_ns); >>>> + >>>> + return nr ? -EINVAL : 0; >>>> +} >>>> +#endif /* CONFIG_CHECKPOINT_RESTORE */ >>>> + >>>> static struct user_namespace *pidns_owner(struct ns_common *ns) >>>> { >>>> return to_pid_ns(ns)->user_ns;