Subject: Re: [RFC][PATCH] ipc: Remove IPCMNI
To: "Eric W. Biederman"
Cc: "Luis R. Rodriguez", Kees Cook, linux-kernel@vger.kernel.org,
 linux-fsdevel@vger.kernel.org, Andrew Morton, Al Viro, Matthew Wilcox,
 Stanislav Kinsbursky, Linux Containers
References: <1520885744-1546-1-git-send-email-longman@redhat.com>
 <1520885744-1546-5-git-send-email-longman@redhat.com>
 <87woyfyh57.fsf@xmission.com>
 <5d4a858a-3136-5ef4-76fe-a61e7f2aed56@redhat.com>
 <87o9jru3bf.fsf@xmission.com>
 <935a7c50-50cc-2dc0-33bb-92c000d039bc@redhat.com>
 <87woyego2u.fsf_-_@xmission.com>
 <047c6ed6-6581-b543-ba3d-cadc543d3d25@redhat.com>
 <87h8ph6u67.fsf@xmission.com>
From: Waiman Long
Organization: Red Hat
Message-ID: <7d3a1f93-f8e5-5325-f9a7-0079f7777b6f@redhat.com>
Date: Thu, 15 Mar 2018 17:46:21 -0400
In-Reply-To: <87h8ph6u67.fsf@xmission.com>
X-Mailing-List: linux-kernel@vger.kernel.org

On 03/15/2018 03:00 PM, Eric W. Biederman wrote:
> Waiman Long writes:
>
>> On 03/14/2018 08:49 PM, Eric W. Biederman wrote:
>>> The define IPCMNI was originally the size of a statically sized array in
>>> the kernel and that has long since been removed.  Therefore there is no
>>> fundamental reason for IPCMNI.
>>>
>>> The only remaining use IPCMNI serves is as a convoluted way to format
>>> the ipc id to userspace.
>>> It does not appear that anything except for
>>> the CHECKPOINT_RESTORE code even cares about this variety of assignment
>>> and the CHECKPOINT_RESTORE code only cares about this weirdness because
>>> it has to restore these peculiar ids.
>>>
>>> Therefore make the assignment of ipc ids match the description in
>>> Advanced Programming in the Unix Environment and assign the next id
>>> until INT_MAX is hit then loop around to the lower ids.
>>>
>>> This can be implemented trivially with the current code using idr_alloc_cyclic.
>>>
>>> To make it possible to keep checkpoint/restore working I have renamed
>>> the sysctls from xxx_next_id to xxx_nextid.  That is enough change that
>>> a smart CRIU implementation can see that what is exported has changed,
>>> and act accordingly.  New kernels will be able to restore the old ids.
>>>
>>> This code still needs some real world testing to verify my assumptions.
>>> And some work with the CRIU implementations to actually add the code
>>> that deals with the new form of id assignment.
>>>
>>> Updates: 03f595668017 ("ipc: add sysctl to specify desired next object id")
>>> Signed-off-by: "Eric W. Biederman"
>>> ---
>>>
>>> Waiman please take a look at this and run it through some tests etc,
>>> I am pretty certain something like this patch is all you need to do
>>> to sort out ipc assignment.  Not messing with sysctls needed.
>>>
>>>  include/linux/ipc.h           |  2 --
>>>  include/linux/ipc_namespace.h |  1 -
>>>  ipc/ipc_sysctl.c              |  6 ++--
>>>  ipc/namespace.c               | 11 ++----
>>>  ipc/util.c                    | 80 ++++++++++---------------------------------
>>>  ipc/util.h                    | 11 +-----
>>>  6 files changed, 25 insertions(+), 86 deletions(-)
>>>
>>> diff --git a/include/linux/ipc.h b/include/linux/ipc.h
>>> index 821b2f260992..6cc2df7f7ac9 100644
>>> --- a/include/linux/ipc.h
>>> +++ b/include/linux/ipc.h
>>> @@ -8,8 +8,6 @@
>>>  #include
>>>  #include
>>>
>>> -#define IPCMNI 32768  /* <= MAX_INT limit for ipc arrays (including sysctl changes) */
>>> -
>>>  /* used by in-kernel data structures */
>>>  struct kern_ipc_perm {
>>>  	spinlock_t	lock;
>>> diff --git a/include/linux/ipc_namespace.h b/include/linux/ipc_namespace.h
>>> index b5630c8eb2f3..cab33b6a8236 100644
>>> --- a/include/linux/ipc_namespace.h
>>> +++ b/include/linux/ipc_namespace.h
>>> @@ -15,7 +15,6 @@ struct user_namespace;
>>>
>>>  struct ipc_ids {
>>>  	int in_use;
>>> -	unsigned short seq;
>>>  	bool tables_initialized;
>>>  	struct rw_semaphore rwsem;
>>>  	struct idr ipcs_idr;
>>> diff --git a/ipc/ipc_sysctl.c b/ipc/ipc_sysctl.c
>>> index 8ad93c29f511..a599963d58bf 100644
>>> --- a/ipc/ipc_sysctl.c
>>> +++ b/ipc/ipc_sysctl.c
>>> @@ -176,7 +176,7 @@ static struct ctl_table ipc_kern_table[] = {
>>>  	},
>>>  #ifdef CONFIG_CHECKPOINT_RESTORE
>>>  	{
>>> -		.procname	= "sem_next_id",
>>> +		.procname	= "sem_nextid",
>>>  		.data		= &init_ipc_ns.ids[IPC_SEM_IDS].next_id,
>>>  		.maxlen		= sizeof(init_ipc_ns.ids[IPC_SEM_IDS].next_id),
>>>  		.mode		= 0644,
>>> @@ -185,7 +185,7 @@ static struct ctl_table ipc_kern_table[] = {
>>>  		.extra2		= &int_max,
>>>  	},
>>>  	{
>>> -		.procname	= "msg_next_id",
>>> +		.procname	= "msg_nextid",
>>>  		.data		= &init_ipc_ns.ids[IPC_MSG_IDS].next_id,
>>>  		.maxlen		= sizeof(init_ipc_ns.ids[IPC_MSG_IDS].next_id),
>>>  		.mode		= 0644,
>>> @@ -194,7 +194,7 @@ static struct ctl_table ipc_kern_table[] = {
>>>  		.extra2		= &int_max,
>>>  	},
>>>  	{
>>> -		.procname	= "shm_next_id",
>>> +		.procname	= "shm_nextid",
>>>  		.data		= &init_ipc_ns.ids[IPC_SHM_IDS].next_id,
>>>  		.maxlen		= sizeof(init_ipc_ns.ids[IPC_SHM_IDS].next_id),
>>>  		.mode		= 0644,
>> So you are changing the names of existing sysctl parameters. Would it be
>> better to add a new sysctl to indicate that the rule has changed
>> instead?
> In practice I am replacing one set of sysctls with another, that work
> very similarly but not quite the same.  As we can't keep the existing
> semantics removing the old sysctl seems correct.  Likewise adding
> a new sysctl with slightly changed semantics seems correct.
>
> This needs an accompanying patch to CRIU to see which sysctls are
> available and to change its behavior based upon that.  The practical
> question is what makes it easiest not to confuse CRIU.
>
> Not having the sysctl should be something that CRIU detects today
> and the old versions should fail gracefully.  But testing is needed.
> Adding a new sysctl to say the behavior has changed and reusing the
> old names won't have the same effect of disabling existing versions
> of CRIU.  That is fine as long as CRIU is the only user.
>
>> I don't know the history why the id management of SysV IPC was designed
>> in such a convoluted way, but the patch does make sense to me.
> I don't have the full history and we might wind up finding more as we
> run this patch through its paces.
>
> The earliest history I know is what I read in Advanced Programming in
> the Unix Environment (which predates Linux).  It described the ipc ids
> as assigned from a counter that wraps.  I thought it worked like my patch
> implements.  On closer reading it has a counter that increases each time
> the slot is used, and then wraps.  Exactly like Linux before my patch.
> *Grrr*
>
> The existing bifurcated structure of the id is present in Linux 1.0.  At
> that time SHMMNI was 256.  SHMMNI was the size of a static array of shm
> segments.  The high 24 bits held a sequence number that was incremented
> when a segment was removed at the time.
> Presumably the upper bits were
> incremented to avoid swiftly reusing the same shm ids.
>
> Hmm.  I took a quick look at FreeBSD10 and it has the exact same split
> in the id.  So userspace may actually depend upon that split.

Backward compatibility is the part of this patch that I am most worried
about. That is also the reason I asked why the ID is generated in such a
way.

My original thinking was to have an extended mode where IPCMNI becomes
8M instead of 32k. That will reduce the sequence number from 16 bits to
8 bits. The extended mode would be enabled by, for example, a boot
option. So this will be an opt-in feature rather than the default.

>
> Which comes down to the fundamental question what depends upon what.
> How do other operating systems like Solaris handle this?

I don't know how Solaris handles this, but I know they support up to
2^24 shm segments.

>
> Does any nix flavor support more than 16 bits worth of shm segments?
>
> The API has been deprecated for the last 20 years and we are still
> keeping it alive.  Sigh.
>
> Still there is fundamentally only one thing the kernel can do if we wish
> to increase the number of shm segments.
>
> Please take my patch and test it out and see if you can find anything
> that cares about the change.  Except for needing id reuse to be
> infrequent I can not imagine that there is anything that cares.
>
> It could very reasonably be argued that when shmmni is < INT_MAX
> my patch implements a version of the existing algorithm.  As we go
> through all of the possible ids before we reuse any of them.
>
> Eric
>

Thanks for the patch. I am still thinking about the best way to handle
this.

Cheers,
Longman
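
For reference, the arithmetic behind the 16-bit versus 8-bit sequence
numbers mentioned above can be sketched in userspace C. This is only an
illustration, not kernel code: it assumes a 32-bit int and an id laid out
as id = seq * IPCMNI + index, which is my reading of the split discussed
in this thread. The helper names (build_id, id_to_idx, id_to_seq,
seq_bits) and the IPCMNI_EXTENDED constant are made up for the example.

```c
#include <assert.h>
#include <limits.h>

/* Both constants are illustrative; only 32768 appears in the quoted patch. */
#define IPCMNI_DEFAULT  32768               /* 2^15, the current IPCMNI */
#define IPCMNI_EXTENDED (8 * 1024 * 1024)   /* 2^23, hypothetical "8M" extended mode */

/* Combine a sequence number and a slot index into a user-visible id. */
static int build_id(int mni, int seq, int idx)
{
	assert(idx >= 0 && idx < mni);
	assert(seq >= 0 && seq <= INT_MAX / mni);  /* keeps the id within INT_MAX */
	return seq * mni + idx;
}

/* Recover the slot index from an id. */
static int id_to_idx(int mni, int id)
{
	return id % mni;
}

/* Recover the sequence number from an id. */
static int id_to_seq(int mni, int id)
{
	return id / mni;
}

/* Bits left for the sequence number in a non-negative 32-bit int id. */
static int seq_bits(int mni)
{
	unsigned int slots = ((unsigned int)INT_MAX + 1u) / (unsigned int)mni;
	int bits = 0;

	while (slots > 1) {
		slots >>= 1;
		bits++;
	}
	return bits;
}
```

With the default limit, build_id(IPCMNI_DEFAULT, 3, 42) gives 98346,
which splits back into index 42 and sequence 3. seq_bits() returns 16
for 32768 and 8 for 8M, which is the reduction in sequence-number space
that the extended mode would trade for more ids.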