Message-ID: <60f6f1f0-4918-5fea-9827-9bf9d1e496e3@linux.alibaba.com>
Date: Tue, 23 May 2023 11:05:38 +0800
Subject: Re: [PATCH 0/2] capability: Introduce CAP_BLOCK_ADMIN
To: Casey Schaufler, Serge Hallyn, Paul Moore, Stephen Smalley, Eric Paris, Frederick Lawler, Jens Axboe, Joseph Qi, linux-security-module@vger.kernel.org,
        selinux@vger.kernel.org, linux-block@vger.kernel.org,
        linux-kernel@vger.kernel.org, louxiao.lx@alibaba-inc.com
References: <20230511070520.72939-1-tianjia.zhang@linux.alibaba.com> <345a7cdc-e55b-7aaa-43d4-59b3f911ef18@linux.alibaba.com>
From: Tianjia Zhang
X-Mailing-List: linux-kernel@vger.kernel.org

On 5/23/23 3:13 AM, Casey Schaufler wrote:
> On 5/21/2023 7:53 PM, Tianjia Zhang wrote:
>> Hi Casey,
>>
>> On 5/18/23 8:01 AM, Casey Schaufler wrote:
>>> On 5/16/2023 5:05 AM, Tianjia Zhang wrote:
>>>> Hi Casey,
>>>>
>>>> On 5/12/23 12:17 AM, Casey Schaufler wrote:
>>>>> On 5/11/2023 12:05 AM, Tianjia Zhang wrote:
>>>>>> Separated fine-grained capability CAP_BLOCK_ADMIN from CAP_SYS_ADMIN.
>>>>>> For backward compatibility, the CAP_BLOCK_ADMIN capability is included
>>>>>> within CAP_SYS_ADMIN.
>>>>>>
>>>>>> Some database products rely on shared storage to complete the
>>>>>> write-once-read-multiple and write-multiple-read-multiple functions.
>>>>>> When HA occurs, they rely on the PR (Persistent Reservations) protocol
>>>>>> provided by the storage layer to manage block device permissions to
>>>>>> ensure data correctness.
>>>>>>
>>>>>> CAP_SYS_ADMIN is required in the PR protocol implementation of existing
>>>>>> block devices in the Linux kernel, which has too many sensitive
>>>>>> permissions, which may lead to risks such as container escape.
>>>>>> The kernel needs to provide more fine-grained permission management like
>>>>>> CAP_NET_ADMIN to avoid online products directly relying on root to run.
>>>>>>
>>>>>> CAP_BLOCK_ADMIN can also provide support for other block device
>>>>>> operations that require CAP_SYS_ADMIN capabilities in the future,
>>>>>> ensuring that applications run with least privilege.
>>>>>
>>>>> Can you demonstrate that there are cases where a program that needs
>>>>> CAP_BLOCK_ADMIN does not also require CAP_SYS_ADMIN for other
>>>>> operations?
>>>>> How much of what's allowed by CAP_SYS_ADMIN would be allowed by
>>>>> CAP_BLOCK_ADMIN? If use of a new capability is rare it's difficult to
>>>>> justify.
>>>>>
>>>>
>>>> For the previous non-container scenarios, the block device is a shared
>>>> device, because the business-system generally operates the file system
>>>> on the block. Therefore, directly operating the block device has a high
>>>> probability of affecting other processes on the same host, and it is a
>>>> reasonable requirement to need the CAP_SYS_ADMIN capability.
>>>>
>>>> But for a database running in a container scenario, especially a
>>>> container scenario on the cloud, it is likely that a container
>>>> exclusively occupies a block device. That is to say, for a container,
>>>> its access to the block device will not affect other process, there is
>>>> no need to obtain a higher CAP_SYS_ADMIN capability.
>>>
>>> If I understand correctly, you're saying that the process that requires
>>> CAP_BLOCK_ADMIN in the container won't also require CAP_SYS_ADMIN for
>>> other operations.
>>>
>>> That's good, but it isn't clear how a process on bare metal would
>>> require CAP_SYS_ADMIN while the same process in a container wouldn't.
>>>
>>>>
>>>> For a file system similar to distributed write-once-read-many, it is
>>>> necessary to ensure the correctness of recovery, then when recovery
>>>> occurs, it is necessary to ensure that no inflighting-io is completed
>>>> after recovery.
>>>>
>>>> This can be guaranteed by performing operations such as SCSI/NVME
>>>> Persistent Reservations on block devices on the distributed file system.
>>>
>>> Does your cloud based system always run "real" devices? My
>>> understanding is that cloud based deployment usually uses
>>> virtual machines and virtio or other simulated devices.
>>> A container deployment in the cloud seems unlikely to be able
>>> to take advantage of block administration. But I can't say
>>> I know the specifics of your environment.
>>>
>>>> Therefore, at present, it is only necessary to have the relevant
>>>> permission support of the control command of such container-exclusive
>>>> block devices.
>>>
>>> This looks like an extremely special case in which breaking out
>>> block management would make sense.
>>>
>> Our scenario is like this. In simply terms, a distributed database has
>> a read-write instance and one or more read-only instances. Each instance
>> runs in an isolated container. All containers share the same block
>> device.
>>
>> In addition to the database instance, there is also a control program
>> running on the control plane in the container. The database ensures
>> the correctness of the data through the PR (Persistent Reservations)
>> of the block device. This operation is also the only operation in the
>> container that requires CAP_SYS_ADMIN privileges.
>>
>> This system as a whole, whether it is running on VM or bare metal, the
>> difference is not big.
>>
>> In order to support the PR of block devices, we need to grant
>> CAP_SYS_ADMIN permissions to the container, which not only greatly
>> increases the risk of container escape, but also makes us have to
>> carefully configure the permissions of the container. Many container
>> escapes that have occurred are also caused by these reasons.
>>
>> This is essentially a problem of permission isolation. We hope to
>> share the smallest possible permissions from CAP_SYS_ADMIN to support
>> necessary operations, and avoid providing CAP_SYS_ADMIN permissions
>> to containers as much as possible.
>
> Your use case is interesting, but not compelling. While you may have
> come up with a specific case where you can completely break CAP_BLOCK_ADMIN
> out from CAP_SYS_ADMIN, it's hardly general.
>

That is a pity. Thanks for your reply; we will try to provide support
through our own patches first.

Kind regards,
Tianjia