From: yangjihong
To: dwalsh@redhat.com, Stephen Smalley, Casey Schaufler, paul@paul-moore.com, eparis@parisplace.org, selinux@tycho.nsa.gov, Lukas Vrabec, Petr Lautrbach
Cc: linux-kernel@vger.kernel.org
Subject: Re: [BUG] kernel softlockup due to sidtab_search_context running for a long time because of too many sidtab context nodes
Date: Sat, 16 Dec 2017 10:28:45 +0000
Message-ID: <1BC3DBD98AD61A4A9B2569BC1C0B4437D5D6C8@DGGEMM506-MBS.china.huawei.com>
In-Reply-To: <1b8709aa-2a08-8cde-13c7-79bb93c791c6@redhat.com>

>On 12/15/2017 08:56 AM, Stephen Smalley wrote:
>> On Fri, 2017-12-15 at 03:09 +0000, yangjihong wrote:
>>> On 12/15/2017 10:31 PM, yangjihong wrote:
>>>> On 12/14/2017 12:42 PM, Casey Schaufler wrote:
>>>>> On 12/14/2017 9:15 AM, Stephen Smalley wrote:
>>>>>> On Thu, 2017-12-14 at 09:00 -0800, Casey Schaufler wrote:
>>>>>>> On 12/14/2017 8:42 AM, Stephen Smalley wrote:
>>>>>>>> On Thu, 2017-12-14 at 08:18 -0800, Casey Schaufler wrote:
>>>>>>>>> On 12/13/2017 7:18 AM, Stephen Smalley wrote:
>>>>>>>>>> On Wed, 2017-12-13 at 09:25 +0000, yangjihong wrote:
>>>>>>>>>>> Hello,
>>>>>>>>>>>
>>>>>>>>>>> I am doing stress testing on a 3.10 kernel (CentOS 7.4), constantly starting a number of docker containers with SELinux enabled, and after about 2 days the kernel panics with a softlockup:
>>>>>>>>>>>
>>>>>>>>>>> [] sched_show_task+0xb8/0x120
>>>>>>>>>>> [] show_lock_info+0x20f/0x3a0
>>>>>>>>>>> [] watchdog_timer_fn+0x1da/0x2f0
>>>>>>>>>>> [] ? watchdog_enable_all_cpus.part.4+0x40/0x40
>>>>>>>>>>> [] __hrtimer_run_queues+0xd2/0x260
>>>>>>>>>>> [] hrtimer_interrupt+0xb0/0x1e0
>>>>>>>>>>> [] local_apic_timer_interrupt+0x37/0x60
>>>>>>>>>>> [] smp_apic_timer_interrupt+0x50/0x140
>>>>>>>>>>> [] apic_timer_interrupt+0x6d/0x80
>>>>>>>>>>> [] ? sidtab_context_to_sid+0xb3/0x480
>>>>>>>>>>> [] ? sidtab_context_to_sid+0x110/0x480
>>>>>>>>>>> [] ? mls_setup_user_range+0x145/0x250
>>>>>>>>>>> [] security_get_user_sids+0x3f7/0x550
>>>>>>>>>>> [] sel_write_user+0x12b/0x210
>>>>>>>>>>> [] ? sel_write_member+0x200/0x200
>>>>>>>>>>> [] selinux_transaction_write+0x48/0x80
>>>>>>>>>>> [] vfs_write+0xbd/0x1e0
>>>>>>>>>>> [] SyS_write+0x7f/0xe0
>>>>>>>>>>> [] system_call_fastpath+0x16/0x1b
>>>>>>>>>>>
>>>>>>>>>>> My opinion:
>>>>>>>>>>> When a docker container starts, it mounts overlay filesystems with a different SELinux context, with mount points such as:
>>>>>>>>>>>
>>>>>>>>>>> overlay on /var/lib/docker/overlay2/be3ef517730d92fc4530e0e952eae4f6cb0f07b4bc326cb07495ca08fc9ddb66/merged type overlay (rw,relatime,context="system_u:object_r:svirt_sandbox_file_t:s0:c414,c873",lowerdir=/var/lib/docker/overlay2/l/Z4U7WY6ASNV5CFWLADPARHHWY7:/var/lib/docker/overlay2/l/V2S3HOKEFEOQLHBVAL5WLA3YLS:/var/lib/docker/overlay2/l/46YGYO474KLOULZGDSZDW2JPRI,upperdir=/var/lib/docker/overlay2/be3ef517730d92fc4530e0e952eae4f6cb0f07b4bc326cb07495ca08fc9ddb66/diff,workdir=/var/lib/docker/overlay2/be3ef517730d92fc4530e0e952eae4f6cb0f07b4bc326cb07495ca08fc9ddb66/work)
>>>>>>>>>>> shm on /var/lib/docker/containers/9fd65e177d2132011d7b422755793449c91327ca577b8f5d9d6a4adf218d4876/shm type tmpfs (rw,nosuid,nodev,noexec,relatime,context="system_u:object_r:svirt_sandbox_file_t:s0:c414,c873",size=65536k)
>>>>>>>>>>> overlay on /var/lib/docker/overlay2/38d1544d080145c7d76150530d0255991dfb7258cbca14ff6d165b94353eefab/merged type overlay (rw,relatime,context="system_u:object_r:svirt_sandbox_file_t:s0:c431,c651",lowerdir=/var/lib/docker/overlay2/l/3MQQXB4UCLFB7ANVRHPAVRCRSS:/var/lib/docker/overlay2/l/46YGYO474KLOULZGDSZDW2JPRI,upperdir=/var/lib/docker/overlay2/38d1544d080145c7d76150530d0255991dfb7258cbca14ff6d165b94353eefab/diff,workdir=/var/lib/docker/overlay2/38d1544d080145c7d76150530d0255991dfb7258cbca14ff6d165b94353eefab/work)
>>>>>>>>>>> shm on /var/lib/docker/containers/662e7f798fc08b09eae0f0f944537a4bcedc1dcf05a65866458523ffd4a71614/shm type tmpfs (rw,nosuid,nodev,noexec,relatime,context="system_u:object_r:svirt_sandbox_file_t:s0:c431,c651",size=65536k)
>>>>>>>>>>>
>>>>>>>>>>> sidtab_search_context checks whether the context is already in the sidtab list; if it is not found, a new node is generated and inserted into the list. As the number of containers grows, so does the number of context nodes. In our testing the final number of nodes reached 300,000+, and a sidtab_context_to_sid call takes 100-200ms, which leads to the system softlockup.
>>>>>>>>>>>
>>>>>>>>>>> Is this a selinux bug? When a filesystem is unmounted, why is its context node not deleted? I cannot find a function in sidtab.c that deletes nodes.
>>>>>>>>>>>
>>>>>>>>>>> Thanks for reading and looking forward to your reply.
>>>>>>>>>> So, does docker just keep allocating a unique category set for every new container, never reusing them even if the container is destroyed?
>>>>>>>>>> That would be a bug in docker IMHO. Or are you creating an unbounded number of containers and never destroying the older ones?
>>>>>>>>> You can't reuse the security context. A process in ContainerA sends a labeled packet to MachineB. ContainerA goes away and its context is recycled in ContainerC. MachineB responds some time later, again with a labeled packet. ContainerC gets information intended for ContainerA, and uses the information to take over the Elbonian government.
>>>>>>>> Docker isn't using labeled networking (nor is anything else by default; it is only enabled if explicitly configured).
>>>>>>> If labeled networking weren't an issue we'd have full security module stacking by now. Yes, it's an edge case. If you want to use labeled NFS or a local filesystem that gets mounted in each container (don't tell me that nobody would do that) you've got the same problem.
>>>>>> Even if someone were to configure labeled networking, Docker is not presently relying on that or SELinux network enforcement for any security properties, so it really doesn't matter.
>>>>> True enough. I can imagine a use case, but as you point out, it would be a very complex configuration and coordination exercise using SELinux.
>>>>>
>>>>>> And if they wanted to do that, they'd have to coordinate category assignments across all systems involved, for which no facility exists AFAIK. If you have two docker instances running on different hosts, I'd wager that they can hand out the same category sets today to different containers.
>>>>>>
>>>>>> With respect to labeled NFS, that's also not the default for nfs mounts, so again it is a custom configuration and Docker isn't relying on it for any guarantees today. For local filesystems, they would normally be context-mounted or using genfscon rather than xattrs in order to be accessible to the container, thus no persistent storage of the category sets.
>>>> Well, Kubernetes and OpenShift do set the labels to be the same within a project, and they can manage that across nodes. But yes, we are not using labeled networking at this point.
>>>>> I know that is the intended configuration, but I see people do all sorts of stoopid things for what they believe are good reasons. Unfortunately, lots of people count on containers to provide isolation, but create "solutions" for data sharing that defeat it.
>>>>>
>>>>>> Certainly docker could provide an option to not reuse category sets, but making that the default is not sane and just guarantees exhaustion of the SID and context space (just create and tear down lots of containers every day or more frequently).
>>>>> It seems that Docker might have a similar issue with UIDs, but it takes longer to run out of UIDs than sidtab entries.
>>>>>
>>>>>>>>>> On the selinux userspace side, we'd also like to eliminate the use of /sys/fs/selinux/user (sel_write_user -> security_get_user_sids) entirely, which is what triggered this for you.
>>>>>>>>>>
>>>>>>>>>> We cannot currently delete a sidtab node because we have no way of knowing if there are any lingering references to the SID.
>>>>>>>>>> Fixing that would require reference-counted SIDs, which goes beyond just SELinux since SIDs/secids are returned by LSM hooks and cached in other kernel data structures.
>>>>>>>>> You could delete a sidtab node. The code already deals with unfindable SIDs. The issue is that eventually you run out of SIDs.
>>>>>>>>> Then you are forced to recycle SIDs, which leads to the overthrow of the Elbonian government.
>>>>>>>> We don't know when we can safely delete a sidtab node since SIDs aren't reference counted and we can't know whether it is still in use somewhere in the kernel. Doing so prematurely would lead to the SID being remapped to the unlabeled context, and then likely to undesired denials.
>>>>>>> I would suggest that if you delete a sidtab node and someone comes along later and tries to use it, that denial is exactly what you would desire. I don't see any other rational action.
>>>>>> Yes, if we know that the SID wasn't in use at the time we tore it down.
>>>>>> But if we're just randomly deleting sidtab entries based on age or something (since we have no reference count), we'll almost certainly encounter situations where a SID hasn't been accessed in a long time but is still being legitimately cached somewhere. Just a file that hasn't been accessed in a while might have that SID still cached in its inode security blob, or anywhere else.
>>>>>>
>>>>>>>>>> sidtab_search_context() could no doubt be optimized for the negative case; there was an earlier optimization for the positive case by adding a cache to sidtab_context_to_sid() prior to calling it. It's a reverse lookup in the sidtab.
>>>>>>>>> This seems like a bad idea.
>>>>>>>> Not sure what you mean, but it can certainly be changed to at least use a hash table for these reverse lookups.
>>>>>>>>
>>> Thanks for the reply and discussion.
>>> I think the docker container is only one case. Is it possible that something similar, perhaps some kind of attack, could trigger a constantly growing SID list and eventually lead to a system panic?
>>>
>>> I think the issue is that it takes too long to search for a SID node when the SID list is too large. If the node's data structure (e.g. a tree structure) or the search algorithm could be optimized so that traversing the nodes stays fast even with many nodes, maybe that would solve the problem.
>>> Or sidtab.c could provide a "delete_sidtab_node" interface that deletes the SID node when a filesystem is unmounted. Once the filesystem is unmounted the SID is useless, so deleting it would keep the size of the SID list under control.
>>>
>>> Thanks for reading and looking forward to your reply.
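
(Restating my "delete on umount" suggestion above in code form, purely as a user-space illustration: none of these names exist in sidtab.c, and this is only a sketch of the per-SID reference counting that safe deletion would seem to require -- which, as the reply quoted below explains, is exactly the non-trivial part.)

/*
 * User-space sketch only, not actual sidtab code; all names here
 * (sid_entry, sid_get, sid_put) are invented for illustration.
 * The point: an entry can only be freed safely once every holder of
 * the SID has dropped its reference, so every place in the kernel
 * that stores a SID would have to take and release references.
 */
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

struct sid_entry {
        uint32_t sid;            /* SID value handed out to callers */
        char *context;           /* e.g. "...:svirt_sandbox_file_t:s0:c414,c873" */
        unsigned long refcount;  /* how many users still hold this SID */
};

static void sid_get(struct sid_entry *e)
{
        e->refcount++;                  /* a new holder (mount, inode, ...) */
}

static void sid_put(struct sid_entry *e)
{
        if (--e->refcount == 0) {       /* last holder gone: now safe to delete */
                free(e->context);
                free(e);
        }
}

int main(void)
{
        struct sid_entry *e = malloc(sizeof(*e));

        e->sid = 1234;
        e->context = strdup("system_u:object_r:svirt_sandbox_file_t:s0:c414,c873");
        e->refcount = 1;                /* reference held by the table itself */

        sid_get(e);                     /* e.g. a context mount takes a reference */
        sid_put(e);                     /* the filesystem is unmounted */
        sid_put(e);                     /* table drops its reference: entry freed */
        return 0;
}
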
>> We cannot safely delete entries in the sidtab without first adding reference counting of SIDs, which goes beyond just SELinux since they are cached in other kernel data structures and returned by LSM hooks. That's a non-trivial undertaking.
>>
>> Far more practical in the near term would be to introduce a hash table or other mechanism for efficient reverse lookups in the sidtab. Are you offering to implement that or just requesting it?

I am not very familiar with the overall architecture of SELinux, so I'm afraid I cannot offer to implement it myself, sorry. Please tell me if there is anything I can do to help. And if there is any progress (i.e. once a solution or optimization approach has been decided), could you please let me know? Thanks!

>> Independent of that, docker should support reuse of category sets when containers are deleted, at least as an option and probably as the default.
>>
>Docker does reuse categories of containers that are removed, by default.

Thanks for reading and looking forward to your reply.
Best wishes!
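
PS: To make sure I understand the proposed direction, below is a rough user-space sketch of the kind of hashed reverse lookup (context string -> SID) discussed above, instead of walking every node the way sidtab_search_context does today. All names in it (revtab, rev_hash, rev_lookup, rev_insert) are invented for illustration and are not actual kernel or sidtab code:

/*
 * User-space sketch only: map a context string to its SID in O(1)
 * average time instead of scanning the whole table.
 */
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define REVTAB_SLOTS 512

struct rev_node {
        char *context;
        uint32_t sid;
        struct rev_node *next;
};

static struct rev_node *revtab[REVTAB_SLOTS];

/* simple string hash over the context text */
static unsigned int rev_hash(const char *s)
{
        unsigned int h = 5381;

        while (*s)
                h = h * 33 + (unsigned char)*s++;
        return h % REVTAB_SLOTS;
}

/* context -> SID: the direction the current code searches linearly */
static int rev_lookup(const char *context, uint32_t *sid)
{
        struct rev_node *n = revtab[rev_hash(context)];

        for (; n; n = n->next) {
                if (!strcmp(n->context, context)) {
                        *sid = n->sid;
                        return 0;
                }
        }
        return -1;      /* not found: caller would allocate a new SID */
}

static void rev_insert(const char *context, uint32_t sid)
{
        unsigned int slot = rev_hash(context);
        struct rev_node *n = malloc(sizeof(*n));

        n->context = strdup(context);
        n->sid = sid;
        n->next = revtab[slot];
        revtab[slot] = n;
}

int main(void)
{
        uint32_t sid;

        rev_insert("system_u:object_r:svirt_sandbox_file_t:s0:c414,c873", 1001);
        rev_insert("system_u:object_r:svirt_sandbox_file_t:s0:c431,c651", 1002);

        if (!rev_lookup("system_u:object_r:svirt_sandbox_file_t:s0:c431,c651", &sid))
                printf("found sid %u\n", sid);
        return 0;
}

The real change would of course have to live in sidtab.c and follow its locking and allocation rules; this only shows the data-structure idea.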