2017-12-13 09:25:31

by Yang Jihong

Subject: [BUG]kernel softlockup due to sidtab_search_context run for long time because of too many sidtab context node

Hello,

I am doing stress testing on the 3.10 kernel (CentOS 7.4), constantly starting a number of docker containers with SELinux enabled, and after about 2 days the kernel panics with a softlockup:
<IRQ> [<ffffffff810bb778>] sched_show_task+0xb8/0x120
[<ffffffff8116133f>] show_lock_info+0x20f/0x3a0
[<ffffffff811226aa>] watchdog_timer_fn+0x1da/0x2f0
[<ffffffff811224d0>] ? watchdog_enable_all_cpus.part.4+0x40/0x40
[<ffffffff810abf82>] __hrtimer_run_queues+0xd2/0x260
[<ffffffff810ac520>] hrtimer_interrupt+0xb0/0x1e0
[<ffffffff8104a477>] local_apic_timer_interrupt+0x37/0x60
[<ffffffff8166fd90>] smp_apic_timer_interrupt+0x50/0x140
[<ffffffff8166e1dd>] apic_timer_interrupt+0x6d/0x80
<EOI> [<ffffffff812b4193>] ? sidtab_context_to_sid+0xb3/0x480
[<ffffffff812b41f0>] ? sidtab_context_to_sid+0x110/0x480
[<ffffffff812c0d15>] ? mls_setup_user_range+0x145/0x250
[<ffffffff812bd477>] security_get_user_sids+0x3f7/0x550
[<ffffffff812b1a8b>] sel_write_user+0x12b/0x210
[<ffffffff812b1960>] ? sel_write_member+0x200/0x200
[<ffffffff812b01d8>] selinux_transaction_write+0x48/0x80
[<ffffffff811f444d>] vfs_write+0xbd/0x1e0
[<ffffffff811f4eef>] SyS_write+0x7f/0xe0
[<ffffffff8166d433>] system_call_fastpath+0x16/0x1b

My opinion:
when a docker container starts, it mounts an overlay filesystem with a different SELinux context; the mount points look like:
overlay on /var/lib/docker/overlay2/be3ef517730d92fc4530e0e952eae4f6cb0f07b4bc326cb07495ca08fc9ddb66/merged type overlay (rw,relatime,context="system_u:object_r:svirt_sandbox_file_t:s0:c414,c873",lowerdir=/var/lib/docker/overlay2/l/Z4U7WY6ASNV5CFWLADPARHHWY7:/var/lib/docker/overlay2/l/V2S3HOKEFEOQLHBVAL5WLA3YLS:/var/lib/docker/overlay2/l/46YGYO474KLOULZGDSZDW2JPRI,upperdir=/var/lib/docker/overlay2/be3ef517730d92fc4530e0e952eae4f6cb0f07b4bc326cb07495ca08fc9ddb66/diff,workdir=/var/lib/docker/overlay2/be3ef517730d92fc4530e0e952eae4f6cb0f07b4bc326cb07495ca08fc9ddb66/work)
shm on /var/lib/docker/containers/9fd65e177d2132011d7b422755793449c91327ca577b8f5d9d6a4adf218d4876/shm type tmpfs (rw,nosuid,nodev,noexec,relatime,context="system_u:object_r:svirt_sandbox_file_t:s0:c414,c873",size=65536k)
overlay on /var/lib/docker/overlay2/38d1544d080145c7d76150530d0255991dfb7258cbca14ff6d165b94353eefab/merged type overlay (rw,relatime,context="system_u:object_r:svirt_sandbox_file_t:s0:c431,c651",lowerdir=/var/lib/docker/overlay2/l/3MQQXB4UCLFB7ANVRHPAVRCRSS:/var/lib/docker/overlay2/l/46YGYO474KLOULZGDSZDW2JPRI,upperdir=/var/lib/docker/overlay2/38d1544d080145c7d76150530d0255991dfb7258cbca14ff6d165b94353eefab/diff,workdir=/var/lib/docker/overlay2/38d1544d080145c7d76150530d0255991dfb7258cbca14ff6d165b94353eefab/work)
shm on /var/lib/docker/containers/662e7f798fc08b09eae0f0f944537a4bcedc1dcf05a65866458523ffd4a71614/shm type tmpfs (rw,nosuid,nodev,noexec,relatime,context="system_u:object_r:svirt_sandbox_file_t:s0:c431,c651",size=65536k)

sidtab_search_context checks whether the context is already in the sidtab list; if it is not found, a new node is generated and inserted into the list. As the number of containers increases, the number of context nodes keeps growing as well. In our test the final number of nodes reached 300,000+, and sidtab_context_to_sid took 100-200ms per call, which leads to the system softlockup.
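
To illustrate, here is a minimal userspace sketch of this kind of reverse lookup over a linked list; the structure and field names below are simplified stand-ins for illustration, not the actual sidtab.c definitions:

#include <stdlib.h>
#include <string.h>

/* Hypothetical, simplified stand-ins for the kernel structures; the
 * real sidtab stores a struct context, not a plain string. */
struct sidtab_node {
    unsigned int sid;
    char *ctx;              /* e.g. "system_u:object_r:svirt_sandbox_file_t:s0:c414,c873" */
    struct sidtab_node *next;
};

struct sidtab {
    struct sidtab_node *head;
    unsigned int nel;       /* number of nodes in the list */
    unsigned int next_sid;  /* next SID to hand out */
};

/* Reverse lookup: walk every node until the context matches.  With
 * 300,000+ nodes this is the O(n) scan described above. */
static unsigned int sidtab_search_context(struct sidtab *s, const char *ctx)
{
    struct sidtab_node *cur;

    for (cur = s->head; cur; cur = cur->next)
        if (strcmp(cur->ctx, ctx) == 0)
            return cur->sid;
    return 0;               /* not found */
}

/* Miss path: a never-seen context pays the full scan above and then
 * appends yet another node, so every new container makes the next
 * lookup a little slower. */
static unsigned int sidtab_context_to_sid(struct sidtab *s, const char *ctx)
{
    unsigned int sid = sidtab_search_context(s, ctx);
    struct sidtab_node *n;

    if (sid)
        return sid;

    n = malloc(sizeof(*n));
    if (!n)
        return 0;
    n->sid = ++s->next_sid;
    n->ctx = strdup(ctx);
    n->next = s->head;
    s->head = n;
    s->nel++;
    return n->sid;
}

Since every container brings a brand-new context, every start goes through the miss path and walks all previously allocated nodes.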

Is this an SELinux bug? When a filesystem is unmounted, why is its context node not deleted? I cannot find any function in sidtab.c that deletes a node.

Thanks for reading and looking forward to your reply.


2017-12-13 15:28:30

by Stephen Smalley

Subject: Re: [BUG]kernel softlockup due to sidtab_search_context run for long time because of too many sidtab context node

On Wed, 2017-12-13 at 09:25 +0000, yangjihong wrote:
> Hello, 
>
> I am doing stressing testing on 3.10 kernel(centos 7.4), to
> constantly starting numbers of docker ontainers with selinux enabled,
> and after about 2 days, the kernel softlockup panic:
>  <IRQ>  [<ffffffff810bb778>] sched_show_task+0xb8/0x120
>  [<ffffffff8116133f>] show_lock_info+0x20f/0x3a0
>  [<ffffffff811226aa>] watchdog_timer_fn+0x1da/0x2f0
>  [<ffffffff811224d0>] ? watchdog_enable_all_cpus.part.4+0x40/0x40
>  [<ffffffff810abf82>] __hrtimer_run_queues+0xd2/0x260
>  [<ffffffff810ac520>] hrtimer_interrupt+0xb0/0x1e0
>  [<ffffffff8104a477>] local_apic_timer_interrupt+0x37/0x60
>  [<ffffffff8166fd90>] smp_apic_timer_interrupt+0x50/0x140
>  [<ffffffff8166e1dd>] apic_timer_interrupt+0x6d/0x80
>  <EOI>  [<ffffffff812b4193>] ? sidtab_context_to_sid+0xb3/0x480
>  [<ffffffff812b41f0>] ? sidtab_context_to_sid+0x110/0x480
>  [<ffffffff812c0d15>] ? mls_setup_user_range+0x145/0x250
>  [<ffffffff812bd477>] security_get_user_sids+0x3f7/0x550
>  [<ffffffff812b1a8b>] sel_write_user+0x12b/0x210
>  [<ffffffff812b1960>] ? sel_write_member+0x200/0x200
>  [<ffffffff812b01d8>] selinux_transaction_write+0x48/0x80
>  [<ffffffff811f444d>] vfs_write+0xbd/0x1e0
>  [<ffffffff811f4eef>] SyS_write+0x7f/0xe0
>  [<ffffffff8166d433>] system_call_fastpath+0x16/0x1b
>
> My opinion:
> when the docker container starts, it would mount overlay filesystem
> with different selinux context, mount point such as: 
> overlay on
> /var/lib/docker/overlay2/be3ef517730d92fc4530e0e952eae4f6cb0f07b4bc32
> 6cb07495ca08fc9ddb66/merged type overlay
> (rw,relatime,context="system_u:object_r:svirt_sandbox_file_t:s0:c414,
> c873",lowerdir=/var/lib/docker/overlay2/l/Z4U7WY6ASNV5CFWLADPARHHWY7:
> /var/lib/docker/overlay2/l/V2S3HOKEFEOQLHBVAL5WLA3YLS:/var/lib/docker
> /overlay2/l/46YGYO474KLOULZGDSZDW2JPRI,upperdir=/var/lib/docker/overl
> ay2/be3ef517730d92fc4530e0e952eae4f6cb0f07b4bc326cb07495ca08fc9ddb66/
> diff,workdir=/var/lib/docker/overlay2/be3ef517730d92fc4530e0e952eae4f
> 6cb0f07b4bc326cb07495ca08fc9ddb66/work)
> shm on
> /var/lib/docker/containers/9fd65e177d2132011d7b422755793449c91327ca57
> 7b8f5d9d6a4adf218d4876/shm type tmpfs
> (rw,nosuid,nodev,noexec,relatime,context="system_u:object_r:svirt_san
> dbox_file_t:s0:c414,c873",size=65536k)
> overlay on
> /var/lib/docker/overlay2/38d1544d080145c7d76150530d0255991dfb7258cbca
> 14ff6d165b94353eefab/merged type overlay
> (rw,relatime,context="system_u:object_r:svirt_sandbox_file_t:s0:c431,
> c651",lowerdir=/var/lib/docker/overlay2/l/3MQQXB4UCLFB7ANVRHPAVRCRSS:
> /var/lib/docker/overlay2/l/46YGYO474KLOULZGDSZDW2JPRI,upperdir=/var/l
> ib/docker/overlay2/38d1544d080145c7d76150530d0255991dfb7258cbca14ff6d
> 165b94353eefab/diff,workdir=/var/lib/docker/overlay2/38d1544d080145c7
> d76150530d0255991dfb7258cbca14ff6d165b94353eefab/work)
> shm on
> /var/lib/docker/containers/662e7f798fc08b09eae0f0f944537a4bcedc1dcf05
> a65866458523ffd4a71614/shm type tmpfs
> (rw,nosuid,nodev,noexec,relatime,context="system_u:object_r:svirt_san
> dbox_file_t:s0:c431,c651",size=65536k)
>
> sidtab_search_context check the context whether is in the sidtab
> list, If not found, a new node is generated and insert into the list,
> As the number of containers is increasing,  context nodes are also
> more and more, we tested the final number of nodes reached 300,000 +,
> sidtab_context_to_sid runtime needs 100-200ms, which will lead to the
> system softlockup.
>
> Is this a selinux bug? When filesystem umount, why context node is
> not deleted?  I cannot find the relevant function to delete the node
> in sidtab.c
>
> Thanks for reading and looking forward to your reply.

So, does docker just keep allocating a unique category set for every
new container, never reusing them even if the container is destroyed?
That would be a bug in docker IMHO. Or are you creating an unbounded
number of containers and never destroying the older ones?

On the selinux userspace side, we'd also like to eliminate the use of
/sys/fs/selinux/user (sel_write_user -> security_get_user_sids)
entirely, which is what triggered this for you.

We cannot currently delete a sidtab node because we have no way of
knowing if there are any lingering references to the SID. Fixing that
would require reference-counted SIDs, which goes beyond just SELinux
since SIDs/secids are returned by LSM hooks and cached in other kernel
data structures.

sidtab_search_context() could no doubt be optimized for the negative
case; there was an earlier optimization for the positive case by adding
a cache to sidtab_context_to_sid() prior to calling it. It's a reverse
lookup in the sidtab.
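
For illustration only, a minimal userspace sketch of such a small most-recently-used cache sitting in front of the full list walk; the types and the SIDTAB_CACHE_LEN value are simplified stand-ins, not the actual sidtab.c code:

#include <string.h>

/* Simplified node type, as in the sketch in the original report. */
struct sidtab_node {
    unsigned int sid;
    char *ctx;
    struct sidtab_node *next;
};

#define SIDTAB_CACHE_LEN 3      /* illustrative size */

struct sidtab {
    struct sidtab_node *head;
    struct sidtab_node *cache[SIDTAB_CACHE_LEN]; /* most recently used first */
};

/* Check the small MRU cache before falling back to the full list walk. */
static unsigned int sidtab_search_cached(struct sidtab *s, const char *ctx)
{
    int i;

    for (i = 0; i < SIDTAB_CACHE_LEN; i++) {
        struct sidtab_node *n = s->cache[i];

        if (n && strcmp(n->ctx, ctx) == 0) {
            /* promote the hit to the front of the cache */
            memmove(&s->cache[1], &s->cache[0],
                    i * sizeof(s->cache[0]));
            s->cache[0] = n;
            return n->sid;
        }
    }
    return 0;   /* cache miss: caller falls back to the linear scan */
}

A brand-new context always misses a cache like this, so the negative case still pays the full scan; that is the part that would need a different data structure.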

2017-12-14 03:19:26

by Yang Jihong

Subject: Re: [BUG]kernel softlockup due to sidtab_search_context run for long time because of too many sidtab context node

Hello,

> So, does docker just keep allocating a unique category set for every new container, never reusing them even if the container is destroyed?
> That would be a bug in docker IMHO. Or are you creating an unbounded number of containers and never destroying the older ones?
I create a container, then destroy it, then create a second one and destroy it, and so on.
When a container is created, docker mounts an overlay fs; because every container has a different SELinux context, a new sidtab node is generated and inserted into the sidtab list.
When the container is destroyed, docker unmounts the overlay fs, but the umount operation does not appear to be hooked up to any "delete the node" function, so the sidtab list grows longer and longer.
I think that after the umount its SELinux context will never be reused, so the sidtab node is useless and it would be best to delete it.


> sidtab_search_context() could no doubt be optimized for the negative case; there was an earlier optimization for the positive case by adding a cache to sidtab_context_to_sid() prior to calling it. It's a reverse lookup in the sidtab.
I think adding a cache may not be very useful, because every container has a different SELinux context, so when a new container is created the whole sidtab list is searched, all the way to the last node: a new context always has to be compared against every existing node before it can be inserted.
As long as nodes are never deleted from the list, the list keeps growing and the search takes longer and longer, eventually leading to the softlockup.


Is there any solution to this problem?
Thanks for reading and looking forward to your reply.

Best wishes!

-----Original Message-----
From: Stephen Smalley [mailto:[email protected]]
Sent: 13 December 2017 23:18
To: yangjihong <[email protected]>; [email protected]; [email protected]; [email protected]; Daniel J Walsh <[email protected]>; Lukas Vrabec <[email protected]>; Petr Lautrbach <[email protected]>
Cc: [email protected]
Subject: Re: [BUG]kernel softlockup due to sidtab_search_context run for long time because of too many sidtab context node

On Wed, 2017-12-13 at 09:25 +0000, yangjihong wrote:
> Hello,
>
> I am doing stressing testing on 3.10 kernel(centos 7.4), to constantly
> starting numbers of docker ontainers with selinux enabled, and after
> about 2 days, the kernel softlockup panic:
>  <IRQ>  [<ffffffff810bb778>] sched_show_task+0xb8/0x120
>  [<ffffffff8116133f>] show_lock_info+0x20f/0x3a0
>  [<ffffffff811226aa>] watchdog_timer_fn+0x1da/0x2f0
>  [<ffffffff811224d0>] ? watchdog_enable_all_cpus.part.4+0x40/0x40
>  [<ffffffff810abf82>] __hrtimer_run_queues+0xd2/0x260
>  [<ffffffff810ac520>] hrtimer_interrupt+0xb0/0x1e0
>  [<ffffffff8104a477>] local_apic_timer_interrupt+0x37/0x60
>  [<ffffffff8166fd90>] smp_apic_timer_interrupt+0x50/0x140
>  [<ffffffff8166e1dd>] apic_timer_interrupt+0x6d/0x80
>  <EOI>  [<ffffffff812b4193>] ? sidtab_context_to_sid+0xb3/0x480
>  [<ffffffff812b41f0>] ? sidtab_context_to_sid+0x110/0x480
>  [<ffffffff812c0d15>] ? mls_setup_user_range+0x145/0x250
>  [<ffffffff812bd477>] security_get_user_sids+0x3f7/0x550
>  [<ffffffff812b1a8b>] sel_write_user+0x12b/0x210
>  [<ffffffff812b1960>] ? sel_write_member+0x200/0x200
>  [<ffffffff812b01d8>] selinux_transaction_write+0x48/0x80
>  [<ffffffff811f444d>] vfs_write+0xbd/0x1e0
>  [<ffffffff811f4eef>] SyS_write+0x7f/0xe0
>  [<ffffffff8166d433>] system_call_fastpath+0x16/0x1b
>
> My opinion:
> when the docker container starts, it would mount overlay filesystem
> with different selinux context, mount point such as:
> overlay on
> /var/lib/docker/overlay2/be3ef517730d92fc4530e0e952eae4f6cb0f07b4bc32
> 6cb07495ca08fc9ddb66/merged type overlay
> (rw,relatime,context="system_u:object_r:svirt_sandbox_file_t:s0:c414,
> c873",lowerdir=/var/lib/docker/overlay2/l/Z4U7WY6ASNV5CFWLADPARHHWY7:
> /var/lib/docker/overlay2/l/V2S3HOKEFEOQLHBVAL5WLA3YLS:/var/lib/docker
> /overlay2/l/46YGYO474KLOULZGDSZDW2JPRI,upperdir=/var/lib/docker/overl
> ay2/be3ef517730d92fc4530e0e952eae4f6cb0f07b4bc326cb07495ca08fc9ddb66/
> diff,workdir=/var/lib/docker/overlay2/be3ef517730d92fc4530e0e952eae4f
> 6cb0f07b4bc326cb07495ca08fc9ddb66/work)
> shm on
> /var/lib/docker/containers/9fd65e177d2132011d7b422755793449c91327ca57
> 7b8f5d9d6a4adf218d4876/shm type tmpfs
> (rw,nosuid,nodev,noexec,relatime,context="system_u:object_r:svirt_san
> dbox_file_t:s0:c414,c873",size=65536k)
> overlay on
> /var/lib/docker/overlay2/38d1544d080145c7d76150530d0255991dfb7258cbca
> 14ff6d165b94353eefab/merged type overlay
> (rw,relatime,context="system_u:object_r:svirt_sandbox_file_t:s0:c431,
> c651",lowerdir=/var/lib/docker/overlay2/l/3MQQXB4UCLFB7ANVRHPAVRCRSS:
> /var/lib/docker/overlay2/l/46YGYO474KLOULZGDSZDW2JPRI,upperdir=/var/l
> ib/docker/overlay2/38d1544d080145c7d76150530d0255991dfb7258cbca14ff6d
> 165b94353eefab/diff,workdir=/var/lib/docker/overlay2/38d1544d080145c7
> d76150530d0255991dfb7258cbca14ff6d165b94353eefab/work)
> shm on
> /var/lib/docker/containers/662e7f798fc08b09eae0f0f944537a4bcedc1dcf05
> a65866458523ffd4a71614/shm type tmpfs
> (rw,nosuid,nodev,noexec,relatime,context="system_u:object_r:svirt_san
> dbox_file_t:s0:c431,c651",size=65536k)
>
> sidtab_search_context check the context whether is in the sidtab list,
> If not found, a new node is generated and insert into the list, As the
> number of containers is increasing,  context nodes are also more and
> more, we tested the final number of nodes reached 300,000 +,
> sidtab_context_to_sid runtime needs 100-200ms, which will lead to the
> system softlockup.
>
> Is this a selinux bug? When filesystem umount, why context node is not
> deleted?  I cannot find the relevant function to delete the node in
> sidtab.c
>
> Thanks for reading and looking forward to your reply.

So, does docker just keep allocating a unique category set for every new container, never reusing them even if the container is destroyed?
That would be a bug in docker IMHO. Or are you creating an unbounded number of containers and never destroying the older ones?

On the selinux userspace side, we'd also like to eliminate the use of /sys/fs/selinux/user (sel_write_user -> security_get_user_sids) entirely, which is what triggered this for you.

We cannot currently delete a sidtab node because we have no way of knowing if there are any lingering references to the SID. Fixing that would require reference-counted SIDs, which goes beyond just SELinux since SIDs/secids are returned by LSM hooks and cached in other kernel data structures.

sidtab_search_context() could no doubt be optimized for the negative case; there was an earlier optimization for the positive case by adding a cache to sidtab_context_to_sid() prior to calling it. It's a reverse lookup in the sidtab.


2017-12-14 13:17:26

by Stephen Smalley

Subject: Re: [BUG]kernel softlockup due to sidtab_search_context run for long time because of too many sidtab context node

On Thu, 2017-12-14 at 03:19 +0000, yangjihong wrote:
> Hello,
>
> >  So, does docker just keep allocating a unique category set for
> > every new container, never reusing them even if the container is
> > destroyed? 
> >  That would be a bug in docker IMHO.  Or are you creating an
> > unbounded number of containers and never destroying the older ones?
>
> I creat a containers, then destroy it,  and create second one,
> destroy it.......
> When docker created, it will mount overlay fs, because every
> containers has different selinux context, so a new sidtab node is
> generated and insert into the sidtab list  
> When docker destroyed, it will umount overlay fs, but umount
> operation does not seem relevant to "delete the node" hooks function,
> resulting in longer and longer sidtab list
> I think when umount, its selinux context will never reuse, so sidtab
> node is useless, it is best to delete i

The "selinux context will never reuse" is IMHO a bug in docker; if you
truly destroy the container (i.e. don't just stop its execution, but
delete it entirely), then the context should be reusable.

> >  sidtab_search_context() could no doubt be optimized for the
> > negative case; there was an earlier optimization for the positive
> > case by adding a cache to sidtab_context_to_sid() prior to calling
> > it.  It's a reverse lookup in the sidtab.
>
> I think add cache may be not very userful, because every containers
> has different selinux context, so when one docker created, it will
> search the whole sidtab list, until compare the last node, When a new
> node arrives, it is always necessary to compare all the nodes first,
> and then insert. 
> All as long as the list does not delete the node, list will always
> increase, and search time will longer and longer, eventually leading
> to softlockup
>
>
> Is there any solution to this problem?

On the kernel side, we could certainly implement a reverse lookup hash
table. And there could be a faster way to quickly check whether a
given category set has ever been used if we wanted to specialize in
that manner. But that won't fix the fact that docker is allocating
unbounded security contexts.
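
As a rough sketch of what such a reverse lookup hash table could look like (the names, bucket count, and hash function are purely illustrative, not a proposed kernel patch):

#include <stdlib.h>
#include <string.h>

#define RLOOKUP_BUCKETS 1024    /* illustrative size */

struct rnode {
    unsigned int sid;
    char *ctx;
    struct rnode *next;         /* chain within one bucket */
};

struct rtable {
    struct rnode *bucket[RLOOKUP_BUCKETS];
};

/* Hash the context string into a bucket index. */
static unsigned int ctx_hash(const char *ctx)
{
    unsigned int h = 0;

    while (*ctx)
        h = h * 31 + (unsigned char)*ctx++;
    return h % RLOOKUP_BUCKETS;
}

/* Map context -> SID by searching only one bucket chain; on a miss,
 * insert a new node into that same chain. */
static unsigned int rtable_context_to_sid(struct rtable *t, const char *ctx,
                                          unsigned int new_sid)
{
    unsigned int h = ctx_hash(ctx);
    struct rnode *n;

    for (n = t->bucket[h]; n; n = n->next)
        if (strcmp(n->ctx, ctx) == 0)
            return n->sid;

    n = malloc(sizeof(*n));
    if (!n)
        return 0;
    n->sid = new_sid;
    n->ctx = strdup(ctx);
    n->next = t->bucket[h];
    t->bucket[h] = n;
    return n->sid;
}

Lookups and insertions then only traverse one bucket chain instead of every node ever allocated, although, as noted above, that still doesn't bound the growth of the table itself.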

2017-12-14 16:18:16

by Casey Schaufler

Subject: Re: [BUG]kernel softlockup due to sidtab_search_context run for long time because of too many sidtab context node

On 12/13/2017 7:18 AM, Stephen Smalley wrote:
> On Wed, 2017-12-13 at 09:25 +0000, yangjihong wrote:
>> Hello, 
>>
>> I am doing stressing testing on 3.10 kernel(centos 7.4), to
>> constantly starting numbers of docker ontainers with selinux enabled,
>> and after about 2 days, the kernel softlockup panic:
>>  <IRQ>  [<ffffffff810bb778>] sched_show_task+0xb8/0x120
>>  [<ffffffff8116133f>] show_lock_info+0x20f/0x3a0
>>  [<ffffffff811226aa>] watchdog_timer_fn+0x1da/0x2f0
>>  [<ffffffff811224d0>] ? watchdog_enable_all_cpus.part.4+0x40/0x40
>>  [<ffffffff810abf82>] __hrtimer_run_queues+0xd2/0x260
>>  [<ffffffff810ac520>] hrtimer_interrupt+0xb0/0x1e0
>>  [<ffffffff8104a477>] local_apic_timer_interrupt+0x37/0x60
>>  [<ffffffff8166fd90>] smp_apic_timer_interrupt+0x50/0x140
>>  [<ffffffff8166e1dd>] apic_timer_interrupt+0x6d/0x80
>>  <EOI>  [<ffffffff812b4193>] ? sidtab_context_to_sid+0xb3/0x480
>>  [<ffffffff812b41f0>] ? sidtab_context_to_sid+0x110/0x480
>>  [<ffffffff812c0d15>] ? mls_setup_user_range+0x145/0x250
>>  [<ffffffff812bd477>] security_get_user_sids+0x3f7/0x550
>>  [<ffffffff812b1a8b>] sel_write_user+0x12b/0x210
>>  [<ffffffff812b1960>] ? sel_write_member+0x200/0x200
>>  [<ffffffff812b01d8>] selinux_transaction_write+0x48/0x80
>>  [<ffffffff811f444d>] vfs_write+0xbd/0x1e0
>>  [<ffffffff811f4eef>] SyS_write+0x7f/0xe0
>>  [<ffffffff8166d433>] system_call_fastpath+0x16/0x1b
>>
>> My opinion:
>> when the docker container starts, it would mount overlay filesystem
>> with different selinux context, mount point such as: 
>> overlay on
>> /var/lib/docker/overlay2/be3ef517730d92fc4530e0e952eae4f6cb0f07b4bc32
>> 6cb07495ca08fc9ddb66/merged type overlay
>> (rw,relatime,context="system_u:object_r:svirt_sandbox_file_t:s0:c414,
>> c873",lowerdir=/var/lib/docker/overlay2/l/Z4U7WY6ASNV5CFWLADPARHHWY7:
>> /var/lib/docker/overlay2/l/V2S3HOKEFEOQLHBVAL5WLA3YLS:/var/lib/docker
>> /overlay2/l/46YGYO474KLOULZGDSZDW2JPRI,upperdir=/var/lib/docker/overl
>> ay2/be3ef517730d92fc4530e0e952eae4f6cb0f07b4bc326cb07495ca08fc9ddb66/
>> diff,workdir=/var/lib/docker/overlay2/be3ef517730d92fc4530e0e952eae4f
>> 6cb0f07b4bc326cb07495ca08fc9ddb66/work)
>> shm on
>> /var/lib/docker/containers/9fd65e177d2132011d7b422755793449c91327ca57
>> 7b8f5d9d6a4adf218d4876/shm type tmpfs
>> (rw,nosuid,nodev,noexec,relatime,context="system_u:object_r:svirt_san
>> dbox_file_t:s0:c414,c873",size=65536k)
>> overlay on
>> /var/lib/docker/overlay2/38d1544d080145c7d76150530d0255991dfb7258cbca
>> 14ff6d165b94353eefab/merged type overlay
>> (rw,relatime,context="system_u:object_r:svirt_sandbox_file_t:s0:c431,
>> c651",lowerdir=/var/lib/docker/overlay2/l/3MQQXB4UCLFB7ANVRHPAVRCRSS:
>> /var/lib/docker/overlay2/l/46YGYO474KLOULZGDSZDW2JPRI,upperdir=/var/l
>> ib/docker/overlay2/38d1544d080145c7d76150530d0255991dfb7258cbca14ff6d
>> 165b94353eefab/diff,workdir=/var/lib/docker/overlay2/38d1544d080145c7
>> d76150530d0255991dfb7258cbca14ff6d165b94353eefab/work)
>> shm on
>> /var/lib/docker/containers/662e7f798fc08b09eae0f0f944537a4bcedc1dcf05
>> a65866458523ffd4a71614/shm type tmpfs
>> (rw,nosuid,nodev,noexec,relatime,context="system_u:object_r:svirt_san
>> dbox_file_t:s0:c431,c651",size=65536k)
>>
>> sidtab_search_context check the context whether is in the sidtab
>> list, If not found, a new node is generated and insert into the list,
>> As the number of containers is increasing,  context nodes are also
>> more and more, we tested the final number of nodes reached 300,000 +,
>> sidtab_context_to_sid runtime needs 100-200ms, which will lead to the
>> system softlockup.
>>
>> Is this a selinux bug? When filesystem umount, why context node is
>> not deleted?  I cannot find the relevant function to delete the node
>> in sidtab.c
>>
>> Thanks for reading and looking forward to your reply.
> So, does docker just keep allocating a unique category set for every
> new container, never reusing them even if the container is destroyed?
> That would be a bug in docker IMHO. Or are you creating an unbounded
> number of containers and never destroying the older ones?

You can't reuse the security context. A process in ContainerA sends
a labeled packet to MachineB. ContainerA goes away and its context
is recycled in ContainerC. MachineB responds some time later, again
with a labeled packet. ContainerC gets information intended for
ContainerA, and uses the information to take over the Elbonian
government.

> On the selinux userspace side, we'd also like to eliminate the use of
> /sys/fs/selinux/user (sel_write_user -> security_get_user_sids)
> entirely, which is what triggered this for you.
>
> We cannot currently delete a sidtab node because we have no way of
> knowing if there are any lingering references to the SID. Fixing that
> would require reference-counted SIDs, which goes beyond just SELinux
> since SIDs/secids are returned by LSM hooks and cached in other kernel
> data structures.

You could delete a sidtab node. The code already deals with unfindable
SIDs. The issue is that eventually you run out of SIDs. Then you are
forced to recycle SIDs, which leads to the overthrow of the Elbonian
government.

> sidtab_search_context() could no doubt be optimized for the negative
> case; there was an earlier optimization for the positive case by adding
> a cache to sidtab_context_to_sid() prior to calling it. It's a reverse
> lookup in the sidtab.

This seems like a bad idea.

2017-12-14 16:52:26

by Stephen Smalley

Subject: Re: [BUG]kernel softlockup due to sidtab_search_context run for long time because of too many sidtab context node

On Thu, 2017-12-14 at 08:18 -0800, Casey Schaufler wrote:
> On 12/13/2017 7:18 AM, Stephen Smalley wrote:
> > On Wed, 2017-12-13 at 09:25 +0000, yangjihong wrote:
> > > Hello, 
> > >
> > > I am doing stressing testing on 3.10 kernel(centos 7.4), to
> > > constantly starting numbers of docker ontainers with selinux
> > > enabled,
> > > and after about 2 days, the kernel softlockup panic:
> > >  <IRQ>  [<ffffffff810bb778>] sched_show_task+0xb8/0x120
> > >  [<ffffffff8116133f>] show_lock_info+0x20f/0x3a0
> > >  [<ffffffff811226aa>] watchdog_timer_fn+0x1da/0x2f0
> > >  [<ffffffff811224d0>] ? watchdog_enable_all_cpus.part.4+0x40/0x40
> > >  [<ffffffff810abf82>] __hrtimer_run_queues+0xd2/0x260
> > >  [<ffffffff810ac520>] hrtimer_interrupt+0xb0/0x1e0
> > >  [<ffffffff8104a477>] local_apic_timer_interrupt+0x37/0x60
> > >  [<ffffffff8166fd90>] smp_apic_timer_interrupt+0x50/0x140
> > >  [<ffffffff8166e1dd>] apic_timer_interrupt+0x6d/0x80
> > >  <EOI>  [<ffffffff812b4193>] ? sidtab_context_to_sid+0xb3/0x480
> > >  [<ffffffff812b41f0>] ? sidtab_context_to_sid+0x110/0x480
> > >  [<ffffffff812c0d15>] ? mls_setup_user_range+0x145/0x250
> > >  [<ffffffff812bd477>] security_get_user_sids+0x3f7/0x550
> > >  [<ffffffff812b1a8b>] sel_write_user+0x12b/0x210
> > >  [<ffffffff812b1960>] ? sel_write_member+0x200/0x200
> > >  [<ffffffff812b01d8>] selinux_transaction_write+0x48/0x80
> > >  [<ffffffff811f444d>] vfs_write+0xbd/0x1e0
> > >  [<ffffffff811f4eef>] SyS_write+0x7f/0xe0
> > >  [<ffffffff8166d433>] system_call_fastpath+0x16/0x1b
> > >
> > > My opinion:
> > > when the docker container starts, it would mount overlay
> > > filesystem
> > > with different selinux context, mount point such as: 
> > > overlay on
> > > /var/lib/docker/overlay2/be3ef517730d92fc4530e0e952eae4f6cb0f07b4
> > > bc32
> > > 6cb07495ca08fc9ddb66/merged type overlay
> > > (rw,relatime,context="system_u:object_r:svirt_sandbox_file_t:s0:c
> > > 414,
> > > c873",lowerdir=/var/lib/docker/overlay2/l/Z4U7WY6ASNV5CFWLADPARHH
> > > WY7:
> > > /var/lib/docker/overlay2/l/V2S3HOKEFEOQLHBVAL5WLA3YLS:/var/lib/do
> > > cker
> > > /overlay2/l/46YGYO474KLOULZGDSZDW2JPRI,upperdir=/var/lib/docker/o
> > > verl
> > > ay2/be3ef517730d92fc4530e0e952eae4f6cb0f07b4bc326cb07495ca08fc9dd
> > > b66/
> > > diff,workdir=/var/lib/docker/overlay2/be3ef517730d92fc4530e0e952e
> > > ae4f
> > > 6cb0f07b4bc326cb07495ca08fc9ddb66/work)
> > > shm on
> > > /var/lib/docker/containers/9fd65e177d2132011d7b422755793449c91327
> > > ca57
> > > 7b8f5d9d6a4adf218d4876/shm type tmpfs
> > > (rw,nosuid,nodev,noexec,relatime,context="system_u:object_r:svirt
> > > _san
> > > dbox_file_t:s0:c414,c873",size=65536k)
> > > overlay on
> > > /var/lib/docker/overlay2/38d1544d080145c7d76150530d0255991dfb7258
> > > cbca
> > > 14ff6d165b94353eefab/merged type overlay
> > > (rw,relatime,context="system_u:object_r:svirt_sandbox_file_t:s0:c
> > > 431,
> > > c651",lowerdir=/var/lib/docker/overlay2/l/3MQQXB4UCLFB7ANVRHPAVRC
> > > RSS:
> > > /var/lib/docker/overlay2/l/46YGYO474KLOULZGDSZDW2JPRI,upperdir=/v
> > > ar/l
> > > ib/docker/overlay2/38d1544d080145c7d76150530d0255991dfb7258cbca14
> > > ff6d
> > > 165b94353eefab/diff,workdir=/var/lib/docker/overlay2/38d1544d0801
> > > 45c7
> > > d76150530d0255991dfb7258cbca14ff6d165b94353eefab/work)
> > > shm on
> > > /var/lib/docker/containers/662e7f798fc08b09eae0f0f944537a4bcedc1d
> > > cf05
> > > a65866458523ffd4a71614/shm type tmpfs
> > > (rw,nosuid,nodev,noexec,relatime,context="system_u:object_r:svirt
> > > _san
> > > dbox_file_t:s0:c431,c651",size=65536k)
> > >
> > > sidtab_search_context check the context whether is in the sidtab
> > > list, If not found, a new node is generated and insert into the
> > > list,
> > > As the number of containers is increasing,  context nodes are
> > > also
> > > more and more, we tested the final number of nodes reached
> > > 300,000 +,
> > > sidtab_context_to_sid runtime needs 100-200ms, which will lead to
> > > the
> > > system softlockup.
> > >
> > > Is this a selinux bug? When filesystem umount, why context node
> > > is
> > > not deleted?  I cannot find the relevant function to delete the
> > > node
> > > in sidtab.c
> > >
> > > Thanks for reading and looking forward to your reply.
> >
> > So, does docker just keep allocating a unique category set for
> > every
> > new container, never reusing them even if the container is
> > destroyed? 
> > That would be a bug in docker IMHO.  Or are you creating an
> > unbounded
> > number of containers and never destroying the older ones?
>
> You can't reuse the security context. A process in ContainerA sends
> a labeled packet to MachineB. ContainerA goes away and its context
> is recycled in ContainerC. MachineB responds some time later, again
> with a labeled packet. ContainerC gets information intended for
> ContainerA, and uses the information to take over the Elbonian
> government.

Docker isn't using labeled networking (nor is anything else by default;
it is only enabled if explicitly configured).

>
> > On the selinux userspace side, we'd also like to eliminate the use
> > of
> > /sys/fs/selinux/user (sel_write_user -> security_get_user_sids)
> > entirely, which is what triggered this for you.
> >
> > We cannot currently delete a sidtab node because we have no way of
> > knowing if there are any lingering references to the SID.  Fixing
> > that
> > would require reference-counted SIDs, which goes beyond just
> > SELinux
> > since SIDs/secids are returned by LSM hooks and cached in other
> > kernel
> > data structures.
>
> You could delete a sidtab node. The code already deals with
> unfindable
> SIDs. The issue is that eventually you run out of SIDs. Then you are
> forced to recycle SIDs, which leads to the overthrow of the Elbonian
> government.

We don't know when we can safely delete a sidtab node since SIDs aren't
reference counted and we can't know whether it is still in use
somewhere in the kernel. Doing so prematurely would lead to the SID
being remapped to the unlabeled context, and then likely to undesired
denials.

>
> > sidtab_search_context() could no doubt be optimized for the
> > negative
> > case; there was an earlier optimization for the positive case by
> > adding
> > a cache to sidtab_context_to_sid() prior to calling it.  It's a
> > reverse
> > lookup in the sidtab.
>
> This seems like a bad idea.

Not sure what you mean, but it can certainly be changed to at least use
a hash table for these reverse lookups.

2017-12-14 17:00:53

by Casey Schaufler

Subject: Re: [BUG]kernel softlockup due to sidtab_search_context run for long time because of too many sidtab context node

On 12/14/2017 8:42 AM, Stephen Smalley wrote:
> On Thu, 2017-12-14 at 08:18 -0800, Casey Schaufler wrote:
>> On 12/13/2017 7:18 AM, Stephen Smalley wrote:
>>> On Wed, 2017-12-13 at 09:25 +0000, yangjihong wrote:
>>>> Hello, 
>>>>
>>>> I am doing stressing testing on 3.10 kernel(centos 7.4), to
>>>> constantly starting numbers of docker ontainers with selinux
>>>> enabled,
>>>> and after about 2 days, the kernel softlockup panic:
>>>>  <IRQ>  [<ffffffff810bb778>] sched_show_task+0xb8/0x120
>>>>  [<ffffffff8116133f>] show_lock_info+0x20f/0x3a0
>>>>  [<ffffffff811226aa>] watchdog_timer_fn+0x1da/0x2f0
>>>>  [<ffffffff811224d0>] ? watchdog_enable_all_cpus.part.4+0x40/0x40
>>>>  [<ffffffff810abf82>] __hrtimer_run_queues+0xd2/0x260
>>>>  [<ffffffff810ac520>] hrtimer_interrupt+0xb0/0x1e0
>>>>  [<ffffffff8104a477>] local_apic_timer_interrupt+0x37/0x60
>>>>  [<ffffffff8166fd90>] smp_apic_timer_interrupt+0x50/0x140
>>>>  [<ffffffff8166e1dd>] apic_timer_interrupt+0x6d/0x80
>>>>  <EOI>  [<ffffffff812b4193>] ? sidtab_context_to_sid+0xb3/0x480
>>>>  [<ffffffff812b41f0>] ? sidtab_context_to_sid+0x110/0x480
>>>>  [<ffffffff812c0d15>] ? mls_setup_user_range+0x145/0x250
>>>>  [<ffffffff812bd477>] security_get_user_sids+0x3f7/0x550
>>>>  [<ffffffff812b1a8b>] sel_write_user+0x12b/0x210
>>>>  [<ffffffff812b1960>] ? sel_write_member+0x200/0x200
>>>>  [<ffffffff812b01d8>] selinux_transaction_write+0x48/0x80
>>>>  [<ffffffff811f444d>] vfs_write+0xbd/0x1e0
>>>>  [<ffffffff811f4eef>] SyS_write+0x7f/0xe0
>>>>  [<ffffffff8166d433>] system_call_fastpath+0x16/0x1b
>>>>
>>>> My opinion:
>>>> when the docker container starts, it would mount overlay
>>>> filesystem
>>>> with different selinux context, mount point such as: 
>>>> overlay on
>>>> /var/lib/docker/overlay2/be3ef517730d92fc4530e0e952eae4f6cb0f07b4
>>>> bc32
>>>> 6cb07495ca08fc9ddb66/merged type overlay
>>>> (rw,relatime,context="system_u:object_r:svirt_sandbox_file_t:s0:c
>>>> 414,
>>>> c873",lowerdir=/var/lib/docker/overlay2/l/Z4U7WY6ASNV5CFWLADPARHH
>>>> WY7:
>>>> /var/lib/docker/overlay2/l/V2S3HOKEFEOQLHBVAL5WLA3YLS:/var/lib/do
>>>> cker
>>>> /overlay2/l/46YGYO474KLOULZGDSZDW2JPRI,upperdir=/var/lib/docker/o
>>>> verl
>>>> ay2/be3ef517730d92fc4530e0e952eae4f6cb0f07b4bc326cb07495ca08fc9dd
>>>> b66/
>>>> diff,workdir=/var/lib/docker/overlay2/be3ef517730d92fc4530e0e952e
>>>> ae4f
>>>> 6cb0f07b4bc326cb07495ca08fc9ddb66/work)
>>>> shm on
>>>> /var/lib/docker/containers/9fd65e177d2132011d7b422755793449c91327
>>>> ca57
>>>> 7b8f5d9d6a4adf218d4876/shm type tmpfs
>>>> (rw,nosuid,nodev,noexec,relatime,context="system_u:object_r:svirt
>>>> _san
>>>> dbox_file_t:s0:c414,c873",size=65536k)
>>>> overlay on
>>>> /var/lib/docker/overlay2/38d1544d080145c7d76150530d0255991dfb7258
>>>> cbca
>>>> 14ff6d165b94353eefab/merged type overlay
>>>> (rw,relatime,context="system_u:object_r:svirt_sandbox_file_t:s0:c
>>>> 431,
>>>> c651",lowerdir=/var/lib/docker/overlay2/l/3MQQXB4UCLFB7ANVRHPAVRC
>>>> RSS:
>>>> /var/lib/docker/overlay2/l/46YGYO474KLOULZGDSZDW2JPRI,upperdir=/v
>>>> ar/l
>>>> ib/docker/overlay2/38d1544d080145c7d76150530d0255991dfb7258cbca14
>>>> ff6d
>>>> 165b94353eefab/diff,workdir=/var/lib/docker/overlay2/38d1544d0801
>>>> 45c7
>>>> d76150530d0255991dfb7258cbca14ff6d165b94353eefab/work)
>>>> shm on
>>>> /var/lib/docker/containers/662e7f798fc08b09eae0f0f944537a4bcedc1d
>>>> cf05
>>>> a65866458523ffd4a71614/shm type tmpfs
>>>> (rw,nosuid,nodev,noexec,relatime,context="system_u:object_r:svirt
>>>> _san
>>>> dbox_file_t:s0:c431,c651",size=65536k)
>>>>
>>>> sidtab_search_context check the context whether is in the sidtab
>>>> list, If not found, a new node is generated and insert into the
>>>> list,
>>>> As the number of containers is increasing,  context nodes are
>>>> also
>>>> more and more, we tested the final number of nodes reached
>>>> 300,000 +,
>>>> sidtab_context_to_sid runtime needs 100-200ms, which will lead to
>>>> the
>>>> system softlockup.
>>>>
>>>> Is this a selinux bug? When filesystem umount, why context node
>>>> is
>>>> not deleted?  I cannot find the relevant function to delete the
>>>> node
>>>> in sidtab.c
>>>>
>>>> Thanks for reading and looking forward to your reply.
>>> So, does docker just keep allocating a unique category set for
>>> every
>>> new container, never reusing them even if the container is
>>> destroyed? 
>>> That would be a bug in docker IMHO.  Or are you creating an
>>> unbounded
>>> number of containers and never destroying the older ones?
>> You can't reuse the security context. A process in ContainerA sends
>> a labeled packet to MachineB. ContainerA goes away and its context
>> is recycled in ContainerC. MachineB responds some time later, again
>> with a labeled packet. ContainerC gets information intended for
>> ContainerA, and uses the information to take over the Elbonian
>> government.
> Docker isn't using labeled networking (nor is anything else by default;
> it is only enabled if explicitly configured).

If labeled networking weren't an issue we'd have full security
module stacking by now. Yes, it's an edge case. If you want to
use labeled NFS or a local filesystem that gets mounted in each
container (don't tell me that nobody would do that) you've got
the same problem.


>>> On the selinux userspace side, we'd also like to eliminate the use
>>> of
>>> /sys/fs/selinux/user (sel_write_user -> security_get_user_sids)
>>> entirely, which is what triggered this for you.
>>>
>>> We cannot currently delete a sidtab node because we have no way of
>>> knowing if there are any lingering references to the SID.  Fixing
>>> that
>>> would require reference-counted SIDs, which goes beyond just
>>> SELinux
>>> since SIDs/secids are returned by LSM hooks and cached in other
>>> kernel
>>> data structures.
>> You could delete a sidtab node. The code already deals with
>> unfindable
>> SIDs. The issue is that eventually you run out of SIDs. Then you are
>> forced to recycle SIDs, which leads to the overthrow of the Elbonian
>> government.
> We don't know when we can safely delete a sidtab node since SIDs aren't
> reference counted and we can't know whether it is still in use
> somewhere in the kernel. Doing so prematurely would lead to the SID
> being remapped to the unlabeled context, and then likely to undesired
> denials.

I would suggest that if you delete a sidtab node and someone
comes along later and tries to use it that denial is exactly
what you would desire. I don't see any other rational action.

>>> sidtab_search_context() could no doubt be optimized for the
>>> negative
>>> case; there was an earlier optimization for the positive case by
>>> adding
>>> a cache to sidtab_context_to_sid() prior to calling it.  It's a
>>> reverse
>>> lookup in the sidtab.
>> This seems like a bad idea.
> Not sure what you mean, but it can certainly be changed to at least use
> a hash table for these reverse lookups.
>
>

2017-12-14 17:15:47

by Stephen Smalley

Subject: Re: [BUG]kernel softlockup due to sidtab_search_context run for long time because of too many sidtab context node

On Thu, 2017-12-14 at 09:00 -0800, Casey Schaufler wrote:
> On 12/14/2017 8:42 AM, Stephen Smalley wrote:
> > On Thu, 2017-12-14 at 08:18 -0800, Casey Schaufler wrote:
> > > On 12/13/2017 7:18 AM, Stephen Smalley wrote:
> > > > On Wed, 2017-12-13 at 09:25 +0000, yangjihong wrote:
> > > > > Hello, 
> > > > >
> > > > > I am doing stressing testing on 3.10 kernel(centos 7.4), to
> > > > > constantly starting numbers of docker ontainers with selinux
> > > > > enabled,
> > > > > and after about 2 days, the kernel softlockup panic:
> > > > >  <IRQ>  [<ffffffff810bb778>] sched_show_task+0xb8/0x120
> > > > >  [<ffffffff8116133f>] show_lock_info+0x20f/0x3a0
> > > > >  [<ffffffff811226aa>] watchdog_timer_fn+0x1da/0x2f0
> > > > >  [<ffffffff811224d0>] ?
> > > > > watchdog_enable_all_cpus.part.4+0x40/0x40
> > > > >  [<ffffffff810abf82>] __hrtimer_run_queues+0xd2/0x260
> > > > >  [<ffffffff810ac520>] hrtimer_interrupt+0xb0/0x1e0
> > > > >  [<ffffffff8104a477>] local_apic_timer_interrupt+0x37/0x60
> > > > >  [<ffffffff8166fd90>] smp_apic_timer_interrupt+0x50/0x140
> > > > >  [<ffffffff8166e1dd>] apic_timer_interrupt+0x6d/0x80
> > > > >  <EOI>  [<ffffffff812b4193>] ?
> > > > > sidtab_context_to_sid+0xb3/0x480
> > > > >  [<ffffffff812b41f0>] ? sidtab_context_to_sid+0x110/0x480
> > > > >  [<ffffffff812c0d15>] ? mls_setup_user_range+0x145/0x250
> > > > >  [<ffffffff812bd477>] security_get_user_sids+0x3f7/0x550
> > > > >  [<ffffffff812b1a8b>] sel_write_user+0x12b/0x210
> > > > >  [<ffffffff812b1960>] ? sel_write_member+0x200/0x200
> > > > >  [<ffffffff812b01d8>] selinux_transaction_write+0x48/0x80
> > > > >  [<ffffffff811f444d>] vfs_write+0xbd/0x1e0
> > > > >  [<ffffffff811f4eef>] SyS_write+0x7f/0xe0
> > > > >  [<ffffffff8166d433>] system_call_fastpath+0x16/0x1b
> > > > >
> > > > > My opinion:
> > > > > when the docker container starts, it would mount overlay
> > > > > filesystem
> > > > > with different selinux context, mount point such as: 
> > > > > overlay on
> > > > > /var/lib/docker/overlay2/be3ef517730d92fc4530e0e952eae4f6cb0f
> > > > > 07b4
> > > > > bc32
> > > > > 6cb07495ca08fc9ddb66/merged type overlay
> > > > > (rw,relatime,context="system_u:object_r:svirt_sandbox_file_t:
> > > > > s0:c
> > > > > 414,
> > > > > c873",lowerdir=/var/lib/docker/overlay2/l/Z4U7WY6ASNV5CFWLADP
> > > > > ARHH
> > > > > WY7:
> > > > > /var/lib/docker/overlay2/l/V2S3HOKEFEOQLHBVAL5WLA3YLS:/var/li
> > > > > b/do
> > > > > cker
> > > > > /overlay2/l/46YGYO474KLOULZGDSZDW2JPRI,upperdir=/var/lib/dock
> > > > > er/o
> > > > > verl
> > > > > ay2/be3ef517730d92fc4530e0e952eae4f6cb0f07b4bc326cb07495ca08f
> > > > > c9dd
> > > > > b66/
> > > > > diff,workdir=/var/lib/docker/overlay2/be3ef517730d92fc4530e0e
> > > > > 952e
> > > > > ae4f
> > > > > 6cb0f07b4bc326cb07495ca08fc9ddb66/work)
> > > > > shm on
> > > > > /var/lib/docker/containers/9fd65e177d2132011d7b422755793449c9
> > > > > 1327
> > > > > ca57
> > > > > 7b8f5d9d6a4adf218d4876/shm type tmpfs
> > > > > (rw,nosuid,nodev,noexec,relatime,context="system_u:object_r:s
> > > > > virt
> > > > > _san
> > > > > dbox_file_t:s0:c414,c873",size=65536k)
> > > > > overlay on
> > > > > /var/lib/docker/overlay2/38d1544d080145c7d76150530d0255991dfb
> > > > > 7258
> > > > > cbca
> > > > > 14ff6d165b94353eefab/merged type overlay
> > > > > (rw,relatime,context="system_u:object_r:svirt_sandbox_file_t:
> > > > > s0:c
> > > > > 431,
> > > > > c651",lowerdir=/var/lib/docker/overlay2/l/3MQQXB4UCLFB7ANVRHP
> > > > > AVRC
> > > > > RSS:
> > > > > /var/lib/docker/overlay2/l/46YGYO474KLOULZGDSZDW2JPRI,upperdi
> > > > > r=/v
> > > > > ar/l
> > > > > ib/docker/overlay2/38d1544d080145c7d76150530d0255991dfb7258cb
> > > > > ca14
> > > > > ff6d
> > > > > 165b94353eefab/diff,workdir=/var/lib/docker/overlay2/38d1544d
> > > > > 0801
> > > > > 45c7
> > > > > d76150530d0255991dfb7258cbca14ff6d165b94353eefab/work)
> > > > > shm on
> > > > > /var/lib/docker/containers/662e7f798fc08b09eae0f0f944537a4bce
> > > > > dc1d
> > > > > cf05
> > > > > a65866458523ffd4a71614/shm type tmpfs
> > > > > (rw,nosuid,nodev,noexec,relatime,context="system_u:object_r:s
> > > > > virt
> > > > > _san
> > > > > dbox_file_t:s0:c431,c651",size=65536k)
> > > > >
> > > > > sidtab_search_context check the context whether is in the
> > > > > sidtab
> > > > > list, If not found, a new node is generated and insert into
> > > > > the
> > > > > list,
> > > > > As the number of containers is increasing,  context nodes are
> > > > > also
> > > > > more and more, we tested the final number of nodes reached
> > > > > 300,000 +,
> > > > > sidtab_context_to_sid runtime needs 100-200ms, which will
> > > > > lead to
> > > > > the
> > > > > system softlockup.
> > > > >
> > > > > Is this a selinux bug? When filesystem umount, why context
> > > > > node
> > > > > is
> > > > > not deleted?  I cannot find the relevant function to delete
> > > > > the
> > > > > node
> > > > > in sidtab.c
> > > > >
> > > > > Thanks for reading and looking forward to your reply.
> > > >
> > > > So, does docker just keep allocating a unique category set for
> > > > every
> > > > new container, never reusing them even if the container is
> > > > destroyed? 
> > > > That would be a bug in docker IMHO.  Or are you creating an
> > > > unbounded
> > > > number of containers and never destroying the older ones?
> > >
> > > You can't reuse the security context. A process in ContainerA
> > > sends
> > > a labeled packet to MachineB. ContainerA goes away and its
> > > context
> > > is recycled in ContainerC. MachineB responds some time later,
> > > again
> > > with a labeled packet. ContainerC gets information intended for
> > > ContainerA, and uses the information to take over the Elbonian
> > > government.
> >
> > Docker isn't using labeled networking (nor is anything else by
> > default;
> > it is only enabled if explicitly configured).
>
> If labeled networking weren't an issue we'd have full security
> module stacking by now. Yes, it's an edge case. If you want to
> use labeled NFS or a local filesystem that gets mounted in each
> container (don't tell me that nobody would do that) you've got
> the same problem.

Even if someone were to configure labeled networking, Docker is not
presently relying on that or SELinux network enforcement for any
security properties, so it really doesn't matter. And if they wanted
to do that, they'd have to coordinate category assignments across all
systems involved, for which no facility exists AFAIK. If you have two
docker instances running on different hosts, I'd wager that they can
hand out the same category sets today to different containers.

With respect to labeled NFS, that's also not the default for nfs
mounts, so again it is a custom configuration and Docker isn't relying
on it for any guarantees today. For local filesystems, they would
normally be context-mounted or using genfscon rather than xattrs in
order to be accessible to the container, thus no persistent storage of
the category sets.

Certainly docker could provide an option to not reuse category sets,
but making that the default is not sane and just guarantees exhaustion
of the SID and context space (just create and tear down lots of
containers every day or more frequently).

>
> > > > On the selinux userspace side, we'd also like to eliminate the
> > > > use
> > > > of
> > > > /sys/fs/selinux/user (sel_write_user -> security_get_user_sids)
> > > > entirely, which is what triggered this for you.
> > > >
> > > > We cannot currently delete a sidtab node because we have no way
> > > > of
> > > > knowing if there are any lingering references to the
> > > > SID.  Fixing
> > > > that
> > > > would require reference-counted SIDs, which goes beyond just
> > > > SELinux
> > > > since SIDs/secids are returned by LSM hooks and cached in other
> > > > kernel
> > > > data structures.
> > >
> > > You could delete a sidtab node. The code already deals with
> > > unfindable
> > > SIDs. The issue is that eventually you run out of SIDs. Then you
> > > are
> > > forced to recycle SIDs, which leads to the overthrow of the
> > > Elbonian
> > > government.
> >
> > We don't know when we can safely delete a sidtab node since SIDs
> > aren't
> > reference counted and we can't know whether it is still in use
> > somewhere in the kernel.  Doing so prematurely would lead to the
> > SID
> > being remapped to the unlabeled context, and then likely to
> > undesired
> > denials.
>
> I would suggest that if you delete a sidtab node and someone
> comes along later and tries to use it that denial is exactly
> what you would desire. I don't see any other rational action.

Yes, if we know that the SID wasn't in use at the time we tore it down.
But if we're just randomly deleting sidtab entries based on age or
something (since we have no reference count), we'll almost certainly
encounter situations where a SID hasn't been accessed in a long time
but is still being legitimately cached somewhere. Just a file that
hasn't been accessed in a while might have that SID still cached in its
inode security blob, or anywhere else.

>
> > > > sidtab_search_context() could no doubt be optimized for the
> > > > negative
> > > > case; there was an earlier optimization for the positive case
> > > > by
> > > > adding
> > > > a cache to sidtab_context_to_sid() prior to calling it.  It's a
> > > > reverse
> > > > lookup in the sidtab.
> > >
> > > This seems like a bad idea.
> >
> > Not sure what you mean, but it can certainly be changed to at least
> > use
> > a hash table for these reverse lookups.
> >
> >

2017-12-14 17:42:37

by Casey Schaufler

Subject: Re: [BUG]kernel softlockup due to sidtab_search_context run for long time because of too many sidtab context node

On 12/14/2017 9:15 AM, Stephen Smalley wrote:
> On Thu, 2017-12-14 at 09:00 -0800, Casey Schaufler wrote:
>> On 12/14/2017 8:42 AM, Stephen Smalley wrote:
>>> On Thu, 2017-12-14 at 08:18 -0800, Casey Schaufler wrote:
>>>> On 12/13/2017 7:18 AM, Stephen Smalley wrote:
>>>>> On Wed, 2017-12-13 at 09:25 +0000, yangjihong wrote:
>>>>>> Hello, 
>>>>>>
>>>>>> I am doing stressing testing on 3.10 kernel(centos 7.4), to
>>>>>> constantly starting numbers of docker ontainers with selinux
>>>>>> enabled,
>>>>>> and after about 2 days, the kernel softlockup panic:
>>>>>>  <IRQ>  [<ffffffff810bb778>] sched_show_task+0xb8/0x120
>>>>>>  [<ffffffff8116133f>] show_lock_info+0x20f/0x3a0
>>>>>>  [<ffffffff811226aa>] watchdog_timer_fn+0x1da/0x2f0
>>>>>>  [<ffffffff811224d0>] ?
>>>>>> watchdog_enable_all_cpus.part.4+0x40/0x40
>>>>>>  [<ffffffff810abf82>] __hrtimer_run_queues+0xd2/0x260
>>>>>>  [<ffffffff810ac520>] hrtimer_interrupt+0xb0/0x1e0
>>>>>>  [<ffffffff8104a477>] local_apic_timer_interrupt+0x37/0x60
>>>>>>  [<ffffffff8166fd90>] smp_apic_timer_interrupt+0x50/0x140
>>>>>>  [<ffffffff8166e1dd>] apic_timer_interrupt+0x6d/0x80
>>>>>>  <EOI>  [<ffffffff812b4193>] ?
>>>>>> sidtab_context_to_sid+0xb3/0x480
>>>>>>  [<ffffffff812b41f0>] ? sidtab_context_to_sid+0x110/0x480
>>>>>>  [<ffffffff812c0d15>] ? mls_setup_user_range+0x145/0x250
>>>>>>  [<ffffffff812bd477>] security_get_user_sids+0x3f7/0x550
>>>>>>  [<ffffffff812b1a8b>] sel_write_user+0x12b/0x210
>>>>>>  [<ffffffff812b1960>] ? sel_write_member+0x200/0x200
>>>>>>  [<ffffffff812b01d8>] selinux_transaction_write+0x48/0x80
>>>>>>  [<ffffffff811f444d>] vfs_write+0xbd/0x1e0
>>>>>>  [<ffffffff811f4eef>] SyS_write+0x7f/0xe0
>>>>>>  [<ffffffff8166d433>] system_call_fastpath+0x16/0x1b
>>>>>>
>>>>>> My opinion:
>>>>>> when the docker container starts, it would mount overlay
>>>>>> filesystem
>>>>>> with different selinux context, mount point such as: 
>>>>>> overlay on
>>>>>> /var/lib/docker/overlay2/be3ef517730d92fc4530e0e952eae4f6cb0f
>>>>>> 07b4
>>>>>> bc32
>>>>>> 6cb07495ca08fc9ddb66/merged type overlay
>>>>>> (rw,relatime,context="system_u:object_r:svirt_sandbox_file_t:
>>>>>> s0:c
>>>>>> 414,
>>>>>> c873",lowerdir=/var/lib/docker/overlay2/l/Z4U7WY6ASNV5CFWLADP
>>>>>> ARHH
>>>>>> WY7:
>>>>>> /var/lib/docker/overlay2/l/V2S3HOKEFEOQLHBVAL5WLA3YLS:/var/li
>>>>>> b/do
>>>>>> cker
>>>>>> /overlay2/l/46YGYO474KLOULZGDSZDW2JPRI,upperdir=/var/lib/dock
>>>>>> er/o
>>>>>> verl
>>>>>> ay2/be3ef517730d92fc4530e0e952eae4f6cb0f07b4bc326cb07495ca08f
>>>>>> c9dd
>>>>>> b66/
>>>>>> diff,workdir=/var/lib/docker/overlay2/be3ef517730d92fc4530e0e
>>>>>> 952e
>>>>>> ae4f
>>>>>> 6cb0f07b4bc326cb07495ca08fc9ddb66/work)
>>>>>> shm on
>>>>>> /var/lib/docker/containers/9fd65e177d2132011d7b422755793449c9
>>>>>> 1327
>>>>>> ca57
>>>>>> 7b8f5d9d6a4adf218d4876/shm type tmpfs
>>>>>> (rw,nosuid,nodev,noexec,relatime,context="system_u:object_r:s
>>>>>> virt
>>>>>> _san
>>>>>> dbox_file_t:s0:c414,c873",size=65536k)
>>>>>> overlay on
>>>>>> /var/lib/docker/overlay2/38d1544d080145c7d76150530d0255991dfb
>>>>>> 7258
>>>>>> cbca
>>>>>> 14ff6d165b94353eefab/merged type overlay
>>>>>> (rw,relatime,context="system_u:object_r:svirt_sandbox_file_t:
>>>>>> s0:c
>>>>>> 431,
>>>>>> c651",lowerdir=/var/lib/docker/overlay2/l/3MQQXB4UCLFB7ANVRHP
>>>>>> AVRC
>>>>>> RSS:
>>>>>> /var/lib/docker/overlay2/l/46YGYO474KLOULZGDSZDW2JPRI,upperdi
>>>>>> r=/v
>>>>>> ar/l
>>>>>> ib/docker/overlay2/38d1544d080145c7d76150530d0255991dfb7258cb
>>>>>> ca14
>>>>>> ff6d
>>>>>> 165b94353eefab/diff,workdir=/var/lib/docker/overlay2/38d1544d
>>>>>> 0801
>>>>>> 45c7
>>>>>> d76150530d0255991dfb7258cbca14ff6d165b94353eefab/work)
>>>>>> shm on
>>>>>> /var/lib/docker/containers/662e7f798fc08b09eae0f0f944537a4bce
>>>>>> dc1d
>>>>>> cf05
>>>>>> a65866458523ffd4a71614/shm type tmpfs
>>>>>> (rw,nosuid,nodev,noexec,relatime,context="system_u:object_r:s
>>>>>> virt
>>>>>> _san
>>>>>> dbox_file_t:s0:c431,c651",size=65536k)
>>>>>>
>>>>>> sidtab_search_context check the context whether is in the
>>>>>> sidtab
>>>>>> list, If not found, a new node is generated and insert into
>>>>>> the
>>>>>> list,
>>>>>> As the number of containers is increasing,  context nodes are
>>>>>> also
>>>>>> more and more, we tested the final number of nodes reached
>>>>>> 300,000 +,
>>>>>> sidtab_context_to_sid runtime needs 100-200ms, which will
>>>>>> lead to
>>>>>> the
>>>>>> system softlockup.
>>>>>>
>>>>>> Is this a selinux bug? When filesystem umount, why context
>>>>>> node
>>>>>> is
>>>>>> not deleted?  I cannot find the relevant function to delete
>>>>>> the
>>>>>> node
>>>>>> in sidtab.c
>>>>>>
>>>>>> Thanks for reading and looking forward to your reply.
>>>>> So, does docker just keep allocating a unique category set for
>>>>> every
>>>>> new container, never reusing them even if the container is
>>>>> destroyed? 
>>>>> That would be a bug in docker IMHO.  Or are you creating an
>>>>> unbounded
>>>>> number of containers and never destroying the older ones?
>>>> You can't reuse the security context. A process in ContainerA
>>>> sends
>>>> a labeled packet to MachineB. ContainerA goes away and its
>>>> context
>>>> is recycled in ContainerC. MachineB responds some time later,
>>>> again
>>>> with a labeled packet. ContainerC gets information intended for
>>>> ContainerA, and uses the information to take over the Elbonian
>>>> government.
>>> Docker isn't using labeled networking (nor is anything else by
>>> default;
>>> it is only enabled if explicitly configured).
>> If labeled networking weren't an issue we'd have full security
>> module stacking by now. Yes, it's an edge case. If you want to
>> use labeled NFS or a local filesystem that gets mounted in each
>> container (don't tell me that nobody would do that) you've got
>> the same problem.
> Even if someone were to configure labeled networking, Docker is not
> presently relying on that or SELinux network enforcement for any
> security properties, so it really doesn't matter.

True enough. I can imagine a use case, but as you point out, it
would be a very complex configuration and coordination exercise
using SELinux.

> And if they wanted
> to do that, they'd have to coordinate category assignments across all
> systems involved, for which no facility exists AFAIK. If you have two
> docker instances running on different hosts, I'd wager that they can
> hand out the same category sets today to different containers.
>
> With respect to labeled NFS, that's also not the default for nfs
> mounts, so again it is a custom configuration and Docker isn't relying
> on it for any guarantees today. For local filesystems, they would
> normally be context-mounted or using genfscon rather than xattrs in
> order to be accessible to the container, thus no persistent storage of
> the category sets.

I know that is the intended configuration, but I see people do
all sorts of stoopid things for what they believe are good reasons.
Unfortunately, lots of people count on containers to provide
isolation, but create "solutions" for data sharing that defeat it.

> Certainly docker could provide an option to not reuse category sets,
> but making that the default is not sane and just guarantees exhaustion
> of the SID and context space (just create and tear down lots of
> containers every day or more frequently).

It seems that Docker might have a similar issue with UIDs,
but it takes longer to run out of UIDs than sidtab entries.

>
>>>>> On the selinux userspace side, we'd also like to eliminate the
>>>>> use
>>>>> of
>>>>> /sys/fs/selinux/user (sel_write_user -> security_get_user_sids)
>>>>> entirely, which is what triggered this for you.
>>>>>
>>>>> We cannot currently delete a sidtab node because we have no way
>>>>> of
>>>>> knowing if there are any lingering references to the
>>>>> SID.  Fixing
>>>>> that
>>>>> would require reference-counted SIDs, which goes beyond just
>>>>> SELinux
>>>>> since SIDs/secids are returned by LSM hooks and cached in other
>>>>> kernel
>>>>> data structures.
>>>> You could delete a sidtab node. The code already deals with
>>>> unfindable
>>>> SIDs. The issue is that eventually you run out of SIDs. Then you
>>>> are
>>>> forced to recycle SIDs, which leads to the overthrow of the
>>>> Elbonian
>>>> government.
>>> We don't know when we can safely delete a sidtab node since SIDs
>>> aren't
>>> reference counted and we can't know whether it is still in use
>>> somewhere in the kernel.  Doing so prematurely would lead to the
>>> SID
>>> being remapped to the unlabeled context, and then likely to
>>> undesired
>>> denials.
>> I would suggest that if you delete a sidtab node and someone
>> comes along later and tries to use it that denial is exactly
>> what you would desire. I don't see any other rational action.
> Yes, if we know that the SID wasn't in use at the time we tore it down.
> But if we're just randomly deleting sidtab entries based on age or
> something (since we have no reference count), we'll almost certainly
> encounter situations where a SID hasn't been accessed in a long time
> but is still being legitimately cached somewhere. Just a file that
> hasn't been accessed in a while might have that SID still cached in its
> inode security blob, or anywhere else.
>
>>>>> sidtab_search_context() could no doubt be optimized for the
>>>>> negative
>>>>> case; there was an earlier optimization for the positive case
>>>>> by
>>>>> adding
>>>>> a cache to sidtab_context_to_sid() prior to calling it.  It's a
>>>>> reverse
>>>>> lookup in the sidtab.
>>>> This seems like a bad idea.
>>> Not sure what you mean, but it can certainly be changed to at least
>>> use
>>> a hash table for these reverse lookups.
>>>
>>>
>

2017-12-14 18:11:51

by Daniel Walsh

Subject: Re: [BUG]kernel softlockup due to sidtab_search_context run for long time because of too many sidtab context node

On 12/14/2017 12:42 PM, Casey Schaufler wrote:
> On 12/14/2017 9:15 AM, Stephen Smalley wrote:
>> On Thu, 2017-12-14 at 09:00 -0800, Casey Schaufler wrote:
>>> On 12/14/2017 8:42 AM, Stephen Smalley wrote:
>>>> On Thu, 2017-12-14 at 08:18 -0800, Casey Schaufler wrote:
>>>>> On 12/13/2017 7:18 AM, Stephen Smalley wrote:
>>>>>> On Wed, 2017-12-13 at 09:25 +0000, yangjihong wrote:
>>>>>>> Hello,
>>>>>>>
>>>>>>> I am doing stressing testing on 3.10 kernel(centos 7.4), to
>>>>>>> constantly starting numbers of docker ontainers with selinux
>>>>>>> enabled,
>>>>>>> and after about 2 days, the kernel softlockup panic:
>>>>>>> <IRQ> [<ffffffff810bb778>] sched_show_task+0xb8/0x120
>>>>>>> [<ffffffff8116133f>] show_lock_info+0x20f/0x3a0
>>>>>>> [<ffffffff811226aa>] watchdog_timer_fn+0x1da/0x2f0
>>>>>>> [<ffffffff811224d0>] ?
>>>>>>> watchdog_enable_all_cpus.part.4+0x40/0x40
>>>>>>> [<ffffffff810abf82>] __hrtimer_run_queues+0xd2/0x260
>>>>>>> [<ffffffff810ac520>] hrtimer_interrupt+0xb0/0x1e0
>>>>>>> [<ffffffff8104a477>] local_apic_timer_interrupt+0x37/0x60
>>>>>>> [<ffffffff8166fd90>] smp_apic_timer_interrupt+0x50/0x140
>>>>>>> [<ffffffff8166e1dd>] apic_timer_interrupt+0x6d/0x80
>>>>>>> <EOI> [<ffffffff812b4193>] ?
>>>>>>> sidtab_context_to_sid+0xb3/0x480
>>>>>>> [<ffffffff812b41f0>] ? sidtab_context_to_sid+0x110/0x480
>>>>>>> [<ffffffff812c0d15>] ? mls_setup_user_range+0x145/0x250
>>>>>>> [<ffffffff812bd477>] security_get_user_sids+0x3f7/0x550
>>>>>>> [<ffffffff812b1a8b>] sel_write_user+0x12b/0x210
>>>>>>> [<ffffffff812b1960>] ? sel_write_member+0x200/0x200
>>>>>>> [<ffffffff812b01d8>] selinux_transaction_write+0x48/0x80
>>>>>>> [<ffffffff811f444d>] vfs_write+0xbd/0x1e0
>>>>>>> [<ffffffff811f4eef>] SyS_write+0x7f/0xe0
>>>>>>> [<ffffffff8166d433>] system_call_fastpath+0x16/0x1b
>>>>>>>
>>>>>>> My opinion:
>>>>>>> when the docker container starts, it would mount overlay
>>>>>>> filesystem
>>>>>>> with different selinux context, mount point such as:
>>>>>>> overlay on
>>>>>>> /var/lib/docker/overlay2/be3ef517730d92fc4530e0e952eae4f6cb0f
>>>>>>> 07b4
>>>>>>> bc32
>>>>>>> 6cb07495ca08fc9ddb66/merged type overlay
>>>>>>> (rw,relatime,context="system_u:object_r:svirt_sandbox_file_t:
>>>>>>> s0:c
>>>>>>> 414,
>>>>>>> c873",lowerdir=/var/lib/docker/overlay2/l/Z4U7WY6ASNV5CFWLADP
>>>>>>> ARHH
>>>>>>> WY7:
>>>>>>> /var/lib/docker/overlay2/l/V2S3HOKEFEOQLHBVAL5WLA3YLS:/var/li
>>>>>>> b/do
>>>>>>> cker
>>>>>>> /overlay2/l/46YGYO474KLOULZGDSZDW2JPRI,upperdir=/var/lib/dock
>>>>>>> er/o
>>>>>>> verl
>>>>>>> ay2/be3ef517730d92fc4530e0e952eae4f6cb0f07b4bc326cb07495ca08f
>>>>>>> c9dd
>>>>>>> b66/
>>>>>>> diff,workdir=/var/lib/docker/overlay2/be3ef517730d92fc4530e0e
>>>>>>> 952e
>>>>>>> ae4f
>>>>>>> 6cb0f07b4bc326cb07495ca08fc9ddb66/work)
>>>>>>> shm on
>>>>>>> /var/lib/docker/containers/9fd65e177d2132011d7b422755793449c9
>>>>>>> 1327
>>>>>>> ca57
>>>>>>> 7b8f5d9d6a4adf218d4876/shm type tmpfs
>>>>>>> (rw,nosuid,nodev,noexec,relatime,context="system_u:object_r:s
>>>>>>> virt
>>>>>>> _san
>>>>>>> dbox_file_t:s0:c414,c873",size=65536k)
>>>>>>> overlay on
>>>>>>> /var/lib/docker/overlay2/38d1544d080145c7d76150530d0255991dfb
>>>>>>> 7258
>>>>>>> cbca
>>>>>>> 14ff6d165b94353eefab/merged type overlay
>>>>>>> (rw,relatime,context="system_u:object_r:svirt_sandbox_file_t:
>>>>>>> s0:c
>>>>>>> 431,
>>>>>>> c651",lowerdir=/var/lib/docker/overlay2/l/3MQQXB4UCLFB7ANVRHP
>>>>>>> AVRC
>>>>>>> RSS:
>>>>>>> /var/lib/docker/overlay2/l/46YGYO474KLOULZGDSZDW2JPRI,upperdi
>>>>>>> r=/v
>>>>>>> ar/l
>>>>>>> ib/docker/overlay2/38d1544d080145c7d76150530d0255991dfb7258cb
>>>>>>> ca14
>>>>>>> ff6d
>>>>>>> 165b94353eefab/diff,workdir=/var/lib/docker/overlay2/38d1544d
>>>>>>> 0801
>>>>>>> 45c7
>>>>>>> d76150530d0255991dfb7258cbca14ff6d165b94353eefab/work)
>>>>>>> shm on
>>>>>>> /var/lib/docker/containers/662e7f798fc08b09eae0f0f944537a4bce
>>>>>>> dc1d
>>>>>>> cf05
>>>>>>> a65866458523ffd4a71614/shm type tmpfs
>>>>>>> (rw,nosuid,nodev,noexec,relatime,context="system_u:object_r:s
>>>>>>> virt
>>>>>>> _san
>>>>>>> dbox_file_t:s0:c431,c651",size=65536k)
>>>>>>>
>>>>>>> sidtab_search_context check the context whether is in the
>>>>>>> sidtab
>>>>>>> list, If not found, a new node is generated and insert into
>>>>>>> the
>>>>>>> list,
>>>>>>> As the number of containers is increasing, context nodes are
>>>>>>> also
>>>>>>> more and more, we tested the final number of nodes reached
>>>>>>> 300,000 +,
>>>>>>> sidtab_context_to_sid runtime needs 100-200ms, which will
>>>>>>> lead to
>>>>>>> the
>>>>>>> system softlockup.
>>>>>>>
>>>>>>> Is this a selinux bug? When filesystem umount, why context
>>>>>>> node
>>>>>>> is
>>>>>>> not deleted? I cannot find the relevant function to delete
>>>>>>> the
>>>>>>> node
>>>>>>> in sidtab.c
>>>>>>>
>>>>>>> Thanks for reading and looking forward to your reply.
>>>>>> So, does docker just keep allocating a unique category set for
>>>>>> every
>>>>>> new container, never reusing them even if the container is
>>>>>> destroyed?
>>>>>> That would be a bug in docker IMHO. Or are you creating an
>>>>>> unbounded
>>>>>> number of containers and never destroying the older ones?
>>>>> You can't reuse the security context. A process in ContainerA
>>>>> sends
>>>>> a labeled packet to MachineB. ContainerA goes away and its
>>>>> context
>>>>> is recycled in ContainerC. MachineB responds some time later,
>>>>> again
>>>>> with a labeled packet. ContainerC gets information intended for
>>>>> ContainerA, and uses the information to take over the Elbonian
>>>>> government.
>>>> Docker isn't using labeled networking (nor is anything else by
>>>> default;
>>>> it is only enabled if explicitly configured).
>>> If labeled networking weren't an issue we'd have full security
>>> module stacking by now. Yes, it's an edge case. If you want to
>>> use labeled NFS or a local filesystem that gets mounted in each
>>> container (don't tell me that nobody would do that) you've got
>>> the same problem.
>> Even if someone were to configure labeled networking, Docker is not
>> presently relying on that or SELinux network enforcement for any
>> security properties, so it really doesn't matter.
> True enough. I can imagine a use case, but as you point out, it
> would be a very complex configuration and coordination exercise
> using SELinux.
>
>> And if they wanted
>> to do that, they'd have to coordinate category assignments across all
>> systems involved, for which no facility exists AFAIK. If you have two
>> docker instances running on different hosts, I'd wager that they can
>> hand out the same category sets today to different containers.
>>
>> With respect to labeled NFS, that's also not the default for nfs
>> mounts, so again it is a custom configuration and Docker isn't relying
>> on it for any guarantees today. For local filesystems, they would
>> normally be context-mounted or using genfscon rather than xattrs in
>> order to be accessible to the container, thus no persistent storage of
>> the category sets.
Well, Kubernetes and OpenShift do set the labels to be the same within a
project, and they can manage that across nodes. But yes, we are not using
labeled networking at this point.
> I know that is the intended configuration, but I see people do
> all sorts of stoopid things for what they believe are good reasons.
> Unfortunately, lots of people count on containers to provide
> isolation, but create "solutions" for data sharing that defeat it.
>
>> Certainly docker could provide an option to not reuse category sets,
>> but making that the default is not sane and just guarantees exhaustion
>> of the SID and context space (just create and tear down lots of
>> containers every day or more frequently).
> It seems that Docker might have a similar issue with UIDs,
> but it takes longer to run out of UIDs than sidtab entries.
>
>>>>>> On the selinux userspace side, we'd also like to eliminate the
>>>>>> use
>>>>>> of
>>>>>> /sys/fs/selinux/user (sel_write_user -> security_get_user_sids)
>>>>>> entirely, which is what triggered this for you.
>>>>>>
>>>>>> We cannot currently delete a sidtab node because we have no way
>>>>>> of
>>>>>> knowing if there are any lingering references to the
>>>>>> SID. Fixing
>>>>>> that
>>>>>> would require reference-counted SIDs, which goes beyond just
>>>>>> SELinux
>>>>>> since SIDs/secids are returned by LSM hooks and cached in other
>>>>>> kernel
>>>>>> data structures.
>>>>> You could delete a sidtab node. The code already deals with
>>>>> unfindable
>>>>> SIDs. The issue is that eventually you run out of SIDs. Then you
>>>>> are
>>>>> forced to recycle SIDs, which leads to the overthrow of the
>>>>> Elbonian
>>>>> government.
>>>> We don't know when we can safely delete a sidtab node since SIDs
>>>> aren't
>>>> reference counted and we can't know whether it is still in use
>>>> somewhere in the kernel. Doing so prematurely would lead to the
>>>> SID
>>>> being remapped to the unlabeled context, and then likely to
>>>> undesired
>>>> denials.
>>> I would suggest that if you delete a sidtab node and someone
>>> comes along later and tries to use it that denial is exactly
>>> what you would desire. I don't see any other rational action.
>> Yes, if we know that the SID wasn't in use at the time we tore it down.
>> But if we're just randomly deleting sidtab entries based on age or
>> something (since we have no reference count), we'll almost certainly
>> encounter situations where a SID hasn't been accessed in a long time
>> but is still being legitimately cached somewhere. Just a file that
>> hasn't been accessed in a while might have that SID still cached in its
>> inode security blob, or anywhere else.
>>
>>>>>> sidtab_search_context() could no doubt be optimized for the
>>>>>> negative
>>>>>> case; there was an earlier optimization for the positive case
>>>>>> by
>>>>>> adding
>>>>>> a cache to sidtab_context_to_sid() prior to calling it. It's a
>>>>>> reverse
>>>>>> lookup in the sidtab.
>>>>> This seems like a bad idea.
>>>> Not sure what you mean, but it can certainly be changed to at least
>>>> use
>>>> a hash table for these reverse lookups.
>>>>
>>>>
>
>
>

2017-12-15 03:09:37

by Yang Jihong

[permalink] [raw]
Subject: Re: [BUG]kernel softlockup due to sidtab_search_context run for long time because of too many sidtab context node

On 12/15/2017 10:31 PM, yangjihong wrote:
>On 12/14/2017 12:42 PM, Casey Schaufler wrote:
>> On 12/14/2017 9:15 AM, Stephen Smalley wrote:
>>> On Thu, 2017-12-14 at 09:00 -0800, Casey Schaufler wrote:
>>>> On 12/14/2017 8:42 AM, Stephen Smalley wrote:
>>>>> On Thu, 2017-12-14 at 08:18 -0800, Casey Schaufler wrote:
>>>>>> On 12/13/2017 7:18 AM, Stephen Smalley wrote:
>>>>>>> On Wed, 2017-12-13 at 09:25 +0000, yangjihong wrote:
>>>>>>>> Hello,
>>>>>>>>
>>>>>>>> I am doing stressing testing on 3.10 kernel(centos 7.4), to
>>>>>>>> constantly starting numbers of docker ontainers with selinux
>>>>>>>> enabled, and after about 2 days, the kernel softlockup panic:
>>>>>>>> <IRQ> [<ffffffff810bb778>] sched_show_task+0xb8/0x120
>>>>>>>> [<ffffffff8116133f>] show_lock_info+0x20f/0x3a0
>>>>>>>> [<ffffffff811226aa>] watchdog_timer_fn+0x1da/0x2f0
>>>>>>>> [<ffffffff811224d0>] ?
>>>>>>>> watchdog_enable_all_cpus.part.4+0x40/0x40
>>>>>>>> [<ffffffff810abf82>] __hrtimer_run_queues+0xd2/0x260
>>>>>>>> [<ffffffff810ac520>] hrtimer_interrupt+0xb0/0x1e0
>>>>>>>> [<ffffffff8104a477>] local_apic_timer_interrupt+0x37/0x60
>>>>>>>> [<ffffffff8166fd90>] smp_apic_timer_interrupt+0x50/0x140
>>>>>>>> [<ffffffff8166e1dd>] apic_timer_interrupt+0x6d/0x80
>>>>>>>> <EOI> [<ffffffff812b4193>] ?
>>>>>>>> sidtab_context_to_sid+0xb3/0x480
>>>>>>>> [<ffffffff812b41f0>] ? sidtab_context_to_sid+0x110/0x480
>>>>>>>> [<ffffffff812c0d15>] ? mls_setup_user_range+0x145/0x250
>>>>>>>> [<ffffffff812bd477>] security_get_user_sids+0x3f7/0x550
>>>>>>>> [<ffffffff812b1a8b>] sel_write_user+0x12b/0x210
>>>>>>>> [<ffffffff812b1960>] ? sel_write_member+0x200/0x200
>>>>>>>> [<ffffffff812b01d8>] selinux_transaction_write+0x48/0x80
>>>>>>>> [<ffffffff811f444d>] vfs_write+0xbd/0x1e0
>>>>>>>> [<ffffffff811f4eef>] SyS_write+0x7f/0xe0
>>>>>>>> [<ffffffff8166d433>] system_call_fastpath+0x16/0x1b
>>>>>>>>
>>>>>>>> My opinion:
>>>>>>>> when the docker container starts, it would mount overlay
>>>>>>>> filesystem with different selinux context, mount point such as:
>>>>>>>> overlay on
>>>>>>>> /var/lib/docker/overlay2/be3ef517730d92fc4530e0e952eae4f6cb0f
>>>>>>>> 07b4
>>>>>>>> bc32
>>>>>>>> 6cb07495ca08fc9ddb66/merged type overlay
>>>>>>>> (rw,relatime,context="system_u:object_r:svirt_sandbox_file_t:
>>>>>>>> s0:c
>>>>>>>> 414,
>>>>>>>> c873",lowerdir=/var/lib/docker/overlay2/l/Z4U7WY6ASNV5CFWLADP
>>>>>>>> ARHH
>>>>>>>> WY7:
>>>>>>>> /var/lib/docker/overlay2/l/V2S3HOKEFEOQLHBVAL5WLA3YLS:/var/li
>>>>>>>> b/do
>>>>>>>> cker
>>>>>>>> /overlay2/l/46YGYO474KLOULZGDSZDW2JPRI,upperdir=/var/lib/dock
>>>>>>>> er/o
>>>>>>>> verl
>>>>>>>> ay2/be3ef517730d92fc4530e0e952eae4f6cb0f07b4bc326cb07495ca08f
>>>>>>>> c9dd
>>>>>>>> b66/
>>>>>>>> diff,workdir=/var/lib/docker/overlay2/be3ef517730d92fc4530e0e
>>>>>>>> 952e
>>>>>>>> ae4f
>>>>>>>> 6cb0f07b4bc326cb07495ca08fc9ddb66/work)
>>>>>>>> shm on
>>>>>>>> /var/lib/docker/containers/9fd65e177d2132011d7b422755793449c9
>>>>>>>> 1327
>>>>>>>> ca57
>>>>>>>> 7b8f5d9d6a4adf218d4876/shm type tmpfs
>>>>>>>> (rw,nosuid,nodev,noexec,relatime,context="system_u:object_r:s
>>>>>>>> virt
>>>>>>>> _san
>>>>>>>> dbox_file_t:s0:c414,c873",size=65536k)
>>>>>>>> overlay on
>>>>>>>> /var/lib/docker/overlay2/38d1544d080145c7d76150530d0255991dfb
>>>>>>>> 7258
>>>>>>>> cbca
>>>>>>>> 14ff6d165b94353eefab/merged type overlay
>>>>>>>> (rw,relatime,context="system_u:object_r:svirt_sandbox_file_t:
>>>>>>>> s0:c
>>>>>>>> 431,
>>>>>>>> c651",lowerdir=/var/lib/docker/overlay2/l/3MQQXB4UCLFB7ANVRHP
>>>>>>>> AVRC
>>>>>>>> RSS:
>>>>>>>> /var/lib/docker/overlay2/l/46YGYO474KLOULZGDSZDW2JPRI,upperdi
>>>>>>>> r=/v
>>>>>>>> ar/l
>>>>>>>> ib/docker/overlay2/38d1544d080145c7d76150530d0255991dfb7258cb
>>>>>>>> ca14
>>>>>>>> ff6d
>>>>>>>> 165b94353eefab/diff,workdir=/var/lib/docker/overlay2/38d1544d
>>>>>>>> 0801
>>>>>>>> 45c7
>>>>>>>> d76150530d0255991dfb7258cbca14ff6d165b94353eefab/work)
>>>>>>>> shm on
>>>>>>>> /var/lib/docker/containers/662e7f798fc08b09eae0f0f944537a4bce
>>>>>>>> dc1d
>>>>>>>> cf05
>>>>>>>> a65866458523ffd4a71614/shm type tmpfs
>>>>>>>> (rw,nosuid,nodev,noexec,relatime,context="system_u:object_r:s
>>>>>>>> virt
>>>>>>>> _san
>>>>>>>> dbox_file_t:s0:c431,c651",size=65536k)
>>>>>>>>
>>>>>>>> sidtab_search_context check the context whether is in the sidtab
>>>>>>>> list, If not found, a new node is generated and insert into the
>>>>>>>> list, As the number of containers is increasing, context nodes
>>>>>>>> are also more and more, we tested the final number of nodes
>>>>>>>> reached
>>>>>>>> 300,000 +,
>>>>>>>> sidtab_context_to_sid runtime needs 100-200ms, which will lead
>>>>>>>> to the system softlockup.
>>>>>>>>
>>>>>>>> Is this a selinux bug? When filesystem umount, why context node
>>>>>>>> is not deleted? I cannot find the relevant function to delete
>>>>>>>> the node in sidtab.c
>>>>>>>>
>>>>>>>> Thanks for reading and looking forward to your reply.
>>>>>>> So, does docker just keep allocating a unique category set for
>>>>>>> every new container, never reusing them even if the container is
>>>>>>> destroyed?
>>>>>>> That would be a bug in docker IMHO. Or are you creating an
>>>>>>> unbounded number of containers and never destroying the older
>>>>>>> ones?
>>>>>> You can't reuse the security context. A process in ContainerA
>>>>>> sends a labeled packet to MachineB. ContainerA goes away and its
>>>>>> context is recycled in ContainerC. MachineB responds some time
>>>>>> later, again with a labeled packet. ContainerC gets information
>>>>>> intended for ContainerA, and uses the information to take over the
>>>>>> Elbonian government.
>>>>> Docker isn't using labeled networking (nor is anything else by
>>>>> default; it is only enabled if explicitly configured).
>>>> If labeled networking weren't an issue we'd have full security
>>>> module stacking by now. Yes, it's an edge case. If you want to use
>>>> labeled NFS or a local filesystem that gets mounted in each
>>>> container (don't tell me that nobody would do that) you've got the
>>>> same problem.
>>> Even if someone were to configure labeled networking, Docker is not
>>> presently relying on that or SELinux network enforcement for any
>>> security properties, so it really doesn't matter.
>> True enough. I can imagine a use case, but as you point out, it would
>> be a very complex configuration and coordination exercise using
>> SELinux.
>>
>>> And if they wanted
>>> to do that, they'd have to coordinate category assignments across all
>>> systems involved, for which no facility exists AFAIK. If you have
>>> two docker instances running on different hosts, I'd wager that they
>>> can hand out the same category sets today to different containers.
>>>
>>> With respect to labeled NFS, that's also not the default for nfs
>>> mounts, so again it is a custom configuration and Docker isn't
>>> relying on it for any guarantees today. For local filesystems, they
>>> would normally be context-mounted or using genfscon rather than
>>> xattrs in order to be accessible to the container, thus no persistent
>>> storage of the category sets.
>Well Kubernetes and OpenShift do set the labels to be the same within a project, and they can manage across nodes. But yes we are not using labeled networking at this point.
>> I know that is the intended configuration, but I see people do all
>> sorts of stoopid things for what they believe are good reasons.
>> Unfortunately, lots of people count on containers to provide
>> isolation, but create "solutions" for data sharing that defeat it.
>>
>>> Certainly docker could provide an option to not reuse category sets,
>>> but making that the default is not sane and just guarantees
>>> exhaustion of the SID and context space (just create and tear down
>>> lots of containers every day or more frequently).
>> It seems that Docker might have a similar issue with UIDs, but it
>> takes longer to run out of UIDs than sidtab entries.
>>
>>>>>>> On the selinux userspace side, we'd also like to eliminate the
>>>>>>> use of /sys/fs/selinux/user (sel_write_user ->
>>>>>>> security_get_user_sids) entirely, which is what triggered this
>>>>>>> for you.
>>>>>>>
>>>>>>> We cannot currently delete a sidtab node because we have no way
>>>>>>> of knowing if there are any lingering references to the SID.
>>>>>>> Fixing that would require reference-counted SIDs, which goes
>>>>>>> beyond just SELinux since SIDs/secids are returned by LSM hooks
>>>>>>> and cached in other kernel data structures.
>>>>>> You could delete a sidtab node. The code already deals with
>>>>>> unfindable SIDs. The issue is that eventually you run out of SIDs.
>>>>>> Then you are forced to recycle SIDs, which leads to the overthrow
>>>>>> of the Elbonian government.
>>>>> We don't know when we can safely delete a sidtab node since SIDs
>>>>> aren't reference counted and we can't know whether it is still in
>>>>> use somewhere in the kernel. Doing so prematurely would lead to
>>>>> the SID being remapped to the unlabeled context, and then likely to
>>>>> undesired denials.
>>>> I would suggest that if you delete a sidtab node and someone comes
>>>> along later and tries to use it that denial is exactly what you
>>>> would desire. I don't see any other rational action.
>>> Yes, if we know that the SID wasn't in use at the time we tore it down.
>>> But if we're just randomly deleting sidtab entries based on age or
>>> something (since we have no reference count), we'll almost certainly
>>> encounter situations where a SID hasn't been accessed in a long time
>>> but is still being legitimately cached somewhere. Just a file that
>>> hasn't been accessed in a while might have that SID still cached in
>>> its inode security blob, or anywhere else.
>>>
>>>>>>> sidtab_search_context() could no doubt be optimized for the
>>>>>>> negative case; there was an earlier optimization for the positive
>>>>>>> case by adding a cache to sidtab_context_to_sid() prior to
>>>>>>> calling it. It's a reverse lookup in the sidtab.
>>>>>> This seems like a bad idea.
>>>>> Not sure what you mean, but it can certainly be changed to at least
>>>>> use a hash table for these reverse lookups.
>>>>>
>>>>>
>>
>>
>>
Thanks for the reply and discussion.
I think the docker container is only one case. Is it possible that there is a similar path where, through some means of attack, the SID list is made to grow constantly, eventually leading to a system panic?

I think the issue is that it takes too long to search for a SID node when the SID list is too large.
If the node's data structure (e.g. a tree structure) or the search algorithm could be optimized so that traversing all nodes takes very little time even with many nodes, maybe that would solve the problem.
Or, sidtab.c could provide a "delete_sidtab_node" interface: when a filesystem is unmounted, delete the corresponding SID node. Because once the fs is unmounted, the SID is useless, so deleting it would keep the size of the SID list under control.
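
To make the cost concrete, below is a minimal sketch in C of the kind of
linear reverse lookup being described (simplified types and names, purely
illustrative; not the actual sidtab.c code): every attempt to map a
context back to a SID has to compare against every existing node before
concluding the context is new, so with 300,000+ nodes each miss pays for
300,000+ context comparisons, which is consistent with the observed
100-200ms per call.

#include <string.h>

/* Illustrative stand-in for a sidtab node; the real kernel structure
 * stores a parsed context, not a string. */
struct ctx_node {
        struct ctx_node *next;
        unsigned int sid;
        char context[256];
};

/* Linear reverse lookup: returns the SID for 'context', or 0 if absent.
 * With N nodes this is O(N) comparisons on every miss, and a miss is
 * exactly what happens when each new container brings a fresh
 * category pair. */
static unsigned int reverse_lookup(struct ctx_node *head, const char *context)
{
        struct ctx_node *n;

        for (n = head; n; n = n->next)
                if (strcmp(n->context, context) == 0)
                        return n->sid;
        return 0;
}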

Thanks for reading and looking forward to your reply.

Best wishes!

2017-12-15 13:55:59

by Stephen Smalley

[permalink] [raw]
Subject: Re: [BUG]kernel softlockup due to sidtab_search_context run for long time because of too many sidtab context node

On Fri, 2017-12-15 at 03:09 +0000, yangjihong wrote:
> On 12/15/2017 10:31 PM, yangjihong wrote:
> > On 12/14/2017 12:42 PM, Casey Schaufler wrote:
> > > On 12/14/2017 9:15 AM, Stephen Smalley wrote:
> > > > On Thu, 2017-12-14 at 09:00 -0800, Casey Schaufler wrote:
> > > > > On 12/14/2017 8:42 AM, Stephen Smalley wrote:
> > > > > > On Thu, 2017-12-14 at 08:18 -0800, Casey Schaufler wrote:
> > > > > > > On 12/13/2017 7:18 AM, Stephen Smalley wrote:
> > > > > > > > On Wed, 2017-12-13 at 09:25 +0000, yangjihong wrote:
> > > > > > > > > Hello,
> > > > > > > > >
> > > > > > > > > I am doing stressing testing on 3.10 kernel(centos
> > > > > > > > > 7.4), to 
> > > > > > > > > constantly starting numbers of docker ontainers with
> > > > > > > > > selinux 
> > > > > > > > > enabled, and after about 2 days, the kernel
> > > > > > > > > softlockup panic:
> > > > > > > > >   <IRQ>  [<ffffffff810bb778>]
> > > > > > > > > sched_show_task+0xb8/0x120
> > > > > > > > >   [<ffffffff8116133f>] show_lock_info+0x20f/0x3a0
> > > > > > > > >   [<ffffffff811226aa>] watchdog_timer_fn+0x1da/0x2f0
> > > > > > > > >   [<ffffffff811224d0>] ?
> > > > > > > > > watchdog_enable_all_cpus.part.4+0x40/0x40
> > > > > > > > >   [<ffffffff810abf82>]
> > > > > > > > > __hrtimer_run_queues+0xd2/0x260
> > > > > > > > >   [<ffffffff810ac520>] hrtimer_interrupt+0xb0/0x1e0
> > > > > > > > >   [<ffffffff8104a477>]
> > > > > > > > > local_apic_timer_interrupt+0x37/0x60
> > > > > > > > >   [<ffffffff8166fd90>]
> > > > > > > > > smp_apic_timer_interrupt+0x50/0x140
> > > > > > > > >   [<ffffffff8166e1dd>] apic_timer_interrupt+0x6d/0x80
> > > > > > > > >   <EOI>  [<ffffffff812b4193>] ?
> > > > > > > > > sidtab_context_to_sid+0xb3/0x480
> > > > > > > > >   [<ffffffff812b41f0>] ?
> > > > > > > > > sidtab_context_to_sid+0x110/0x480
> > > > > > > > >   [<ffffffff812c0d15>] ?
> > > > > > > > > mls_setup_user_range+0x145/0x250
> > > > > > > > >   [<ffffffff812bd477>]
> > > > > > > > > security_get_user_sids+0x3f7/0x550
> > > > > > > > >   [<ffffffff812b1a8b>] sel_write_user+0x12b/0x210
> > > > > > > > >   [<ffffffff812b1960>] ? sel_write_member+0x200/0x200
> > > > > > > > >   [<ffffffff812b01d8>]
> > > > > > > > > selinux_transaction_write+0x48/0x80
> > > > > > > > >   [<ffffffff811f444d>] vfs_write+0xbd/0x1e0
> > > > > > > > >   [<ffffffff811f4eef>] SyS_write+0x7f/0xe0
> > > > > > > > >   [<ffffffff8166d433>] system_call_fastpath+0x16/0x1b
> > > > > > > > >
> > > > > > > > > My opinion:
> > > > > > > > > when the docker container starts, it would mount
> > > > > > > > > overlay 
> > > > > > > > > filesystem with different selinux context, mount
> > > > > > > > > point such as:
> > > > > > > > > overlay on
> > > > > > > > > /var/lib/docker/overlay2/be3ef517730d92fc4530e0e952ea
> > > > > > > > > e4f6cb0f
> > > > > > > > > 07b4
> > > > > > > > > bc32
> > > > > > > > > 6cb07495ca08fc9ddb66/merged type overlay
> > > > > > > > > (rw,relatime,context="system_u:object_r:svirt_sandbox
> > > > > > > > > _file_t:
> > > > > > > > > s0:c
> > > > > > > > > 414,
> > > > > > > > > c873",lowerdir=/var/lib/docker/overlay2/l/Z4U7WY6ASNV
> > > > > > > > > 5CFWLADP
> > > > > > > > > ARHH
> > > > > > > > > WY7:
> > > > > > > > > /var/lib/docker/overlay2/l/V2S3HOKEFEOQLHBVAL5WLA3YLS
> > > > > > > > > :/var/li
> > > > > > > > > b/do
> > > > > > > > > cker
> > > > > > > > > /overlay2/l/46YGYO474KLOULZGDSZDW2JPRI,upperdir=/var/
> > > > > > > > > lib/dock
> > > > > > > > > er/o
> > > > > > > > > verl
> > > > > > > > > ay2/be3ef517730d92fc4530e0e952eae4f6cb0f07b4bc326cb07
> > > > > > > > > 495ca08f
> > > > > > > > > c9dd
> > > > > > > > > b66/
> > > > > > > > > diff,workdir=/var/lib/docker/overlay2/be3ef517730d92f
> > > > > > > > > c4530e0e
> > > > > > > > > 952e
> > > > > > > > > ae4f
> > > > > > > > > 6cb0f07b4bc326cb07495ca08fc9ddb66/work)
> > > > > > > > > shm on
> > > > > > > > > /var/lib/docker/containers/9fd65e177d2132011d7b422755
> > > > > > > > > 793449c9
> > > > > > > > > 1327
> > > > > > > > > ca57
> > > > > > > > > 7b8f5d9d6a4adf218d4876/shm type tmpfs 
> > > > > > > > > (rw,nosuid,nodev,noexec,relatime,context="system_u:ob
> > > > > > > > > ject_r:s
> > > > > > > > > virt
> > > > > > > > > _san
> > > > > > > > > dbox_file_t:s0:c414,c873",size=65536k)
> > > > > > > > > overlay on
> > > > > > > > > /var/lib/docker/overlay2/38d1544d080145c7d76150530d02
> > > > > > > > > 55991dfb
> > > > > > > > > 7258
> > > > > > > > > cbca
> > > > > > > > > 14ff6d165b94353eefab/merged type overlay
> > > > > > > > > (rw,relatime,context="system_u:object_r:svirt_sandbox
> > > > > > > > > _file_t:
> > > > > > > > > s0:c
> > > > > > > > > 431,
> > > > > > > > > c651",lowerdir=/var/lib/docker/overlay2/l/3MQQXB4UCLF
> > > > > > > > > B7ANVRHP
> > > > > > > > > AVRC
> > > > > > > > > RSS:
> > > > > > > > > /var/lib/docker/overlay2/l/46YGYO474KLOULZGDSZDW2JPRI
> > > > > > > > > ,upperdi
> > > > > > > > > r=/v
> > > > > > > > > ar/l
> > > > > > > > > ib/docker/overlay2/38d1544d080145c7d76150530d0255991d
> > > > > > > > > fb7258cb
> > > > > > > > > ca14
> > > > > > > > > ff6d
> > > > > > > > > 165b94353eefab/diff,workdir=/var/lib/docker/overlay2/
> > > > > > > > > 38d1544d
> > > > > > > > > 0801
> > > > > > > > > 45c7
> > > > > > > > > d76150530d0255991dfb7258cbca14ff6d165b94353eefab/work
> > > > > > > > > )
> > > > > > > > > shm on
> > > > > > > > > /var/lib/docker/containers/662e7f798fc08b09eae0f0f944
> > > > > > > > > 537a4bce
> > > > > > > > > dc1d
> > > > > > > > > cf05
> > > > > > > > > a65866458523ffd4a71614/shm type tmpfs 
> > > > > > > > > (rw,nosuid,nodev,noexec,relatime,context="system_u:ob
> > > > > > > > > ject_r:s
> > > > > > > > > virt
> > > > > > > > > _san
> > > > > > > > > dbox_file_t:s0:c431,c651",size=65536k)
> > > > > > > > >
> > > > > > > > > sidtab_search_context check the context whether is in
> > > > > > > > > the sidtab 
> > > > > > > > > list, If not found, a new node is generated and
> > > > > > > > > insert into the 
> > > > > > > > > list, As the number of containers is
> > > > > > > > > increasing,  context nodes 
> > > > > > > > > are also more and more, we tested the final number of
> > > > > > > > > nodes 
> > > > > > > > > reached
> > > > > > > > > 300,000 +,
> > > > > > > > > sidtab_context_to_sid runtime needs 100-200ms, which
> > > > > > > > > will lead 
> > > > > > > > > to the system softlockup.
> > > > > > > > >
> > > > > > > > > Is this a selinux bug? When filesystem umount, why
> > > > > > > > > context node 
> > > > > > > > > is not deleted?  I cannot find the relevant function
> > > > > > > > > to delete 
> > > > > > > > > the node in sidtab.c
> > > > > > > > >
> > > > > > > > > Thanks for reading and looking forward to your reply.
> > > > > > > >
> > > > > > > > So, does docker just keep allocating a unique category
> > > > > > > > set for 
> > > > > > > > every new container, never reusing them even if the
> > > > > > > > container is 
> > > > > > > > destroyed?
> > > > > > > > That would be a bug in docker IMHO.  Or are you
> > > > > > > > creating an 
> > > > > > > > unbounded number of containers and never destroying the
> > > > > > > > older 
> > > > > > > > ones?
> > > > > > >
> > > > > > > You can't reuse the security context. A process in
> > > > > > > ContainerA 
> > > > > > > sends a labeled packet to MachineB. ContainerA goes away
> > > > > > > and its 
> > > > > > > context is recycled in ContainerC. MachineB responds some
> > > > > > > time 
> > > > > > > later, again with a labeled packet. ContainerC gets
> > > > > > > information 
> > > > > > > intended for ContainerA, and uses the information to take
> > > > > > > over the 
> > > > > > > Elbonian government.
> > > > > >
> > > > > > Docker isn't using labeled networking (nor is anything else
> > > > > > by 
> > > > > > default; it is only enabled if explicitly configured).
> > > > >
> > > > > If labeled networking weren't an issue we'd have full
> > > > > security 
> > > > > module stacking by now. Yes, it's an edge case. If you want
> > > > > to use 
> > > > > labeled NFS or a local filesystem that gets mounted in each 
> > > > > container (don't tell me that nobody would do that) you've
> > > > > got the 
> > > > > same problem.
> > > >
> > > > Even if someone were to configure labeled networking, Docker is
> > > > not 
> > > > presently relying on that or SELinux network enforcement for
> > > > any 
> > > > security properties, so it really doesn't matter.
> > >
> > > True enough. I can imagine a use case, but as you point out, it
> > > would 
> > > be a very complex configuration and coordination exercise using 
> > > SELinux.
> > >
> > > > And if they wanted
> > > > to do that, they'd have to coordinate category assignments
> > > > across all 
> > > > systems involved, for which no facility exists AFAIK.  If you
> > > > have 
> > > > two docker instances running on different hosts, I'd wager that
> > > > they 
> > > > can hand out the same category sets today to different
> > > > containers.
> > > >
> > > > With respect to labeled NFS, that's also not the default for
> > > > nfs 
> > > > mounts, so again it is a custom configuration and Docker isn't 
> > > > relying on it for any guarantees today.  For local filesystems,
> > > > they 
> > > > would normally be context-mounted or using genfscon rather
> > > > than 
> > > > xattrs in order to be accessible to the container, thus no
> > > > persistent 
> > > > storage of the category sets.
> >
> > Well Kubernetes and OpenShift do set the labels to be the same
> > within a project, and they can manage across nodes.  But yes we are
> > not using labeled networking at this point.
> > > I know that is the intended configuration, but I see people do
> > > all 
> > > sorts of stoopid things for what they believe are good reasons.
> > > Unfortunately, lots of people count on containers to provide 
> > > isolation, but create "solutions" for data sharing that defeat
> > > it.
> > >
> > > > Certainly docker could provide an option to not reuse category
> > > > sets, 
> > > > but making that the default is not sane and just guarantees 
> > > > exhaustion of the SID and context space (just create and tear
> > > > down 
> > > > lots of containers every day or more frequently).
> > >
> > > It seems that Docker might have a similar issue with UIDs, but
> > > it 
> > > takes longer to run out of UIDs than sidtab entries.
> > >
> > > > > > > > On the selinux userspace side, we'd also like to
> > > > > > > > eliminate the 
> > > > > > > > use of /sys/fs/selinux/user (sel_write_user -> 
> > > > > > > > security_get_user_sids) entirely, which is what
> > > > > > > > triggered this 
> > > > > > > > for you.
> > > > > > > >
> > > > > > > > We cannot currently delete a sidtab node because we
> > > > > > > > have no way 
> > > > > > > > of knowing if there are any lingering references to the
> > > > > > > > SID.  
> > > > > > > > Fixing that would require reference-counted SIDs, which
> > > > > > > > goes 
> > > > > > > > beyond just SELinux since SIDs/secids are returned by
> > > > > > > > LSM hooks 
> > > > > > > > and cached in other kernel data structures.
> > > > > > >
> > > > > > > You could delete a sidtab node. The code already deals
> > > > > > > with 
> > > > > > > unfindable SIDs. The issue is that eventually you run out
> > > > > > > of SIDs. 
> > > > > > > Then you are forced to recycle SIDs, which leads to the
> > > > > > > overthrow 
> > > > > > > of the Elbonian government.
> > > > > >
> > > > > > We don't know when we can safely delete a sidtab node since
> > > > > > SIDs 
> > > > > > aren't reference counted and we can't know whether it is
> > > > > > still in 
> > > > > > use somewhere in the kernel.  Doing so prematurely would
> > > > > > lead to 
> > > > > > the SID being remapped to the unlabeled context, and then
> > > > > > likely to 
> > > > > > undesired denials.
> > > > >
> > > > > I would suggest that if you delete a sidtab node and someone
> > > > > comes 
> > > > > along later and tries to use it that denial is exactly what
> > > > > you 
> > > > > would desire. I don't see any other rational action.
> > > >
> > > > Yes, if we know that the SID wasn't in use at the time we tore
> > > > it down.
> > > >   But if we're just randomly deleting sidtab entries based on
> > > > age or 
> > > > something (since we have no reference count), we'll almost
> > > > certainly 
> > > > encounter situations where a SID hasn't been accessed in a long
> > > > time 
> > > > but is still being legitimately cached somewhere.  Just a file
> > > > that 
> > > > hasn't been accessed in a while might have that SID still
> > > > cached in 
> > > > its inode security blob, or anywhere else.
> > > >
> > > > > > > > sidtab_search_context() could no doubt be optimized for
> > > > > > > > the 
> > > > > > > > negative case; there was an earlier optimization for
> > > > > > > > the positive 
> > > > > > > > case by adding a cache to sidtab_context_to_sid() prior
> > > > > > > > to 
> > > > > > > > calling it.  It's a reverse lookup in the sidtab.
> > > > > > >
> > > > > > > This seems like a bad idea.
> > > > > >
> > > > > > Not sure what you mean, but it can certainly be changed to
> > > > > > at least 
> > > > > > use a hash table for these reverse lookups.
> > > > > >
> > > > > >
> > >
> > >
> > >
>
> Thanks for reply and discussion.
> I think docker container is only a case, Is it possible there is a
> similar way, through some means of attack, triggered a constantly
> increasing of  SIDs list, eventually leading to the system panic?
>
> I think the issue is that is takes too long to search SID node when
> SIDs list too large, 
> If can optimize the node's data structure(ie : tree structure) or
> search algorithm to ensure that traversing all nodes can be very
> short time even in many nodes, maybe it can solve the problem.
> Or, in sidtab.c provides "delete_sidtab_node" interface, when umount
> fs, delete the SID node. Because when fs is umounted, the SID is
> useless, could delete it to control the size of SIDs list.
>
> Thanks for reading and looking forward to your reply.

We cannot safely delete entries in the sidtab without first adding
reference counting of SIDs, which goes beyond just SELinux since they
are cached in other kernel data structures and returned by LSM hooks.
That's a non-trivial undertaking.

Far more practical in the near term would be to introduce a hash table
or other mechanism for efficient reverse lookups in the sidtab. Are
you offering to implement that or just requesting it?

Independent of that, docker should support reuse of category sets when
containers are deleted, at least as an option and probably as the
default.
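
As a rough illustration of that direction (simplified, userspace-style C
with made-up names and sizes; the real sidtab would hash the parsed
context structure and would need proper locking), a reverse-lookup hash
table keyed on the context might look like this:

#include <stdlib.h>
#include <string.h>

#define REV_BUCKETS 4096                /* illustrative bucket count */

struct rev_node {
        struct rev_node *next;
        unsigned int sid;
        char *context;
};

static struct rev_node *rev_table[REV_BUCKETS];

/* Toy string hash over the context text. */
static unsigned int ctx_hash(const char *context)
{
        unsigned int h = 0;

        while (*context)
                h = h * 31 + (unsigned char)*context++;
        return h % REV_BUCKETS;
}

/* Average-case O(1) reverse lookup: only one bucket chain is scanned,
 * instead of every node in the sidtab. */
static unsigned int rev_lookup(const char *context)
{
        struct rev_node *n;

        for (n = rev_table[ctx_hash(context)]; n; n = n->next)
                if (strcmp(n->context, context) == 0)
                        return n->sid;
        return 0;
}

/* Must be called wherever a new SID is allocated so the forward
 * (SID -> context) and reverse (context -> SID) views stay in sync. */
static int rev_insert(unsigned int sid, const char *context)
{
        unsigned int b = ctx_hash(context);
        struct rev_node *n = malloc(sizeof(*n));

        if (!n)
                return -1;
        n->context = strdup(context);
        if (!n->context) {
                free(n);
                return -1;
        }
        n->sid = sid;
        n->next = rev_table[b];
        rev_table[b] = n;
        return 0;
}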

2017-12-15 14:50:53

by Daniel Walsh

[permalink] [raw]
Subject: Re: [BUG]kernel softlockup due to sidtab_search_context run for long time because of too many sidtab context node

On 12/15/2017 08:56 AM, Stephen Smalley wrote:
> On Fri, 2017-12-15 at 03:09 +0000, yangjihong wrote:
>> On 12/15/2017 10:31 PM, yangjihong wrote:
>>> On 12/14/2017 12:42 PM, Casey Schaufler wrote:
>>>> On 12/14/2017 9:15 AM, Stephen Smalley wrote:
>>>>> On Thu, 2017-12-14 at 09:00 -0800, Casey Schaufler wrote:
>>>>>> On 12/14/2017 8:42 AM, Stephen Smalley wrote:
>>>>>>> On Thu, 2017-12-14 at 08:18 -0800, Casey Schaufler wrote:
>>>>>>>> On 12/13/2017 7:18 AM, Stephen Smalley wrote:
>>>>>>>>> On Wed, 2017-12-13 at 09:25 +0000, yangjihong wrote:
>>>>>>>>>> Hello,
>>>>>>>>>>
>>>>>>>>>> I am doing stressing testing on 3.10 kernel(centos
>>>>>>>>>> 7.4), to
>>>>>>>>>> constantly starting numbers of docker ontainers with
>>>>>>>>>> selinux
>>>>>>>>>> enabled, and after about 2 days, the kernel
>>>>>>>>>> softlockup panic:
>>>>>>>>>> <IRQ> [<ffffffff810bb778>]
>>>>>>>>>> sched_show_task+0xb8/0x120
>>>>>>>>>> [<ffffffff8116133f>] show_lock_info+0x20f/0x3a0
>>>>>>>>>> [<ffffffff811226aa>] watchdog_timer_fn+0x1da/0x2f0
>>>>>>>>>> [<ffffffff811224d0>] ?
>>>>>>>>>> watchdog_enable_all_cpus.part.4+0x40/0x40
>>>>>>>>>> [<ffffffff810abf82>]
>>>>>>>>>> __hrtimer_run_queues+0xd2/0x260
>>>>>>>>>> [<ffffffff810ac520>] hrtimer_interrupt+0xb0/0x1e0
>>>>>>>>>> [<ffffffff8104a477>]
>>>>>>>>>> local_apic_timer_interrupt+0x37/0x60
>>>>>>>>>> [<ffffffff8166fd90>]
>>>>>>>>>> smp_apic_timer_interrupt+0x50/0x140
>>>>>>>>>> [<ffffffff8166e1dd>] apic_timer_interrupt+0x6d/0x80
>>>>>>>>>> <EOI> [<ffffffff812b4193>] ?
>>>>>>>>>> sidtab_context_to_sid+0xb3/0x480
>>>>>>>>>> [<ffffffff812b41f0>] ?
>>>>>>>>>> sidtab_context_to_sid+0x110/0x480
>>>>>>>>>> [<ffffffff812c0d15>] ?
>>>>>>>>>> mls_setup_user_range+0x145/0x250
>>>>>>>>>> [<ffffffff812bd477>]
>>>>>>>>>> security_get_user_sids+0x3f7/0x550
>>>>>>>>>> [<ffffffff812b1a8b>] sel_write_user+0x12b/0x210
>>>>>>>>>> [<ffffffff812b1960>] ? sel_write_member+0x200/0x200
>>>>>>>>>> [<ffffffff812b01d8>]
>>>>>>>>>> selinux_transaction_write+0x48/0x80
>>>>>>>>>> [<ffffffff811f444d>] vfs_write+0xbd/0x1e0
>>>>>>>>>> [<ffffffff811f4eef>] SyS_write+0x7f/0xe0
>>>>>>>>>> [<ffffffff8166d433>] system_call_fastpath+0x16/0x1b
>>>>>>>>>>
>>>>>>>>>> My opinion:
>>>>>>>>>> when the docker container starts, it would mount
>>>>>>>>>> overlay
>>>>>>>>>> filesystem with different selinux context, mount
>>>>>>>>>> point such as:
>>>>>>>>>> overlay on
>>>>>>>>>> /var/lib/docker/overlay2/be3ef517730d92fc4530e0e952ea
>>>>>>>>>> e4f6cb0f
>>>>>>>>>> 07b4
>>>>>>>>>> bc32
>>>>>>>>>> 6cb07495ca08fc9ddb66/merged type overlay
>>>>>>>>>> (rw,relatime,context="system_u:object_r:svirt_sandbox
>>>>>>>>>> _file_t:
>>>>>>>>>> s0:c
>>>>>>>>>> 414,
>>>>>>>>>> c873",lowerdir=/var/lib/docker/overlay2/l/Z4U7WY6ASNV
>>>>>>>>>> 5CFWLADP
>>>>>>>>>> ARHH
>>>>>>>>>> WY7:
>>>>>>>>>> /var/lib/docker/overlay2/l/V2S3HOKEFEOQLHBVAL5WLA3YLS
>>>>>>>>>> :/var/li
>>>>>>>>>> b/do
>>>>>>>>>> cker
>>>>>>>>>> /overlay2/l/46YGYO474KLOULZGDSZDW2JPRI,upperdir=/var/
>>>>>>>>>> lib/dock
>>>>>>>>>> er/o
>>>>>>>>>> verl
>>>>>>>>>> ay2/be3ef517730d92fc4530e0e952eae4f6cb0f07b4bc326cb07
>>>>>>>>>> 495ca08f
>>>>>>>>>> c9dd
>>>>>>>>>> b66/
>>>>>>>>>> diff,workdir=/var/lib/docker/overlay2/be3ef517730d92f
>>>>>>>>>> c4530e0e
>>>>>>>>>> 952e
>>>>>>>>>> ae4f
>>>>>>>>>> 6cb0f07b4bc326cb07495ca08fc9ddb66/work)
>>>>>>>>>> shm on
>>>>>>>>>> /var/lib/docker/containers/9fd65e177d2132011d7b422755
>>>>>>>>>> 793449c9
>>>>>>>>>> 1327
>>>>>>>>>> ca57
>>>>>>>>>> 7b8f5d9d6a4adf218d4876/shm type tmpfs
>>>>>>>>>> (rw,nosuid,nodev,noexec,relatime,context="system_u:ob
>>>>>>>>>> ject_r:s
>>>>>>>>>> virt
>>>>>>>>>> _san
>>>>>>>>>> dbox_file_t:s0:c414,c873",size=65536k)
>>>>>>>>>> overlay on
>>>>>>>>>> /var/lib/docker/overlay2/38d1544d080145c7d76150530d02
>>>>>>>>>> 55991dfb
>>>>>>>>>> 7258
>>>>>>>>>> cbca
>>>>>>>>>> 14ff6d165b94353eefab/merged type overlay
>>>>>>>>>> (rw,relatime,context="system_u:object_r:svirt_sandbox
>>>>>>>>>> _file_t:
>>>>>>>>>> s0:c
>>>>>>>>>> 431,
>>>>>>>>>> c651",lowerdir=/var/lib/docker/overlay2/l/3MQQXB4UCLF
>>>>>>>>>> B7ANVRHP
>>>>>>>>>> AVRC
>>>>>>>>>> RSS:
>>>>>>>>>> /var/lib/docker/overlay2/l/46YGYO474KLOULZGDSZDW2JPRI
>>>>>>>>>> ,upperdi
>>>>>>>>>> r=/v
>>>>>>>>>> ar/l
>>>>>>>>>> ib/docker/overlay2/38d1544d080145c7d76150530d0255991d
>>>>>>>>>> fb7258cb
>>>>>>>>>> ca14
>>>>>>>>>> ff6d
>>>>>>>>>> 165b94353eefab/diff,workdir=/var/lib/docker/overlay2/
>>>>>>>>>> 38d1544d
>>>>>>>>>> 0801
>>>>>>>>>> 45c7
>>>>>>>>>> d76150530d0255991dfb7258cbca14ff6d165b94353eefab/work
>>>>>>>>>> )
>>>>>>>>>> shm on
>>>>>>>>>> /var/lib/docker/containers/662e7f798fc08b09eae0f0f944
>>>>>>>>>> 537a4bce
>>>>>>>>>> dc1d
>>>>>>>>>> cf05
>>>>>>>>>> a65866458523ffd4a71614/shm type tmpfs
>>>>>>>>>> (rw,nosuid,nodev,noexec,relatime,context="system_u:ob
>>>>>>>>>> ject_r:s
>>>>>>>>>> virt
>>>>>>>>>> _san
>>>>>>>>>> dbox_file_t:s0:c431,c651",size=65536k)
>>>>>>>>>>
>>>>>>>>>> sidtab_search_context check the context whether is in
>>>>>>>>>> the sidtab
>>>>>>>>>> list, If not found, a new node is generated and
>>>>>>>>>> insert into the
>>>>>>>>>> list, As the number of containers is
>>>>>>>>>> increasing, context nodes
>>>>>>>>>> are also more and more, we tested the final number of
>>>>>>>>>> nodes
>>>>>>>>>> reached
>>>>>>>>>> 300,000 +,
>>>>>>>>>> sidtab_context_to_sid runtime needs 100-200ms, which
>>>>>>>>>> will lead
>>>>>>>>>> to the system softlockup.
>>>>>>>>>>
>>>>>>>>>> Is this a selinux bug? When filesystem umount, why
>>>>>>>>>> context node
>>>>>>>>>> is not deleted? I cannot find the relevant function
>>>>>>>>>> to delete
>>>>>>>>>> the node in sidtab.c
>>>>>>>>>>
>>>>>>>>>> Thanks for reading and looking forward to your reply.
>>>>>>>>> So, does docker just keep allocating a unique category
>>>>>>>>> set for
>>>>>>>>> every new container, never reusing them even if the
>>>>>>>>> container is
>>>>>>>>> destroyed?
>>>>>>>>> That would be a bug in docker IMHO. Or are you
>>>>>>>>> creating an
>>>>>>>>> unbounded number of containers and never destroying the
>>>>>>>>> older
>>>>>>>>> ones?
>>>>>>>> You can't reuse the security context. A process in
>>>>>>>> ContainerA
>>>>>>>> sends a labeled packet to MachineB. ContainerA goes away
>>>>>>>> and its
>>>>>>>> context is recycled in ContainerC. MachineB responds some
>>>>>>>> time
>>>>>>>> later, again with a labeled packet. ContainerC gets
>>>>>>>> information
>>>>>>>> intended for ContainerA, and uses the information to take
>>>>>>>> over the
>>>>>>>> Elbonian government.
>>>>>>> Docker isn't using labeled networking (nor is anything else
>>>>>>> by
>>>>>>> default; it is only enabled if explicitly configured).
>>>>>> If labeled networking weren't an issue we'd have full
>>>>>> security
>>>>>> module stacking by now. Yes, it's an edge case. If you want
>>>>>> to use
>>>>>> labeled NFS or a local filesystem that gets mounted in each
>>>>>> container (don't tell me that nobody would do that) you've
>>>>>> got the
>>>>>> same problem.
>>>>> Even if someone were to configure labeled networking, Docker is
>>>>> not
>>>>> presently relying on that or SELinux network enforcement for
>>>>> any
>>>>> security properties, so it really doesn't matter.
>>>> True enough. I can imagine a use case, but as you point out, it
>>>> would
>>>> be a very complex configuration and coordination exercise using
>>>> SELinux.
>>>>
>>>>> And if they wanted
>>>>> to do that, they'd have to coordinate category assignments
>>>>> across all
>>>>> systems involved, for which no facility exists AFAIK. If you
>>>>> have
>>>>> two docker instances running on different hosts, I'd wager that
>>>>> they
>>>>> can hand out the same category sets today to different
>>>>> containers.
>>>>>
>>>>> With respect to labeled NFS, that's also not the default for
>>>>> nfs
>>>>> mounts, so again it is a custom configuration and Docker isn't
>>>>> relying on it for any guarantees today. For local filesystems,
>>>>> they
>>>>> would normally be context-mounted or using genfscon rather
>>>>> than
>>>>> xattrs in order to be accessible to the container, thus no
>>>>> persistent
>>>>> storage of the category sets.
>>> Well Kubernetes and OpenShift do set the labels to be the same
>>> within a project, and they can manage across nodes. But yes we are
>>> not using labeled networking at this point.
>>>> I know that is the intended configuration, but I see people do
>>>> all
>>>> sorts of stoopid things for what they believe are good reasons.
>>>> Unfortunately, lots of people count on containers to provide
>>>> isolation, but create "solutions" for data sharing that defeat
>>>> it.
>>>>
>>>>> Certainly docker could provide an option to not reuse category
>>>>> sets,
>>>>> but making that the default is not sane and just guarantees
>>>>> exhaustion of the SID and context space (just create and tear
>>>>> down
>>>>> lots of containers every day or more frequently).
>>>> It seems that Docker might have a similar issue with UIDs, but
>>>> it
>>>> takes longer to run out of UIDs than sidtab entries.
>>>>
>>>>>>>>> On the selinux userspace side, we'd also like to
>>>>>>>>> eliminate the
>>>>>>>>> use of /sys/fs/selinux/user (sel_write_user ->
>>>>>>>>> security_get_user_sids) entirely, which is what
>>>>>>>>> triggered this
>>>>>>>>> for you.
>>>>>>>>>
>>>>>>>>> We cannot currently delete a sidtab node because we
>>>>>>>>> have no way
>>>>>>>>> of knowing if there are any lingering references to the
>>>>>>>>> SID.
>>>>>>>>> Fixing that would require reference-counted SIDs, which
>>>>>>>>> goes
>>>>>>>>> beyond just SELinux since SIDs/secids are returned by
>>>>>>>>> LSM hooks
>>>>>>>>> and cached in other kernel data structures.
>>>>>>>> You could delete a sidtab node. The code already deals
>>>>>>>> with
>>>>>>>> unfindable SIDs. The issue is that eventually you run out
>>>>>>>> of SIDs.
>>>>>>>> Then you are forced to recycle SIDs, which leads to the
>>>>>>>> overthrow
>>>>>>>> of the Elbonian government.
>>>>>>> We don't know when we can safely delete a sidtab node since
>>>>>>> SIDs
>>>>>>> aren't reference counted and we can't know whether it is
>>>>>>> still in
>>>>>>> use somewhere in the kernel. Doing so prematurely would
>>>>>>> lead to
>>>>>>> the SID being remapped to the unlabeled context, and then
>>>>>>> likely to
>>>>>>> undesired denials.
>>>>>> I would suggest that if you delete a sidtab node and someone
>>>>>> comes
>>>>>> along later and tries to use it that denial is exactly what
>>>>>> you
>>>>>> would desire. I don't see any other rational action.
>>>>> Yes, if we know that the SID wasn't in use at the time we tore
>>>>> it down.
>>>>> But if we're just randomly deleting sidtab entries based on
>>>>> age or
>>>>> something (since we have no reference count), we'll almost
>>>>> certainly
>>>>> encounter situations where a SID hasn't been accessed in a long
>>>>> time
>>>>> but is still being legitimately cached somewhere. Just a file
>>>>> that
>>>>> hasn't been accessed in a while might have that SID still
>>>>> cached in
>>>>> its inode security blob, or anywhere else.
>>>>>
>>>>>>>>> sidtab_search_context() could no doubt be optimized for
>>>>>>>>> the
>>>>>>>>> negative case; there was an earlier optimization for
>>>>>>>>> the positive
>>>>>>>>> case by adding a cache to sidtab_context_to_sid() prior
>>>>>>>>> to
>>>>>>>>> calling it. It's a reverse lookup in the sidtab.
>>>>>>>> This seems like a bad idea.
>>>>>>> Not sure what you mean, but it can certainly be changed to
>>>>>>> at least
>>>>>>> use a hash table for these reverse lookups.
>>>>>>>
>>>>>>>
>>>>
>>>>
>> Thanks for reply and discussion.
>> I think docker container is only a case, Is it possible there is a
>> similar way, through some means of attack, triggered a constantly
>> increasing of SIDs list, eventually leading to the system panic?
>>
>> I think the issue is that is takes too long to search SID node when
>> SIDs list too large,
>> If can optimize the node's data structure(ie : tree structure) or
>> search algorithm to ensure that traversing all nodes can be very
>> short time even in many nodes, maybe it can solve the problem.
>> Or, in sidtab.c provides "delete_sidtab_node" interface, when umount
>> fs, delete the SID node. Because when fs is umounted, the SID is
>> useless, could delete it to control the size of SIDs list.
>>
>> Thanks for reading and looking forward to your reply.
> We cannot safely delete entries in the sidtab without first adding
> reference counting of SIDs, which goes beyond just SELinux since they
> are cached in other kernel data structures and returned by LSM hooks.
> That's a non-trivial undertaking.
>
> Far more practical in the near term would be to introduce a hash table
> or other mechanism for efficient reverse lookups in the sidtab. Are
> you offering to implement that or just requesting it?
>
> Independent of that, docker should support reuse of category sets when
> containers are deleted, at least as an option and probably as the
> default.
>
>
Docker does reuse categories of containers that are removed, by default.

2017-12-16 10:28:56

by Yang Jihong

[permalink] [raw]
Subject: Re: [BUG]kernel softlockup due to sidtab_search_context run for long time because of too many sidtab context node

>On 12/15/2017 08:56 AM, Stephen Smalley wrote:
>> On Fri, 2017-12-15 at 03:09 +0000, yangjihong wrote:
>>> On 12/15/2017 10:31 PM, yangjihong wrote:
>>>> On 12/14/2017 12:42 PM, Casey Schaufler wrote:
>>>>> On 12/14/2017 9:15 AM, Stephen Smalley wrote:
>>>>>> On Thu, 2017-12-14 at 09:00 -0800, Casey Schaufler wrote:
>>>>>>> On 12/14/2017 8:42 AM, Stephen Smalley wrote:
>>>>>>>> On Thu, 2017-12-14 at 08:18 -0800, Casey Schaufler wrote:
>>>>>>>>> On 12/13/2017 7:18 AM, Stephen Smalley wrote:
>>>>>>>>>> On Wed, 2017-12-13 at 09:25 +0000, yangjihong wrote:
>>>>>>>>>>> Hello,
>>>>>>>>>>>
>>>>>>>>>>> I am doing stressing testing on 3.10 kernel(centos 7.4), to
>>>>>>>>>>> constantly starting numbers of docker ontainers with selinux
>>>>>>>>>>> enabled, and after about 2 days, the kernel softlockup panic:
>>>>>>>>>>> <IRQ> [<ffffffff810bb778>]
>>>>>>>>>>> sched_show_task+0xb8/0x120
>>>>>>>>>>> [<ffffffff8116133f>] show_lock_info+0x20f/0x3a0
>>>>>>>>>>> [<ffffffff811226aa>] watchdog_timer_fn+0x1da/0x2f0
>>>>>>>>>>> [<ffffffff811224d0>] ?
>>>>>>>>>>> watchdog_enable_all_cpus.part.4+0x40/0x40
>>>>>>>>>>> [<ffffffff810abf82>]
>>>>>>>>>>> __hrtimer_run_queues+0xd2/0x260
>>>>>>>>>>> [<ffffffff810ac520>] hrtimer_interrupt+0xb0/0x1e0
>>>>>>>>>>> [<ffffffff8104a477>]
>>>>>>>>>>> local_apic_timer_interrupt+0x37/0x60
>>>>>>>>>>> [<ffffffff8166fd90>]
>>>>>>>>>>> smp_apic_timer_interrupt+0x50/0x140
>>>>>>>>>>> [<ffffffff8166e1dd>] apic_timer_interrupt+0x6d/0x80
>>>>>>>>>>> <EOI> [<ffffffff812b4193>] ?
>>>>>>>>>>> sidtab_context_to_sid+0xb3/0x480
>>>>>>>>>>> [<ffffffff812b41f0>] ?
>>>>>>>>>>> sidtab_context_to_sid+0x110/0x480
>>>>>>>>>>> [<ffffffff812c0d15>] ?
>>>>>>>>>>> mls_setup_user_range+0x145/0x250
>>>>>>>>>>> [<ffffffff812bd477>]
>>>>>>>>>>> security_get_user_sids+0x3f7/0x550
>>>>>>>>>>> [<ffffffff812b1a8b>] sel_write_user+0x12b/0x210
>>>>>>>>>>> [<ffffffff812b1960>] ? sel_write_member+0x200/0x200
>>>>>>>>>>> [<ffffffff812b01d8>]
>>>>>>>>>>> selinux_transaction_write+0x48/0x80
>>>>>>>>>>> [<ffffffff811f444d>] vfs_write+0xbd/0x1e0
>>>>>>>>>>> [<ffffffff811f4eef>] SyS_write+0x7f/0xe0
>>>>>>>>>>> [<ffffffff8166d433>] system_call_fastpath+0x16/0x1b
>>>>>>>>>>>
>>>>>>>>>>> My opinion:
>>>>>>>>>>> when the docker container starts, it would mount overlay
>>>>>>>>>>> filesystem with different selinux context, mount point such
>>>>>>>>>>> as:
>>>>>>>>>>> overlay on
>>>>>>>>>>> /var/lib/docker/overlay2/be3ef517730d92fc4530e0e952ea
>>>>>>>>>>> e4f6cb0f
>>>>>>>>>>> 07b4
>>>>>>>>>>> bc32
>>>>>>>>>>> 6cb07495ca08fc9ddb66/merged type overlay
>>>>>>>>>>> (rw,relatime,context="system_u:object_r:svirt_sandbox
>>>>>>>>>>> _file_t:
>>>>>>>>>>> s0:c
>>>>>>>>>>> 414,
>>>>>>>>>>> c873",lowerdir=/var/lib/docker/overlay2/l/Z4U7WY6ASNV
>>>>>>>>>>> 5CFWLADP
>>>>>>>>>>> ARHH
>>>>>>>>>>> WY7:
>>>>>>>>>>> /var/lib/docker/overlay2/l/V2S3HOKEFEOQLHBVAL5WLA3YLS
>>>>>>>>>>> :/var/li
>>>>>>>>>>> b/do
>>>>>>>>>>> cker
>>>>>>>>>>> /overlay2/l/46YGYO474KLOULZGDSZDW2JPRI,upperdir=/var/
>>>>>>>>>>> lib/dock
>>>>>>>>>>> er/o
>>>>>>>>>>> verl
>>>>>>>>>>> ay2/be3ef517730d92fc4530e0e952eae4f6cb0f07b4bc326cb07
>>>>>>>>>>> 495ca08f
>>>>>>>>>>> c9dd
>>>>>>>>>>> b66/
>>>>>>>>>>> diff,workdir=/var/lib/docker/overlay2/be3ef517730d92f
>>>>>>>>>>> c4530e0e
>>>>>>>>>>> 952e
>>>>>>>>>>> ae4f
>>>>>>>>>>> 6cb0f07b4bc326cb07495ca08fc9ddb66/work)
>>>>>>>>>>> shm on
>>>>>>>>>>> /var/lib/docker/containers/9fd65e177d2132011d7b422755
>>>>>>>>>>> 793449c9
>>>>>>>>>>> 1327
>>>>>>>>>>> ca57
>>>>>>>>>>> 7b8f5d9d6a4adf218d4876/shm type tmpfs
>>>>>>>>>>> (rw,nosuid,nodev,noexec,relatime,context="system_u:ob
>>>>>>>>>>> ject_r:s
>>>>>>>>>>> virt
>>>>>>>>>>> _san
>>>>>>>>>>> dbox_file_t:s0:c414,c873",size=65536k)
>>>>>>>>>>> overlay on
>>>>>>>>>>> /var/lib/docker/overlay2/38d1544d080145c7d76150530d02
>>>>>>>>>>> 55991dfb
>>>>>>>>>>> 7258
>>>>>>>>>>> cbca
>>>>>>>>>>> 14ff6d165b94353eefab/merged type overlay
>>>>>>>>>>> (rw,relatime,context="system_u:object_r:svirt_sandbox
>>>>>>>>>>> _file_t:
>>>>>>>>>>> s0:c
>>>>>>>>>>> 431,
>>>>>>>>>>> c651",lowerdir=/var/lib/docker/overlay2/l/3MQQXB4UCLF
>>>>>>>>>>> B7ANVRHP
>>>>>>>>>>> AVRC
>>>>>>>>>>> RSS:
>>>>>>>>>>> /var/lib/docker/overlay2/l/46YGYO474KLOULZGDSZDW2JPRI
>>>>>>>>>>> ,upperdi
>>>>>>>>>>> r=/v
>>>>>>>>>>> ar/l
>>>>>>>>>>> ib/docker/overlay2/38d1544d080145c7d76150530d0255991d
>>>>>>>>>>> fb7258cb
>>>>>>>>>>> ca14
>>>>>>>>>>> ff6d
>>>>>>>>>>> 165b94353eefab/diff,workdir=/var/lib/docker/overlay2/
>>>>>>>>>>> 38d1544d
>>>>>>>>>>> 0801
>>>>>>>>>>> 45c7
>>>>>>>>>>> d76150530d0255991dfb7258cbca14ff6d165b94353eefab/work
>>>>>>>>>>> )
>>>>>>>>>>> shm on
>>>>>>>>>>> /var/lib/docker/containers/662e7f798fc08b09eae0f0f944
>>>>>>>>>>> 537a4bce
>>>>>>>>>>> dc1d
>>>>>>>>>>> cf05
>>>>>>>>>>> a65866458523ffd4a71614/shm type tmpfs
>>>>>>>>>>> (rw,nosuid,nodev,noexec,relatime,context="system_u:ob
>>>>>>>>>>> ject_r:s
>>>>>>>>>>> virt
>>>>>>>>>>> _san
>>>>>>>>>>> dbox_file_t:s0:c431,c651",size=65536k)
>>>>>>>>>>>
>>>>>>>>>>> sidtab_search_context check the context whether is in the
>>>>>>>>>>> sidtab list, If not found, a new node is generated and insert
>>>>>>>>>>> into the list, As the number of containers is increasing,
>>>>>>>>>>> context nodes are also more and more, we tested the final
>>>>>>>>>>> number of nodes reached
>>>>>>>>>>> 300,000 +,
>>>>>>>>>>> sidtab_context_to_sid runtime needs 100-200ms, which will
>>>>>>>>>>> lead to the system softlockup.
>>>>>>>>>>>
>>>>>>>>>>> Is this a selinux bug? When filesystem umount, why context
>>>>>>>>>>> node is not deleted? I cannot find the relevant function to
>>>>>>>>>>> delete the node in sidtab.c
>>>>>>>>>>>
>>>>>>>>>>> Thanks for reading and looking forward to your reply.
>>>>>>>>>> So, does docker just keep allocating a unique category set for
>>>>>>>>>> every new container, never reusing them even if the container
>>>>>>>>>> is destroyed?
>>>>>>>>>> That would be a bug in docker IMHO. Or are you creating an
>>>>>>>>>> unbounded number of containers and never destroying the older
>>>>>>>>>> ones?
>>>>>>>>> You can't reuse the security context. A process in ContainerA
>>>>>>>>> sends a labeled packet to MachineB. ContainerA goes away and
>>>>>>>>> its context is recycled in ContainerC. MachineB responds some
>>>>>>>>> time later, again with a labeled packet. ContainerC gets
>>>>>>>>> information intended for ContainerA, and uses the information
>>>>>>>>> to take over the Elbonian government.
>>>>>>>> Docker isn't using labeled networking (nor is anything else by
>>>>>>>> default; it is only enabled if explicitly configured).
>>>>>>> If labeled networking weren't an issue we'd have full security
>>>>>>> module stacking by now. Yes, it's an edge case. If you want to
>>>>>>> use labeled NFS or a local filesystem that gets mounted in each
>>>>>>> container (don't tell me that nobody would do that) you've got
>>>>>>> the same problem.
>>>>>> Even if someone were to configure labeled networking, Docker is
>>>>>> not presently relying on that or SELinux network enforcement for
>>>>>> any security properties, so it really doesn't matter.
>>>>> True enough. I can imagine a use case, but as you point out, it
>>>>> would be a very complex configuration and coordination exercise
>>>>> using SELinux.
>>>>>
>>>>>> And if they wanted
>>>>>> to do that, they'd have to coordinate category assignments across
>>>>>> all systems involved, for which no facility exists AFAIK. If you
>>>>>> have two docker instances running on different hosts, I'd wager
>>>>>> that they can hand out the same category sets today to different
>>>>>> containers.
>>>>>>
>>>>>> With respect to labeled NFS, that's also not the default for nfs
>>>>>> mounts, so again it is a custom configuration and Docker isn't
>>>>>> relying on it for any guarantees today. For local filesystems,
>>>>>> they would normally be context-mounted or using genfscon rather
>>>>>> than xattrs in order to be accessible to the container, thus no
>>>>>> persistent storage of the category sets.
>>>> Well, Kubernetes and OpenShift do set the labels to be the same
>>>> within a project, and they can manage that across nodes. But yes, we
>>>> are not using labeled networking at this point.
>>>>> I know that is the intended configuration, but I see people do all
>>>>> sorts of stoopid things for what they believe are good reasons.
>>>>> Unfortunately, lots of people count on containers to provide
>>>>> isolation, but create "solutions" for data sharing that defeat it.
>>>>>
>>>>>> Certainly docker could provide an option to not reuse category
>>>>>> sets, but making that the default is not sane and just guarantees
>>>>>> exhaustion of the SID and context space (just create and tear down
>>>>>> lots of containers every day or more frequently).
>>>>> It seems that Docker might have a similar issue with UIDs, but it
>>>>> takes longer to run out of UIDs than sidtab entries.
>>>>>
>>>>>>>>>> On the selinux userspace side, we'd also like to eliminate the
>>>>>>>>>> use of /sys/fs/selinux/user (sel_write_user ->
>>>>>>>>>> security_get_user_sids) entirely, which is what triggered this
>>>>>>>>>> for you.
>>>>>>>>>>
>>>>>>>>>> We cannot currently delete a sidtab node because we have no
>>>>>>>>>> way of knowing if there are any lingering references to the
>>>>>>>>>> SID.
>>>>>>>>>> Fixing that would require reference-counted SIDs, which goes
>>>>>>>>>> beyond just SELinux since SIDs/secids are returned by LSM
>>>>>>>>>> hooks and cached in other kernel data structures.
>>>>>>>>> You could delete a sidtab node. The code already deals with
>>>>>>>>> unfindable SIDs. The issue is that eventually you run out of
>>>>>>>>> SIDs.
>>>>>>>>> Then you are forced to recycle SIDs, which leads to the
>>>>>>>>> overthrow of the Elbonian government.
>>>>>>>> We don't know when we can safely delete a sidtab node since SIDs
>>>>>>>> aren't reference counted and we can't know whether it is still
>>>>>>>> in use somewhere in the kernel. Doing so prematurely would lead
>>>>>>>> to the SID being remapped to the unlabeled context, and then
>>>>>>>> likely to undesired denials.
>>>>>>> I would suggest that if you delete a sidtab node and someone
>>>>>>> comes along later and tries to use it that denial is exactly what
>>>>>>> you would desire. I don't see any other rational action.
>>>>>> Yes, if we know that the SID wasn't in use at the time we tore it
>>>>>> down.
>>>>>> But if we're just randomly deleting sidtab entries based on age
>>>>>> or something (since we have no reference count), we'll almost
>>>>>> certainly encounter situations where a SID hasn't been accessed in
>>>>>> a long time but is still being legitimately cached somewhere.
>>>>>> Just a file that hasn't been accessed in a while might have that
>>>>>> SID still cached in its inode security blob, or anywhere else.
>>>>>>
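
To make sure I understand the reference-counting point above, here is a
rough user-space sketch (purely an illustration with made-up names, not
the real code in security/selinux/ss/sidtab.c): every holder of a SID
would have to take a reference and drop it when done, and a node could
only be unlinked and freed once its count reaches zero.

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical refcounted sidtab node; field and function names are
 * made up for this sketch. */
struct sid_node {
    unsigned int sid;
    char *context;
    unsigned int refcount;  /* how many kernel objects still hold this SID */
    struct sid_node *next;
};

static struct sid_node *sidtab_head;

/* Every place that stores a SID (inode security blob, socket, ...)
 * would take a reference ... */
static void sid_get(struct sid_node *n)
{
    n->refcount++;
}

/* ... and drop it when done; only when the count hits zero is it safe
 * to unlink and free the node. */
static void sid_put(struct sid_node **head, struct sid_node *n)
{
    struct sid_node **p;

    if (--n->refcount > 0)
        return;
    for (p = head; *p; p = &(*p)->next) {
        if (*p == n) {
            *p = n->next;
            free(n->context);
            free(n);
            return;
        }
    }
}

int main(void)
{
    struct sid_node *n = calloc(1, sizeof(*n));

    n->sid = 42;
    n->context = strdup("system_u:object_r:svirt_sandbox_file_t:s0:c414,c873");
    n->next = sidtab_head;
    sidtab_head = n;

    sid_get(n);                /* e.g. an inode caches SID 42 */
    sid_get(n);                /* e.g. a socket caches it too */
    sid_put(&sidtab_head, n);  /* inode releases it */
    sid_put(&sidtab_head, n);  /* socket releases it -> node freed */
    printf("sidtab empty: %s\n", sidtab_head == NULL ? "yes" : "no");
    return 0;
}

If I read the discussion correctly, the hard part is not the freeing
itself but getting every user of a secid in the kernel to participate in
the get/put, which is why this goes beyond SELinux.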
>>>>>>>>>> sidtab_search_context() could no doubt be optimized for the
>>>>>>>>>> negative case; there was an earlier optimization for the
>>>>>>>>>> positive case by adding a cache to sidtab_context_to_sid()
>>>>>>>>>> prior to calling it. It's a reverse lookup in the sidtab.
>>>>>>>>> This seems like a bad idea.
>>>>>>>> Not sure what you mean, but it can certainly be changed to at
>>>>>>>> least use a hash table for these reverse lookups.
>>>>>>>>
>>>>>>>>
>>>>>
>>>>>
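If I follow the hash table suggestion above correctly, the idea would be
something like this minimal user-space sketch (bucket count, hash
function and structure names are all made up for illustration and are
not the actual sidtab implementation):

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define SIDTAB_HASH_BUCKETS 1024   /* made-up size, for illustration */

struct sid_node {
    unsigned int sid;
    char *context;              /* flattened context string */
    struct sid_node *next;      /* chain within one hash bucket */
};

static struct sid_node *buckets[SIDTAB_HASH_BUCKETS];
static unsigned int next_sid = 1;

static unsigned int hash_context(const char *ctx)
{
    unsigned int h = 5381;      /* djb2 string hash */

    while (*ctx)
        h = h * 33 + (unsigned char)*ctx++;
    return h % SIDTAB_HASH_BUCKETS;
}

/* Return the existing SID for ctx or allocate a new one.  Cost is the
 * length of one bucket chain instead of the whole list of contexts. */
static unsigned int context_to_sid(const char *ctx)
{
    unsigned int h = hash_context(ctx);
    struct sid_node *n;

    for (n = buckets[h]; n; n = n->next)
        if (strcmp(n->context, ctx) == 0)
            return n->sid;

    n = malloc(sizeof(*n));
    n->sid = next_sid++;
    n->context = strdup(ctx);
    n->next = buckets[h];
    buckets[h] = n;
    return n->sid;
}

int main(void)
{
    printf("%u\n", context_to_sid("system_u:object_r:svirt_sandbox_file_t:s0:c414,c873"));
    printf("%u\n", context_to_sid("system_u:object_r:svirt_sandbox_file_t:s0:c431,c651"));
    printf("%u\n", context_to_sid("system_u:object_r:svirt_sandbox_file_t:s0:c414,c873"));
    return 0;
}

With something like this, the negative case (a context not yet in the
table) no longer has to walk every existing node before allocating a
new SID.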
>>> Thanks for the reply and discussion.
>>> I think the docker container scenario is only one case. Is it possible
>>> that a similar path exists, where some means of attack triggers a
>>> constantly growing SID list and eventually leads to a system panic?
>>>
>>> I think the issue is that it takes too long to search for a SID node
>>> when the SID list is too large. If the node data structure (e.g. a
>>> tree) or the search algorithm could be optimized so that traversing
>>> the nodes stays fast even with many of them, that might solve the
>>> problem.
>>> Or sidtab.c could provide a "delete_sidtab_node" interface, so that
>>> when a filesystem is unmounted its SID nodes are deleted. Once the
>>> filesystem is unmounted the SIDs are useless, so deleting them would
>>> keep the size of the SID list under control (a rough sketch of what I
>>> mean is below).
>>>
>>> Thanks for reading and looking forward to your reply.
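
What I had in mind for such an interface is roughly the following
user-space sketch, with a simplified node structure (sidtab.c provides
no such function today, and as the reply below explains it would only be
safe if nothing in the kernel still held one of the deleted SIDs):

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

struct sid_node {              /* simplified stand-in for a sidtab entry */
    unsigned int sid;
    char *context;
    struct sid_node *next;
};

/*
 * Unlink and free every node whose context matches that of the
 * filesystem that was just unmounted.  WARNING: only safe if nothing
 * in the kernel still holds one of these SIDs, which the current code
 * cannot know without reference counting.
 */
static void delete_sidtab_nodes(struct sid_node **head, const char *context)
{
    struct sid_node **p = head;

    while (*p) {
        if (strcmp((*p)->context, context) == 0) {
            struct sid_node *victim = *p;

            *p = victim->next;
            free(victim->context);
            free(victim);
        } else {
            p = &(*p)->next;
        }
    }
}

int main(void)
{
    struct sid_node *head = calloc(1, sizeof(*head));

    head->sid = 7;
    head->context = strdup("system_u:object_r:svirt_sandbox_file_t:s0:c431,c651");
    head->next = NULL;

    /* e.g. called from the umount path for that context mount */
    delete_sidtab_nodes(&head, "system_u:object_r:svirt_sandbox_file_t:s0:c431,c651");
    printf("remaining nodes: %s\n", head ? "some" : "none");
    return 0;
}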
>> We cannot safely delete entries in the sidtab without first adding
>> reference counting of SIDs, which goes beyond just SELinux since they
>> are cached in other kernel data structures and returned by LSM hooks.
>> That's a non-trivial undertaking.
>>
>> Far more practical in the near term would be to introduce a hash table
>> or other mechanism for efficient reverse lookups in the sidtab. Are
>> you offering to implement that or just requesting it?
>>
Because I'm not very familiar with the overall architecture of SELinux, I'm afraid I cannot offer to implement it, sorry.
Please tell me what I can do if there is any other way I can help.
If there is any progress (i.e. once a solution or optimization method is determined), could you please let me know about it? Thanks!

>> Independent of that, docker should support reuse of category sets when
>> containers are deleted, at least as an option and probably as the
>> default.
>>
>>
> Docker does reuse the categories of removed containers by default.

Thanks for reading and looking forward to your reply.
Best wishes!