I'm seeing a race during policy load while the "regular" sidtab
conversion is happening and a live conversion starts to take place in
sidtab_context_to_sid().
We have an initial policy that's loaded by systemd ~0.6s into boot and
then another policy gets loaded ~2-3s into boot. That second policy load
is what hits the race condition, because the sidtab is only partially
populated and there's a decent amount of filesystem activity happening
at the same time, which triggers live conversions.
[ 3.091910] Unable to handle kernel paging request at virtual address 001303e1aa140408
[ 3.100083] Mem abort info:
[ 3.102963] ESR = 0x96000004
[ 3.102965] EC = 0x25: DABT (current EL), IL = 32 bits
[ 3.102967] SET = 0, FnV = 0
[ 3.102968] EA = 0, S1PTW = 0
[ 3.102969] Data abort info:
[ 3.102970] ISV = 0, ISS = 0x00000004
[ 3.102971] CM = 0, WnR = 0
[ 3.102973] [001303e1aa140408] address between user and kernel address ranges
[ 3.102977] Internal error: Oops: 96000004 [#1] SMP
[ 3.102981] Modules linked in:
[ 3.111250] bnxt_en pcie_iproc_platform pcie_iproc diagbe(O)
[ 3.111259] CPU: 0 PID: 529 Comm: (tservice) Tainted: G O 5.10.17.1 #1
[ 3.119881] Hardware name: redacted (DT)
[ 3.119884] pstate: 20400085 (nzCv daIf +PAN -UAO -TCO BTYPE=--)
[ 3.119898] pc : sidtab_do_lookup (/usr/src/kernel/security/selinux/ss/sidtab.c:202)
[ 3.119902] lr : sidtab_context_to_sid (/usr/src/kernel/security/selinux/ss/sidtab.c:312)
[ 3.126105] sp : ffff800011ceb810
[ 3.126106] x29: ffff800011ceb810 x28: 0000000000000000
[ 3.126108] x27: 0000000000000005 x26: ffffda109f3f2000
[ 3.126110] x25: 00000000ffffffff x24: 0000000000000000
[ 3.126113] x23: 0000000000000001
[ 3.133124] x22: 0000000000000005
[ 3.133125] x21: aa1303e1aa140408 x20: 0000000000000001
[ 3.133127] x19: 00000000000000cc x18: 0000000000000003
[ 3.133128] x17: 000000000000003e x16: 000000000000003f
[ 3.145519] x15: 0000000000000039 x14: 000000000000002e
[ 3.145521] x13: 0000000058294db1 x12: 00000000158294db
[ 3.145523] x11: 000000007f0b3af2 x10: 0000000000004e00
[ 3.145525] x9 : 00000000000000cd x8 : 0000000000000005
[ 3.281289] x7 : feff735e62647764 x6 : 00000000000080cc
[ 3.286769] x5 : 0000000000000005 x4 : ffff3f47c5b20000
[ 3.292249] x3 : ffff800011ceb900 x2 : 0000000000000001
[ 3.297729] x1 : 00000000000000cc x0 : aa1303e1aa1403e0
[ 3.303210] Call trace:
[ 3.305733] sidtab_do_lookup (/usr/src/kernel/security/selinux/ss/sidtab.c:202)
[ 3.309867] sidtab_context_to_sid (/usr/src/kernel/security/selinux/ss/sidtab.c:312)
[ 3.314451] security_context_to_sid_core (/usr/src/kernel/security/selinux/ss/services.c:1557)
[ 3.319661] security_context_to_sid_default (/usr/src/kernel/security/selinux/ss/services.c:1616)
[ 3.324961] inode_doinit_use_xattr (/usr/src/kernel/security/selinux/hooks.c:1366)
[ 3.329634] inode_doinit_with_dentry (/usr/src/kernel/security/selinux/hooks.c:1457)
[ 3.334486] selinux_d_instantiate (/usr/src/kernel/security/selinux/hooks.c:6278)
[ 3.338889] security_d_instantiate (/usr/src/kernel/security/security.c:2004)
[ 3.343385] d_splice_alias (/usr/src/kernel/fs/dcache.c:3030)
[ 3.347251] squashfs_lookup (/usr/src/kernel/fs/squashfs/namei.c:220)
[ 3.385561] el0_sync_handler (/usr/src/kernel/arch/arm64/kernel/entry-common.c:428)
[ 3.389517] el0_sync (/usr/src/kernel/arch/arm64/kernel/entry.S:671)
[ 3.392939] Code: 51002718 340001d7 1ad82768 8b284c15 (f94002a0)
All code
========
0: 18 27 sbb %ah,(%rdi)
2: 00 51 d7 add %dl,-0x29(%rcx)
5: 01 00 add %eax,(%rax)
7: 34 68 xor $0x68,%al
9: 27 (bad)
a: d8 1a fcomps (%rdx)
c:* 15 4c 28 8b a0 adc $0xa08b284c,%eax <-- trapping instruction
11: 02 40 f9 add -0x7(%rax),%al
Code starting with the faulting instruction
===========================================
0: a0 .byte 0xa0
1: 02 40 f9 add -0x7(%rax),%al
[ 3.399230] ---[ end trace cc1840b3ff2c7506 ]---
The corresponding source from sidtab.c:
179 static struct sidtab_entry *sidtab_do_lookup(struct sidtab *s, u32 index,
180 int alloc)
181 {
...
193 /* lookup inside the subtree */
194 entry = &s->roots[level];
195 while (level != 0) {
196 capacity_shift -= SIDTAB_INNER_SHIFT;
197 --level;
198
199 entry = &entry->ptr_inner->entries[leaf_index >> capacity_shift];
200 leaf_index &= ((u32)1 << capacity_shift) - 1;
201
202 if (!entry->ptr_inner) {
203 if (alloc)
204 entry->ptr_inner = kzalloc(SIDTAB_NODE_ALLOC_SIZE,
205 GFP_ATOMIC);
206 if (!entry->ptr_inner)
207 return NULL;
208 }
209 }
210 if (!entry->ptr_leaf) {
211 if (alloc)
212 entry->ptr_leaf = kzalloc(SIDTAB_NODE_ALLOC_SIZE,
213 GFP_ATOMIC);
214 if (!entry->ptr_leaf)
215 return NULL;
216 }
217 return &entry->ptr_leaf->entries[index % SIDTAB_LEAF_ENTRIES];
218 }
...
263 int sidtab_context_to_sid(struct sidtab *s, struct context *context,
264 u32 *sid)
265 {
...
305 /*
306 * if we are building a new sidtab, we need to convert the context
307 * and insert it there as well
308 */
309 if (convert) {
310 rc = -ENOMEM;
311 dst_convert = sidtab_do_lookup(convert->target, count, 1);
312 if (!dst_convert) {
313 context_destroy(&dst->context);
314 goto out_unlock;
315 }
...
What I'm having trouble understanding is how the above call to
sidtab_do_lookup(), on the target sidtab that's undergoing a conversion
in sidtab_convert(), can be expected to work. sidtab_convert_tree() is
allocating and initializing ptr_inner sidtab nodes at the same time that
sidtab_do_lookup() is trying to use them, with no locking performed on
the target sidtab.
Ondrej specifically mentions, in commit ee1a84fdfeed ("selinux: overhaul
sidtab to fix bug and improve performance"), that there's no need to
freeze the sidtab during policy reloads, so I know that thought has been
given to these code paths running in parallel.
Can someone more knowledgeable on how the sidtab locking is expected to
work suggest a fix for this crash?
Tyler
On 2021-02-23 15:43:48, Tyler Hicks wrote:
> I'm seeing a race during policy load while the "regular" sidtab
> conversion is happening and a live conversion starts to take place in
> sidtab_context_to_sid().
>
> We have an initial policy that's loaded by systemd ~0.6s into boot and
> then another policy gets loaded ~2-3s into boot. That second policy load
> is what hits the race condition situation because the sidtab is only
> partially populated and there's a decent amount of filesystem operations
> happening, at the same time, which are triggering live conversions.
Here are the relevant messages written to the ring buffer showing the initial
policy load, the second policy load, the sidtab conversion starting, and then
the stack trace:
[ 0.112159] SELinux: Initializing.
...
[ 0.624107] audit: type=1404 audit(0.492:2): enforcing=1 old_enforcing=0 auid=4294967295 ses=4294967295 enabled=1 old-enabled=1 lsm=selinux res=1
[ 0.664063] SELinux: Permission perfmon in class capability2 not defined in policy.
[ 0.664068] SELinux: Permission bpf in class capability2 not defined in policy.
[ 0.664070] SELinux: Permission checkpoint_restore in class capability2 not defined in policy.
[ 0.664077] SELinux: Permission perfmon in class cap2_userns not defined in policy.
[ 0.664079] SELinux: Permission bpf in class cap2_userns not defined in policy.
[ 0.664080] SELinux: Permission checkpoint_restore in class cap2_userns not defined in policy.
[ 0.664114] SELinux: Class perf_event not defined in policy.
[ 0.664115] SELinux: Class lockdown not defined in policy.
[ 0.664117] SELinux: the above unknown classes and permissions will be allowed
[ 0.667863] SELinux: policy capability network_peer_controls=1
[ 0.667866] SELinux: policy capability open_perms=1
[ 0.667867] SELinux: policy capability extended_socket_class=1
[ 0.667868] SELinux: policy capability always_check_network=0
[ 0.667870] SELinux: policy capability cgroup_seclabel=1
[ 0.667871] SELinux: policy capability nnp_nosuid_transition=1
[ 0.667873] SELinux: policy capability genfs_seclabel_symlinks=0
[ 0.712152] audit: type=1403 audit(0.584:3): auid=4294967295 ses=4294967295 lsm=selinux res=1
[ 0.714430] systemd[1]: Successfully loaded SELinux policy in 91.004ms.
...
[ 3.042735] SELinux: Permission perfmon in class capability2 not defined in policy.
[ 3.042741] SELinux: Permission bpf in class capability2 not defined in policy.
[ 3.042743] SELinux: Permission checkpoint_restore in class capability2 not defined in policy.
[ 3.042750] SELinux: Permission perfmon in class cap2_userns not defined in policy.
[ 3.042751] SELinux: Permission bpf in class cap2_userns not defined in policy.
[ 3.042753] SELinux: Permission checkpoint_restore in class cap2_userns not defined in policy.
[ 3.042787] SELinux: Class perf_event not defined in policy.
[ 3.042789] SELinux: Class lockdown not defined in policy.
[ 3.042791] SELinux: the above unknown classes and permissions will be allowed
[ 3.042801] SELinux: Converting 208 SID table entries...
[ 3.044951] Unable to handle kernel access to user memory outside uaccess routines at virtual address 0000000040000028
[ 3.050919] SELinux: policy capability network_peer_controls=1
[ 3.055990] Mem abort info:
[ 3.055992] ESR = 0x96000004
[ 3.055994] EC = 0x25: DABT (current EL), IL = 32 bits
[ 3.055995] SET = 0, FnV = 0
[ 3.055996] EA = 0, S1PTW = 0
[ 3.055997] Data abort info:
[ 3.055998] ISV = 0, ISS = 0x00000004
[ 3.055999] CM = 0, WnR = 0
[ 3.056002] user pgtable: 4k pages, 48-bit VAs, pgdp=00000000940ea000
[ 3.056003] [0000000040000028] pgd=0000000000000000, p4d=0000000000000000
[ 3.058894] SELinux: policy capability open_perms=1
[ 3.062037]
[ 3.062041] Internal error: Oops: 96000004 [#1] SMP
[ 3.062044] Modules linked in: bnxt_en pcie_iproc_platform
[ 3.067530] SELinux: policy capability extended_socket_class=1
[ 3.070674] pcie_iproc diagbe(O)
[ 3.070680] CPU: 1 PID: 512 Comm: (tservice) Tainted: G O 5.10.17.1 #1
[ 3.070683] Hardware name: redacted (DT)
[ 3.073924] SELinux: policy capability always_check_network=0
[ 3.076890] pstate: 20400085 (nzCv daIf +PAN -UAO -TCO BTYPE=--)
[ 3.076904] pc : sidtab_do_lookup+0x128/0x1b4
[ 3.076908] lr : sidtab_context_to_sid+0x210/0x378
[ 3.080864] SELinux: policy capability cgroup_seclabel=1
[ 3.083920] sp : ffff800011bfbaf0
[ 3.083922] x29: ffff800011bfbaf0
[ 3.090570] SELinux: policy capability nnp_nosuid_transition=1
[ 3.097564] x28: 0000000000000000
[ 3.097566] x27: 0000000000000005 x26: ffffd1b6451f2000
[ 3.097568] x25: 00000000ffffffff x24: 0000000000000000
[ 3.102693] SELinux: policy capability genfs_seclabel_symlinks=0
[ 3.104225] x23: 0000000000000001 x22: 0000000000000005
[ 3.104227] x21: 0000000040000028 x20: 0000000000000001
[ 3.104230] x19: 00000000000000d0 x18: 5b00000000000000
[ 3.217640] x17: 0000000077493ecb x16: 00000000c2b2ae35
[ 3.223120] x15: 00000000639285d3 x14: 0000000000000010
[ 3.228600] x13: 00000000128b8525 x12: 000000005128b852
[ 3.234079] x11: 0000000049dd48c1 x10: 0000000000004e00
[ 3.239559] x9 : 00000000000000d1 x8 : 0000000000000005
[ 3.245039] x7 : 0000000000000001 x6 : 0000000000000000
[ 3.250518] x5 : ffff800011bfbc48 x4 : 0000000000000010
[ 3.255998] x3 : 0000000000000002 x2 : 0000000000000001
[ 3.261477] x1 : 00000000000000d0 x0 : 0000000040000000
[ 3.266957] Call trace:
[ 3.269480] sidtab_do_lookup+0x128/0x1b4
[ 3.273615] sidtab_context_to_sid+0x210/0x378
[ 3.278198] security_compute_sid+0x3f4/0x60c
[ 3.282692] security_transition_sid+0x30/0x38
[ 3.287277] selinux_bprm_creds_for_exec+0x11c/0x2e8
[ 3.292398] security_bprm_creds_for_exec+0x30/0x4c
[ 3.297432] bprm_execve+0x100/0x1d4
[ 3.301118] do_execveat_common+0x1e8/0x228
[ 3.305432] __arm64_sys_execve+0x3c/0x4c
[ 3.309569] do_el0_svc+0xec/0x154
[ 3.313079] el0_svc+0xc/0x14
[ 3.316139] el0_sync_handler+0x7c/0xd8
[ 3.320095] el0_sync+0x144/0x180
[ 3.323516] Code: 51002718 340001d7 1ad82768 8b284c15 (f94002a0)
[ 3.329808] ---[ end trace 69ebb6381d57e49b ]---
Tyler
On 2021-02-23 15:50:56, Tyler Hicks wrote:
> On 2021-02-23 15:43:48, Tyler Hicks wrote:
> > I'm seeing a race during policy load while the "regular" sidtab
> > conversion is happening and a live conversion starts to take place in
> > sidtab_context_to_sid().
> >
> > We have an initial policy that's loaded by systemd ~0.6s into boot and
> > then another policy gets loaded ~2-3s into boot. That second policy load
> > is what hits the race condition situation because the sidtab is only
> > partially populated and there's a decent amount of filesystem operations
> > happening, at the same time, which are triggering live conversions.
Hmm, perhaps this is the same problem that's fixed by Ondrej's proposed
change here:
https://lore.kernel.org/selinux/[email protected]/
I'll put these changes through a validation run (the only place that I
can seem to reproduce this crash) and see how it looks.
Tyler
On Tue, Feb 23, 2021 at 5:36 PM Tyler Hicks <[email protected]> wrote:
> On 2021-02-23 15:50:56, Tyler Hicks wrote:
> > On 2021-02-23 15:43:48, Tyler Hicks wrote:
> > > I'm seeing a race during policy load while the "regular" sidtab
> > > conversion is happening and a live conversion starts to take place in
> > > sidtab_context_to_sid().
> > >
> > > We have an initial policy that's loaded by systemd ~0.6s into boot and
> > > then another policy gets loaded ~2-3s into boot. That second policy load
> > > is what hits the race condition situation because the sidtab is only
> > > partially populated and there's a decent amount of filesystem operations
> > > happening, at the same time, which are triggering live conversions.
>
> Hmm, perhaps this is the same problem that's fixed by Ondrej's proposed
> change here:
>
> https://lore.kernel.org/selinux/[email protected]/
>
> I'll put these changes through a validation run (the only place that I
> can seem to reproduce this crash) and see how it looks.
Thanks, please let us know what you find out.
--
paul moore
http://www.paul-moore.com
On Tue, Feb 23, 2021 at 11:37 PM Tyler Hicks
<[email protected]> wrote:
> On 2021-02-23 15:50:56, Tyler Hicks wrote:
> > On 2021-02-23 15:43:48, Tyler Hicks wrote:
> > > I'm seeing a race during policy load while the "regular" sidtab
> > > conversion is happening and a live conversion starts to take place in
> > > sidtab_context_to_sid().
> > >
> > > We have an initial policy that's loaded by systemd ~0.6s into boot and
> > > then another policy gets loaded ~2-3s into boot. That second policy load
> > > is what hits the race condition situation because the sidtab is only
> > > partially populated and there's a decent amount of filesystem operations
> > > happening, at the same time, which are triggering live conversions.
>
> Hmm, perhaps this is the same problem that's fixed by Ondrej's proposed
> change here:
>
> https://lore.kernel.org/selinux/[email protected]/
>
> I'll put these changes through a validation run (the only place that I
> can seem to reproduce this crash) and see how it looks.
Hm... I think there is actually another race condition introduced by
the switch from rwlock to RCU [1]... Judging from the call trace you
may be hitting that.
Basically, before the switch the sidtab swapover worked like this:
1. Start live conversion of new entries.
2. Convert existing entries.
[Still only the old sidtab is visible to readers here.]
3. Swap sidtab under write lock.
4. Now only the new sidtab is visible to readers, so the old one can
be destroyed.
After the switch to RCU, we now have:
1. Start live conversion of new entries.
2. Convert existing entries.
3. RCU-assign the new policy pointer to selinux_state.
[!!! Now actually both old and new sidtab may be referenced by
readers, since there is no synchronization barrier previously provided
by the write lock.]
4. Wait for synchronize_rcu() to return.
5. Now only the new sidtab is visible to readers, so the old one can
be destroyed.
So the race can happen between 3. and 5., if one thread already sees
the new sidtab and adds a new entry there, and a second thread still
has the reference to the old sidtab and also tries to add a new entry;
live-converting to the new sidtab, which it doesn't expect to change
by itself. Unfortunately I failed to realize this when reviewing the
patch :/
I think the only two options to fix it are A) switching back to
read-write lock (the easy and safe way; undoing the performance
benefits of [1]), or B) implementing a safe two-way live conversion of
new sidtab entries, so that both tables are kept in sync while they
are both available (more complicated and with possible tricky
implications of different interpretations of contexts by the two
policies).
[1] 1b8b31a2e612 ("selinux: convert policy read-write lock to RCU")
--
Ondrej Mosnacek
Software Engineer, Linux Security - SELinux kernel
Red Hat, Inc.
On 2021-02-24 10:33:46, Ondrej Mosnacek wrote:
> On Tue, Feb 23, 2021 at 11:37 PM Tyler Hicks
> <[email protected]> wrote:
> > On 2021-02-23 15:50:56, Tyler Hicks wrote:
> > > On 2021-02-23 15:43:48, Tyler Hicks wrote:
> > > > I'm seeing a race during policy load while the "regular" sidtab
> > > > conversion is happening and a live conversion starts to take place in
> > > > sidtab_context_to_sid().
> > > >
> > > > We have an initial policy that's loaded by systemd ~0.6s into boot and
> > > > then another policy gets loaded ~2-3s into boot. That second policy load
> > > > is what hits the race condition situation because the sidtab is only
> > > > partially populated and there's a decent amount of filesystem operations
> > > > happening, at the same time, which are triggering live conversions.
> >
> > Hmm, perhaps this is the same problem that's fixed by Ondrej's proposed
> > change here:
> >
> > https://lore.kernel.org/selinux/[email protected]/
> >
> > I'll put these changes through a validation run (the only place that I
> > can seem to reproduce this crash) and see how it looks.
>
> Hm... I think there is actually another race condition introduced by
> the switch from rwlock to RCU [1]... Judging from the call trace you
> may be hitting that.
I believe your patches above fixed the race I was seeing. I was able to
make it through a full validation run without any crashes. Without those
patches applied, I would see several crashes resulting from this race
over the course of a validation run.
I'll continue to test with your changes and let you know if I end up
running into the other race you spotted.
Tyler
>
> Basically, before the switch the sidtab swapover worked like this:
> 1. Start live conversion of new entries.
> 2. Convert existing entries.
> [Still only the old sidtab is visible to readers here.]
> 3. Swap sidtab under write lock.
> 4. Now only the new sidtab is visible to readers, so the old one can
> be destroyed.
>
> After the switch to RCU, we now have:
> 1. Start live conversion of new entries.
> 2. Convert existing entries.
> 3. RCU-assign the new policy pointer to selinux_state.
> [!!! Now actually both old and new sidtab may be referenced by
> readers, since there is no synchronization barrier previously provided
> by the write lock.]
> 4. Wait for synchronize_rcu() to return.
> 5. Now only the new sidtab is visible to readers, so the old one can
> be destroyed.
>
> So the race can happen between 3. and 5., if one thread already sees
> the new sidtab and adds a new entry there, and a second thread still
> has the reference to the old sidtab and also tries to add a new entry;
> live-converting to the new sidtab, which it doesn't expect to change
> by itself. Unfortunately I failed to realize this when reviewing the
> patch :/
>
> I think the only two options to fix it are A) switching back to
> read-write lock (the easy and safe way; undoing the performance
> benefits of [1]), or B) implementing a safe two-way live conversion of
> new sidtab entries, so that both tables are kept in sync while they
> are both available (more complicated and with possible tricky
> implications of different interpretations of contexts by the two
> policies).
>
> [1] 1b8b31a2e612 ("selinux: convert policy read-write lock to RCU")
>
> --
> Ondrej Mosnacek
> Software Engineer, Linux Security - SELinux kernel
> Red Hat, Inc.
>
On Wed, Feb 24, 2021 at 3:43 PM Tyler Hicks <[email protected]> wrote:
> On 2021-02-24 10:33:46, Ondrej Mosnacek wrote:
> > On Tue, Feb 23, 2021 at 11:37 PM Tyler Hicks
> > <[email protected]> wrote:
> > > On 2021-02-23 15:50:56, Tyler Hicks wrote:
> > > > On 2021-02-23 15:43:48, Tyler Hicks wrote:
> > > > > I'm seeing a race during policy load while the "regular" sidtab
> > > > > conversion is happening and a live conversion starts to take place in
> > > > > sidtab_context_to_sid().
> > > > >
> > > > > We have an initial policy that's loaded by systemd ~0.6s into boot and
> > > > > then another policy gets loaded ~2-3s into boot. That second policy load
> > > > > is what hits the race condition situation because the sidtab is only
> > > > > partially populated and there's a decent amount of filesystem operations
> > > > > happening, at the same time, which are triggering live conversions.
> > >
> > > Hmm, perhaps this is the same problem that's fixed by Ondrej's proposed
> > > change here:
> > >
> > > https://lore.kernel.org/selinux/[email protected]/
> > >
> > > I'll put these changes through a validation run (the only place that I
> > > can seem to reproduce this crash) and see how it looks.
> >
> > Hm... I think there is actually another race condition introduced by
> > the switch from rwlock to RCU [1]... Judging from the call trace you
> > may be hitting that.
>
> I believe your patches above fixed the race I was seeing. I was able to
> make it through a full validation run without any crashes. Without those
> patches applied, I would see several crashes resulting from this race
> over the course of a validation run.
Hm... okay so probably you were indeed running into that bug. I tried
to reproduce the other race (I added a BUG_ON to help detect it), but
wasn't able to reproduce it with my (pretty aggressive) stress test. I
only managed to trigger it by adding a conditional delay in the right
place. So I now know the second bug is really there, though it seems
to be very unlikely to be hit in practice (might be more likely on
systems with many CPU cores, though). The first bug, OTOH, is
triggered almost instantly by my stress test.
Unless someone objects, I'll start working on a patch to switch back
to read-write lock for now. If all goes well, I'll send it sometime
next week.
>
> I'll continue to test with your changes and let you know if I end up
> running into the other race you spotted.
Thanks, but given the results of my testing it's probably not worth trying :)
>
> Tyler
>
> >
> > Basically, before the switch the sidtab swapover worked like this:
> > 1. Start live conversion of new entries.
> > 2. Convert existing entries.
> > [Still only the old sidtab is visible to readers here.]
> > 3. Swap sidtab under write lock.
> > 4. Now only the new sidtab is visible to readers, so the old one can
> > be destroyed.
> >
> > After the switch to RCU, we now have:
> > 1. Start live conversion of new entries.
> > 2. Convert existing entries.
> > 3. RCU-assign the new policy pointer to selinux_state.
> > [!!! Now actually both old and new sidtab may be referenced by
> > readers, since there is no synchronization barrier previously provided
> > by the write lock.]
> > 4. Wait for synchronize_rcu() to return.
> > 5. Now only the new sidtab is visible to readers, so the old one can
> > be destroyed.
> >
> > So the race can happen between 3. and 5., if one thread already sees
> > the new sidtab and adds a new entry there, and a second thread still
> > has the reference to the old sidtab and also tries to add a new entry;
> > live-converting to the new sidtab, which it doesn't expect to change
> > by itself. Unfortunately I failed to realize this when reviewing the
> > patch :/
> >
> > I think the only two options to fix it are A) switching back to
> > read-write lock (the easy and safe way; undoing the performance
> > benefits of [1]), or B) implementing a safe two-way live conversion of
> > new sidtab entries, so that both tables are kept in sync while they
> > are both available (more complicated and with possible tricky
> > implications of different interpretations of contexts by the two
> > policies).
> >
> > [1] 1b8b31a2e612 ("selinux: convert policy read-write lock to RCU")
> >
> > --
> > Ondrej Mosnacek
> > Software Engineer, Linux Security - SELinux kernel
> > Red Hat, Inc.
> >
>
--
Ondrej Mosnacek
Software Engineer, Linux Security - SELinux kernel
Red Hat, Inc.
On 2021-02-25 17:38:25, Ondrej Mosnacek wrote:
> On Wed, Feb 24, 2021 at 3:43 PM Tyler Hicks <[email protected]> wrote:
> > On 2021-02-24 10:33:46, Ondrej Mosnacek wrote:
> > > On Tue, Feb 23, 2021 at 11:37 PM Tyler Hicks
> > > <[email protected]> wrote:
> > > > On 2021-02-23 15:50:56, Tyler Hicks wrote:
> > > > > On 2021-02-23 15:43:48, Tyler Hicks wrote:
> > > > > > I'm seeing a race during policy load while the "regular" sidtab
> > > > > > conversion is happening and a live conversion starts to take place in
> > > > > > sidtab_context_to_sid().
> > > > > >
> > > > > > We have an initial policy that's loaded by systemd ~0.6s into boot and
> > > > > > then another policy gets loaded ~2-3s into boot. That second policy load
> > > > > > is what hits the race condition situation because the sidtab is only
> > > > > > partially populated and there's a decent amount of filesystem operations
> > > > > > happening, at the same time, which are triggering live conversions.
> > > >
> > > > Hmm, perhaps this is the same problem that's fixed by Ondrej's proposed
> > > > change here:
> > > >
> > > > https://lore.kernel.org/selinux/[email protected]/
> > > >
> > > > I'll put these changes through a validation run (the only place that I
> > > > can seem to reproduce this crash) and see how it looks.
> > >
> > > Hm... I think there is actually another race condition introduced by
> > > the switch from rwlock to RCU [1]... Judging from the call trace you
> > > may be hitting that.
> >
> > I believe your patches above fixed the race I was seeing. I was able to
> > make it through a full validation run without any crashes. Without those
> > patches applied, I would see several crashes resulting from this race
> > over the course of a validation run.
>
> Hm... okay so probably you were indeed running into that bug. I tried
> to reproduce the other race (I added a BUG_ON to help detect it), but
> wasn't able to reproduce it with my (pretty aggressive) stress test. I
> only managed to trigger it by adding a conditional delay in the right
> place. So I now know the second bug is really there, though it seems
> to be very unlikely to be hit in practice (might be more likely on
> systems with many CPU cores, though). The first bug, OTOH, is
> triggered almost instantly by my stress test.
>
> Unless someone objects, I'll start working on a patch to switch back
> to read-write lock for now. If all goes well, I'll send it sometime
> next week.
>
> >
> > I'll continue to test with your changes and let you know if I end up
> > running into the other race you spotted.
>
> Thanks, but given the results of my testing it's probably not worth trying :)
Those changes have now survived through several validation runs. I can
confidently say that they fix the race I was seeing.
Tyler
>
> >
> > Tyler
> >
> > >
> > > Basically, before the switch the sidtab swapover worked like this:
> > > 1. Start live conversion of new entries.
> > > 2. Convert existing entries.
> > > [Still only the old sidtab is visible to readers here.]
> > > 3. Swap sidtab under write lock.
> > > 4. Now only the new sidtab is visible to readers, so the old one can
> > > be destroyed.
> > >
> > > After the switch to RCU, we now have:
> > > 1. Start live conversion of new entries.
> > > 2. Convert existing entries.
> > > 3. RCU-assign the new policy pointer to selinux_state.
> > > [!!! Now actually both old and new sidtab may be referenced by
> > > readers, since there is no synchronization barrier previously provided
> > > by the write lock.]
> > > 4. Wait for synchronize_rcu() to return.
> > > 5. Now only the new sidtab is visible to readers, so the old one can
> > > be destroyed.
> > >
> > > So the race can happen between 3. and 5., if one thread already sees
> > > the new sidtab and adds a new entry there, and a second thread still
> > > has the reference to the old sidtab and also tries to add a new entry;
> > > live-converting to the new sidtab, which it doesn't expect to change
> > > by itself. Unfortunately I failed to realize this when reviewing the
> > > patch :/
> > >
> > > I think the only two options to fix it are A) switching back to
> > > read-write lock (the easy and safe way; undoing the performance
> > > benefits of [1]), or B) implementing a safe two-way live conversion of
> > > new sidtab entries, so that both tables are kept in sync while they
> > > are both available (more complicated and with possible tricky
> > > implications of different interpretations of contexts by the two
> > > policies).
> > >
> > > [1] 1b8b31a2e612 ("selinux: convert policy read-write lock to RCU")
> > >
> > > --
> > > Ondrej Mosnacek
> > > Software Engineer, Linux Security - SELinux kernel
> > > Red Hat, Inc.
> > >
> >
>
>
> --
> Ondrej Mosnacek
> Software Engineer, Linux Security - SELinux kernel
> Red Hat, Inc.
>
On Thu, Feb 25, 2021 at 11:38 AM Ondrej Mosnacek <[email protected]> wrote:
> Unless someone objects, I'll start working on a patch to switch back
> to read-write lock for now. If all goes well, I'll send it sometime
> next week.
Sorry, I was looking at other things and just got to this ... I'm not
overly excited about switching back to the read-write lock so quickly,
I'd rather we spend some additional time looking into resolving issues
with the current RCU code.
--
paul moore
http://www.paul-moore.com
On Wed, Feb 24, 2021 at 4:35 AM Ondrej Mosnacek <[email protected]> wrote:
> After the switch to RCU, we now have:
> 1. Start live conversion of new entries.
> 2. Convert existing entries.
> 3. RCU-assign the new policy pointer to selinux_state.
> [!!! Now actually both old and new sidtab may be referenced by
> readers, since there is no synchronization barrier previously provided
> by the write lock.]
> 4. Wait for synchronize_rcu() to return.
> 5. Now only the new sidtab is visible to readers, so the old one can
> be destroyed.
>
> So the race can happen between 3. and 5., if one thread already sees
> the new sidtab and adds a new entry there, and a second thread still
> has the reference to the old sidtab and also tries to add a new entry;
> live-converting to the new sidtab, which it doesn't expect to change
> by itself. Unfortunately I failed to realize this when reviewing the
> patch :/
It is possible I'm not fully understanding the problem and/or missing
an important detail - it is rather tricky code, and RCU can be very
hard to reason about at times - but I think we may be able to solve this
with some lock fixes inside sidtab_context_to_sid(). Let me try to
explain to see if we are on the same page here ...
The problem is when we have two (or more) threads trying to
add/convert the same context into a sid; the task with new_sidtab is
looking to add a new sidtab entry, while the task with old_sidtab is
looking to convert an entry in old_sidtab into a new entry in
new_sidtab. Boom.
Looking at the code in sidtab_context_to_sid(), when we have two
sidtabs that are currently active (old_sidtab->convert pointer is
valid) and a task with old_sidtab attempts to add a new entry to both
sidtabs it first adds it to the old sidtab then it also adds it to the
new sidtab. I believe the problem is that in this case while the task
grabs the old_sidtab->lock, it never grabs the new_sidtab->lock which
allows it to race with tasks that already see only new_sidtab. I
think adding code to sidtab_context_to_sid() which grabs the
new_sidtab->lock when adding entries to the new_sidtab *should* solve
the problem.
Did I miss something important? ;)
--
paul moore
http://www.paul-moore.com
On Fri, Feb 26, 2021 at 2:07 AM Paul Moore <[email protected]> wrote:
> On Wed, Feb 24, 2021 at 4:35 AM Ondrej Mosnacek <[email protected]> wrote:
> > After the switch to RCU, we now have:
> > 1. Start live conversion of new entries.
> > 2. Convert existing entries.
> > 3. RCU-assign the new policy pointer to selinux_state.
> > [!!! Now actually both old and new sidtab may be referenced by
> > readers, since there is no synchronization barrier previously provided
> > by the write lock.]
> > 4. Wait for synchronize_rcu() to return.
> > 5. Now only the new sidtab is visible to readers, so the old one can
> > be destroyed.
> >
> > So the race can happen between 3. and 5., if one thread already sees
> > the new sidtab and adds a new entry there, and a second thread still
> > has the reference to the old sidtab and also tries to add a new entry;
> > live-converting to the new sidtab, which it doesn't expect to change
> > by itself. Unfortunately I failed to realize this when reviewing the
> > patch :/
>
> It is possible I'm not fully understanding the problem and/or missing
> an important detail - it is rather tricky code, and RCU can be very
> hard to reason about at times - but I think we may be able to solve this
> with some lock fixes inside sidtab_context_to_sid(). Let me try to
> explain to see if we are on the same page here ...
>
> The problem is when we have two (or more) threads trying to
> add/convert the same context into a sid; the task with new_sidtab is
> looking to add a new sidtab entry, while the task with old_sidtab is
> looking to convert an entry in old_sidtab into a new entry in
> new_sidtab. Boom.
>
> Looking at the code in sidtab_context_to_sid(), when we have two
> sidtabs that are currently active (old_sidtab->convert pointer is
> valid) and a task with old_sidtab attempts to add a new entry to both
> sidtabs it first adds it to the old sidtab then it also adds it to the
> new sidtab. I believe the problem is that in this case while the task
> grabs the old_sidtab->lock, it never grabs the new_sidtab->lock which
> allows it to race with tasks that already see only new_sidtab. I
> think adding code to sidtab_context_to_sid() which grabs the
> new_sidtab->lock when adding entries to the new_sidtab *should* solve
> the problem.
>
> Did I miss something important? ;)
Sadly, yes :) Consider this scenario (assuming we fix the locking at
sidtab level):
If it happens that a new SID (x) is added via the new sidtab and then
another one (y) via the old sidtab, to avoid clash of SIDs, we would
need to leave a "hole" in the old sidtab for SID x. And this will
cause trouble if the thread that has just added SID y then tries to
translate the context string corresponding to SID x (without re-taking
the RCU read lock and refreshing the policy pointer). Even if we
handle skipping the "holes" in the old sidtab safely, the translation
would then end up adding a duplicate SID entry for the context already
represented by SID x - which is not a state we want to end up in.
This is why I said that to fully fix this, we'd need to have a
both-ways live conversion in place. (And that already starts to feel
like too much hacking for something that should probably go to
stable@...)
--
Ondrej Mosnacek
Software Engineer, Linux Security - SELinux kernel
Red Hat, Inc.
On Fri, Feb 26, 2021 at 5:08 AM Hillf Danton <[email protected]> wrote:
> On Thu, 25 Feb 2021 20:06:45 -0500 Paul Moore wrote:
> > On Wed, Feb 24, 2021 at 4:35 AM Ondrej Mosnacek <[email protected]> wrote:
> > > After the switch to RCU, we now have:
> > > 1. Start live conversion of new entries.
> > > 2. Convert existing entries.
> > > 3. RCU-assign the new policy pointer to selinux_state.
> > > [!!! Now actually both old and new sidtab may be referenced by
> > > readers, since there is no synchronization barrier previously provided
> > > by the write lock.]
> > > 4. Wait for synchronize_rcu() to return.
> > > 5. Now only the new sidtab is visible to readers, so the old one can
> > > be destroyed.
> > >
> > > So the race can happen between 3. and 5., if one thread already sees
> > > the new sidtab and adds a new entry there, and a second thread still
> > > has the reference to the old sidtab and also tries to add a new entry;
> > > live-converting to the new sidtab, which it doesn't expect to change
> > > by itself. Unfortunately I failed to realize this when reviewing the
> > > patch :/
> >
> > It is possible I'm not fully understanding the problem and/or missing
> > an important detail - it is rather tricky code, and RCU can be very
> > hard to reason about at times - but I think we may be able to solve this
> > with some lock fixes inside sidtab_context_to_sid(). Let me try to
> > explain to see if we are on the same page here ...
> >
> > The problem is when we have two (or more) threads trying to
> > add/convert the same context into a sid; the task with new_sidtab is
> > looking to add a new sidtab entry, while the task with old_sidtab is
> > looking to convert an entry in old_sidtab into a new entry in
> > new_sidtab. Boom.
> >
> > Looking at the code in sidtab_context_to_sid(), when we have two
> > sidtabs that are currently active (old_sidtab->convert pointer is
> > valid) and a task with old_sidtab attempts to add a new entry to both
> > sidtabs it first adds it to the old sidtab then it also adds it to the
> > new sidtab. I believe the problem is that in this case while the task
> > grabs the old_sidtab->lock, it never grabs the new_sidtab->lock which
> > allows it to race with tasks that already see only new_sidtab. I
> > think adding code to sidtab_context_to_sid() which grabs the
> > new_sidtab->lock when adding entries to the new_sidtab *should* solve
> > the problem.
> >
> > Did I miss something important? ;)
>
If the convert pointer can be dereferenced without the lock, we can opt to
convert the context after building the sidtab entry, with the risk of an
AB-BA deadlock cut out. Below is the minimum change I can think of along
your direction.
We could fix this a bit more easily by just having a shared spinlock
for both (well, *all*) sidtabs. Yes, we'd need to have it all the way
up in selinux_state and pass it through to sidtab_init(), but IMHO
that's less bad than trying to get it right with two locks.
--
Ondrej Mosnacek
Software Engineer, Linux Security - SELinux kernel
Red Hat, Inc.
On Fri, Feb 26, 2021 at 6:12 AM Ondrej Mosnacek <[email protected]> wrote:
> On Fri, Feb 26, 2021 at 2:07 AM Paul Moore <[email protected]> wrote:
> > On Wed, Feb 24, 2021 at 4:35 AM Ondrej Mosnacek <[email protected]> wrote:
> > > After the switch to RCU, we now have:
> > > 1. Start live conversion of new entries.
> > > 2. Convert existing entries.
> > > 3. RCU-assign the new policy pointer to selinux_state.
> > > [!!! Now actually both old and new sidtab may be referenced by
> > > readers, since there is no synchronization barrier previously provided
> > > by the write lock.]
> > > 4. Wait for synchronize_rcu() to return.
> > > 5. Now only the new sidtab is visible to readers, so the old one can
> > > be destroyed.
> > >
> > > So the race can happen between 3. and 5., if one thread already sees
> > > the new sidtab and adds a new entry there, and a second thread still
> > > has the reference to the old sidtab and also tries to add a new entry;
> > > live-converting to the new sidtab, which it doesn't expect to change
> > > by itself. Unfortunately I failed to realize this when reviewing the
> > > patch :/
> >
> > It is possible I'm not fully understanding the problem and/or missing
> > an important detail - it is rather tricky code, and RCU can be very
> > hard to reason about at times - but I think we may be able to solve this
> > with some lock fixes inside sidtab_context_to_sid(). Let me try to
> > explain to see if we are on the same page here ...
> >
> > The problem is when we have two (or more) threads trying to
> > add/convert the same context into a sid; the task with new_sidtab is
> > looking to add a new sidtab entry, while the task with old_sidtab is
> > looking to convert an entry in old_sidtab into a new entry in
> > new_sidtab. Boom.
> >
> > Looking at the code in sidtab_context_to_sid(), when we have two
> > sidtabs that are currently active (old_sidtab->convert pointer is
> > valid) and a task with old_sidtab attempts to add a new entry to both
> > sidtabs it first adds it to the old sidtab then it also adds it to the
> > new sidtab. I believe the problem is that in this case while the task
> > grabs the old_sidtab->lock, it never grabs the new_sidtab->lock which
> > allows it to race with tasks that already see only new_sidtab. I
> > think adding code to sidtab_context_to_sid() which grabs the
> > new_sidtab->lock when adding entries to the new_sidtab *should* solve
> > the problem.
> >
> > Did I miss something important? ;)
>
> Sadly, yes :) Consider this scenario (assuming we fix the locking at
> sidtab level):
>
> If it happens that a new SID (x) is added via the new sidtab and then
> another one (y) via the old sidtab, to avoid clash of SIDs, we would
> need to leave a "hole" in the old sidtab for SID x. And this will
> cause trouble if the thread that has just added SID y then tries to
> translate the context string corresponding to SID x (without re-taking
> the RCU read lock and refreshing the policy pointer). Even if we
> handle skipping the "holes" in the old sidtab safely, the translation
> would then end up adding a duplicate SID entry for the context already
> represented by SID x - which is not a state we want to end up in.
Ah, yes, you're right. I was only thinking about the problem of
adding an entry to the old sidtab, and not the (much more likely case)
of an entry being added to the new sidtab. Bummer.
Thinking aloud for a moment - what if we simply refused to add new
sidtab entries if the task's sidtab pointer is "old"? Common sense
would tell us that this scenario should be very rare at present, and I
believe the testing mentioned in this thread adds some weight to that
claim. After all, this only affects tasks which entered into their
RCU protected session prior to the policy load RCU sync *AND* are
attempting to add a new entry to the sidtab. That *has* to be a
really low percentage, especially on a system that has been up and
running for some time. My gut feeling is this should be safe as well;
all of the calling code should have the necessary error handling in
place as there are plenty of reasons why we could normally fail to add
an entry to the sidtab; memory allocation failures being the most
obvious failure point I would suspect. The obvious downside to such
an approach is that those operations which do meet these criteria would
fail - and we should likely emit an error in this case - but is this
failure really worse than any other transient kernel failure, and is
attempting to mitigate this failure worth abandoning the RCU approach
for the sidtab?
--
paul moore
http://www.paul-moore.com
On Mon, Mar 1, 2021 at 5:36 AM Ondrej Mosnacek <[email protected]> wrote:
> On Sun, Feb 28, 2021 at 8:21 PM Paul Moore <[email protected]> wrote:
> > On Fri, Feb 26, 2021 at 6:12 AM Ondrej Mosnacek <[email protected]> wrote:
> > > On Fri, Feb 26, 2021 at 2:07 AM Paul Moore <[email protected]> wrote:
> > > > On Wed, Feb 24, 2021 at 4:35 AM Ondrej Mosnacek <[email protected]> wrote:
...
> > Ah, yes, you're right. I was only thinking about the problem of
> > adding an entry to the old sidtab, and not the (much more likely case)
> > of an entry being added to the new sidtab. Bummer.
> >
> > Thinking aloud for a moment - what if we simply refused to add new
> > sidtab entries if the task's sidtab pointer is "old"? Common sense
> > would tell us that this scenario should be very rare at present, and I
> > believe the testing mentioned in this thread adds some weight to that
> > claim. After all, this only affects tasks which entered into their
> > RCU protected session prior to the policy load RCU sync *AND* are
> > attempting to add a new entry to the sidtab. That *has* to be a
> > really low percentage, especially on a system that has been up and
> > running for some time. My gut feeling is this should be safe as well;
> > all of the calling code should have the necessary error handling in
> > place as there are plenty of reasons why we could normally fail to add
> > an entry to the sidtab; memory allocation failures being the most
> > obvious failure point I would suspect. The obvious downside to such
> > an approach is that those operations which do meet these criteria would
> > fail - and we should likely emit an error in this case - but is this
> > failure really worse than any other transient kernel failure,
>
> No, I don't like this approach at all. Before the sidtab refactor, it
> had been done exactly this way ...
I recognize I probably haven't made my feelings about reverts clear,
or if I have, I haven't done so recently. Let me fix that now: I
*hate* them. Further I hate reverts with a deep, passionate hatred
that I reserve for very few things. Maybe we have to revert this
change, even though I *hate* reverts, they do remain an option; you
just need to be 99% sure you've exhausted all the other options first.
> Perhaps it wasn't clear from what I wrote, but I certainly don't want
> to abandon it completely. Just to revert to a safe state until we
> figure out how to do the RCU policy reload safely. The solution with
> two-way conversion seems doable, it's just not a quick and easy fix.
I suggest pursuing this before the revert to see what it looks like
and we can discuss it further during review.
--
paul moore
http://www.paul-moore.com
On Sun, Feb 28, 2021 at 8:21 PM Paul Moore <[email protected]> wrote:
> On Fri, Feb 26, 2021 at 6:12 AM Ondrej Mosnacek <[email protected]> wrote:
> > On Fri, Feb 26, 2021 at 2:07 AM Paul Moore <[email protected]> wrote:
> > > On Wed, Feb 24, 2021 at 4:35 AM Ondrej Mosnacek <[email protected]> wrote:
> > > > After the switch to RCU, we now have:
> > > > 1. Start live conversion of new entries.
> > > > 2. Convert existing entries.
> > > > 3. RCU-assign the new policy pointer to selinux_state.
> > > > [!!! Now actually both old and new sidtab may be referenced by
> > > > readers, since there is no synchronization barrier previously provided
> > > > by the write lock.]
> > > > 4. Wait for synchronize_rcu() to return.
> > > > 5. Now only the new sidtab is visible to readers, so the old one can
> > > > be destroyed.
> > > >
> > > > So the race can happen between 3. and 5., if one thread already sees
> > > > the new sidtab and adds a new entry there, and a second thread still
> > > > has the reference to the old sidtab and also tries to add a new entry;
> > > > live-converting to the new sidtab, which it doesn't expect to change
> > > > by itself. Unfortunately I failed to realize this when reviewing the
> > > > patch :/
> > >
> > > It is possible I'm not fully understanding the problem and/or missing
> > > an important detail - it is rather tricky code, and RCU can be very
> > > hard to reason about at times - but I think we may be able to solve this
> > > with some lock fixes inside sidtab_context_to_sid(). Let me try to
> > > explain to see if we are on the same page here ...
> > >
> > > The problem is when we have two (or more) threads trying to
> > > add/convert the same context into a sid; the task with new_sidtab is
> > > looking to add a new sidtab entry, while the task with old_sidtab is
> > > looking to convert an entry in old_sidtab into a new entry in
> > > new_sidtab. Boom.
> > >
> > > Looking at the code in sidtab_context_to_sid(), when we have two
> > > sidtabs that are currently active (old_sidtab->convert pointer is
> > > valid) and a task with old_sidtab attempts to add a new entry to both
> > > sidtabs it first adds it to the old sidtab then it also adds it to the
> > > new sidtab. I believe the problem is that in this case while the task
> > > grabs the old_sidtab->lock, it never grabs the new_sidtab->lock which
> > > allows it to race with tasks that already see only new_sidtab. I
> > > think adding code to sidtab_context_to_sid() which grabs the
> > > new_sidtab->lock when adding entries to the new_sidtab *should* solve
> > > the problem.
> > >
> > > Did I miss something important? ;)
> >
> > Sadly, yes :) Consider this scenario (assuming we fix the locking at
> > sidtab level):
> >
> > If it happens that a new SID (x) is added via the new sidtab and then
> > another one (y) via the old sidtab, to avoid clash of SIDs, we would
> > need to leave a "hole" in the old sidtab for SID x. And this will
> > cause trouble if the thread that has just added SID y then tries to
> > translate the context string corresponding to SID x (without re-taking
> > the RCU read lock and refreshing the policy pointer). Even if we
> > handle skipping the "holes" in the old sidtab safely, the translation
> > would then end up adding a duplicate SID entry for the context already
> > represented by SID x - which is not a state we want to end up in.
>
> Ah, yes, you're right. I was only thinking about the problem of
> adding an entry to the old sidtab, and not the (much more likely case)
> of an entry being added to the new sidtab. Bummer.
>
> Thinking aloud for a moment - what if we simply refused to add new
> sidtab entries if the task's sidtab pointer is "old"? Common sense
> would tell us that this scenario should be very rare at present, and I
> believe the testing mentioned in this thread adds some weight to that
> claim. After all, this only affects tasks which entered into their
> RCU protected session prior to the policy load RCU sync *AND* are
> attempting to add a new entry to the sidtab. That *has* to be a
> really low percentage, especially on a system that has been up and
> running for some time. My gut feeling is this should be safe as well;
> all of the calling code should have the necessary error handling in
> place as there are plenty of reasons why we could normally fail to add
> an entry to the sidtab; memory allocation failures being the most
> obvious failure point I would suspect. The obvious downside to such
> an approach is that those operations which do meet these criteria would
> fail - and we should likely emit an error in this case - but is this
> failure really worse than any other transient kernel failure,
No, I don't like this approach at all. Before the sidtab refactor, it
had been done exactly this way - ENOMEM was returned while the sidtab
was "frozen" (i.e. while the existing entries were being converted).
And this was a real nuisance because things would fail randomly during
policy reload. And it's not just unimportant explicit userspace
actions that can fail. Any kind of transition can lead to a new SID
being created and you'd get things like execve(), mkdir(), ... returning
-ENOMEM sometimes. (With a low probability, but still...)
I wouldn't compare it to a memory allocation failure, which normally
starts happening only when the system becomes overloaded. Here the
user would *always* have some probability of getting this error, and
they couldn't do anything about it.
> and is
> attempting to mitigate this failure worth abandoning the RCU approach
> for the sidtab?
Perhaps it wasn't clear from what I wrote, but I certainly don't want
to abandon it completely. Just to revert to a safe state until we
figure out how to do the RCU policy reload safely. The solution with
two-way conversion seems doable, it's just not a quick and easy fix.
--
Ondrej Mosnacek
Software Engineer, Linux Security - SELinux kernel
Red Hat, Inc.
On Sat, Feb 27, 2021 at 3:35 AM Hillf Danton <[email protected]> wrote:
> On Fri, 26 Feb 2021 12:19:35 +0100 Ondrej Mosnacek wrote:
> > On Fri, Feb 26, 2021 at 5:08 AM Hillf Danton <[email protected]> wrote:
> > > On Thu, 25 Feb 2021 20:06:45 -0500 Paul Moore wrote:
> > > > On Wed, Feb 24, 2021 at 4:35 AM Ondrej Mosnacek <[email protected]> wrote:
> > > > > After the switch to RCU, we now have:
> > > > > 1. Start live conversion of new entries.
> > > > > 2. Convert existing entries.
> > > > > 3. RCU-assign the new policy pointer to selinux_state.
> > > > > [!!! Now actually both old and new sidtab may be referenced by
> > > > > readers, since there is no synchronization barrier previously provided
> > > > > by the write lock.]
> > > > > 4. Wait for synchronize_rcu() to return.
> > > > > 5. Now only the new sidtab is visible to readers, so the old one can
> > > > > be destroyed.
> > > > >
> > > > > So the race can happen between 3. and 5., if one thread already sees
> > > > > the new sidtab and adds a new entry there, and a second thread still
> > > > > has the reference to the old sidtab and also tries to add a new entry;
> > > > > live-converting to the new sidtab, which it doesn't expect to change
> > > > > by itself. Unfortunately I failed to realize this when reviewing the
> > > > > patch :/
> > > >
> > > > It is possible I'm not fully understanding the problem and/or missing
> > > > an important detail - it is rather tricky code, and RCU can be very
> > > > hard to reason about at times - but I think we may be able to solve this
> > > > with some lock fixes inside sidtab_context_to_sid(). Let me try to
> > > > explain to see if we are on the same page here ...
> > > >
> > > > The problem is when we have two (or more) threads trying to
> > > > add/convert the same context into a sid; the task with new_sidtab is
> > > > looking to add a new sidtab entry, while the task with old_sidtab is
> > > > looking to convert an entry in old_sidtab into a new entry in
> > > > new_sidtab. Boom.
> > > >
> > > > Looking at the code in sidtab_context_to_sid(), when we have two
> > > > sidtabs that are currently active (old_sidtab->convert pointer is
> > > > valid) and a task with old_sidtab attempts to add a new entry to both
> > > > sidtabs it first adds it to the old sidtab then it also adds it to the
> > > > new sidtab. I believe the problem is that in this case while the task
> > > > grabs the old_sidtab->lock, it never grabs the new_sidtab->lock which
> > > > allows it to race with tasks that already see only new_sidtab. I
> > > > think adding code to sidtab_context_to_sid() which grabs the
> > > > new_sidtab->lock when adding entries to the new_sidtab *should* solve
> > > > the problem.
> > > >
> > > > Did I miss something important? ;)
> > >
> > > If the convert pointer can be dereferenced without the lock, we can opt to
> > > convert the context after building the sidtab entry, with the risk of an
> > > AB-BA deadlock cut out. Below is the minimum change I can think of along
> > > your direction.
> >
> > We could fix this a bit more easily by just having a shared spinlock
> > for both (well, *all*) sidtabs. Yes, we'd need to have it all the way
>
> Looking forward to reading how to add another BKL,
This is not even remotely comparable to the Big Kernel Lock... 99.9%
of the time, there will be just one sidtab, otherwise there will be
just two for a short time during the policy reload, which need to be
updated in sync anyway.
>
> > up in selinux_state and pass it through to sidtab_init(), but IMHO
> > that's less bad than trying to get it right with two locks.
>
> instead of nearly pure code churn in order to work around the deadlock.
I wish it were that simple, but it won't work this way. Consider a
scenario like this:
1. Thread A acquires the old policy pointer, encounters a new context,
adds it to the old sidtab as X, but doesn't start adding it to the new
one yet.
2. Thread B acquires the new policy pointer, encounters a different
new context and adds it to the new sidtab as X.
[Now you end up with two different contexts assigned to the same SID,
depending on the policy]
3. Now thread A continues to add the context to the new sidtab, but
since it reuses the count variable from before, it overwrites the
entry for X.
(And even if you fixed that, re-searched the new sidtab, and added the
new context to the end, you'd end up with the context having different
SIDs under the old vs. new policy, which is again a messy situation.)
You see, moving the sidtab code around can get you in trouble and
that's why I'm advocating for using a common lock - it's much easier
to prove that such a change doesn't create new bugs. And as I said
above, merging the locks won't really have performance consequences,
so I still believe it to be the better choice here.
--
Ondrej Mosnacek
Software Engineer, Linux Security - SELinux kernel
Red Hat, Inc.