2021-12-10 16:27:43

by Peter Zijlstra

Subject: [PATCH v2 8/9] atomic,x86: Alternative atomic_*_overflow() scheme

Shift the overflow range from [0,INT_MIN] to [-1,INT_MIN]; this allows
optimizing atomic_inc_overflow() to use "jle" to detect an increment
from free-or-negative (with -1 being the new free value and its
increment being 0, which sets ZF).

This then obviously changes atomic_dec*_overflow() since it must now
detect the 0->-1 transition rather than the 1->0. Luckily this is
reflected in the carry flag (since we need to borrow to decrement 0).
However this means decrement must now use the SUB instruction with a
literal, since DEC doesn't set CF.

This then gives the following primitives:

      [-1, INT_MIN]                [0, INT_MIN]

  inc()                            inc()
    lock inc %[var]                  mov $-1, %[reg]
    jle  error-free-or-negative      lock xadd %[reg], %[var]
                                     test %[reg], %[reg]
                                     jle  error-zero-or-negative

  dec()                            dec()
    lock sub $1, %[var]              lock dec %[var]
    jc   error-to-free               jle  error-zero-or-negative
    jl   error-from-negative

  dec_and_test()                   dec_and_test()
    lock sub $1, %[var]              lock dec %[var]
    jc   do-free                     jl   error-from-negative
    jl   error-from-negative         je   do-free

Make sure to set ATOMIC_OVERFLOW_OFFSET to 1 such that other code
interacting with these primitives can re-center 0.
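
For illustration only (these helpers are not part of this patch and the
names are made up), generic code that seeds or reads the raw counter
would re-center by the offset along these lines:

    /* logical value n is stored re-centered as n - ATOMIC_OVERFLOW_OFFSET */
    static inline void example_ref_set(atomic_t *v, int n)
    {
            atomic_set(v, n - ATOMIC_OVERFLOW_OFFSET);
    }

    /* recover the logical value from the re-centered storage */
    static inline int example_ref_read(const atomic_t *v)
    {
            return atomic_read(v) + ATOMIC_OVERFLOW_OFFSET;
    }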

Suggested-by: Linus Torvalds <[email protected]>
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
---
arch/x86/include/asm/atomic.h | 20 +++++++++++++++-----
1 file changed, 15 insertions(+), 5 deletions(-)

--- a/arch/x86/include/asm/atomic.h
+++ b/arch/x86/include/asm/atomic.h
@@ -263,21 +263,31 @@ static __always_inline int arch_atomic_f
 }
 #define arch_atomic_fetch_xor arch_atomic_fetch_xor
 
-#define arch_atomic_dec_overflow(_v, _label) \
-        asm_volatile_goto(LOCK_PREFIX "decl %[var]\n\t" \
+#define ATOMIC_OVERFLOW_OFFSET 1
+
+#define arch_atomic_inc_overflow(_v, _label) \
+        asm_volatile_goto(LOCK_PREFIX "incl %[var]\n\t" \
                           "jle %l1" \
                           : : [var] "m" ((_v)->counter) \
                           : "memory" \
                           : _label)
 
+#define arch_atomic_dec_overflow(_v, _label) \
+        asm_volatile_goto(LOCK_PREFIX "subl $1, %[var]\n\t" \
+                          "jc %l1\n\t" \
+                          "jl %l1" \
+                          : : [var] "m" ((_v)->counter) \
+                          : "memory" \
+                          : _label)
+
 #define arch_atomic_dec_and_test_overflow(_v, _label) \
 ({ \
         __label__ __zero; \
         __label__ __out; \
         bool __ret = false; \
-        asm_volatile_goto(LOCK_PREFIX "decl %[var]\n\t" \
-                          "jl %l2\n\t" \
-                          "je %l[__zero]" \
+        asm_volatile_goto(LOCK_PREFIX "subl $1, %[var]\n\t" \
+                          "jc %l[__zero]\n\t" \
+                          "jl %l2" \
                           : : [var] "m" ((_v)->counter) \
                           : "memory" \
                           : __zero, _label); \




2021-12-10 16:53:51

by Linus Torvalds

Subject: Re: [PATCH v2 8/9] atomic,x86: Alternative atomic_*_overflow() scheme

On Fri, Dec 10, 2021 at 8:27 AM Peter Zijlstra <[email protected]> wrote:
>
> Shift the overflow range from [0,INT_MIN] to [-1,INT_MIN]; this allows
> optimizing atomic_inc_overflow() to use "jle" to detect an increment
> from free-or-negative (with -1 being the new free value and its
> increment being 0, which sets ZF).

Thanks.

However, I think you can simplify this further:

> This then gives the following primitives:
>
>       [-1, INT_MIN]                [0, INT_MIN]
>
>   inc()                            inc()
>     lock inc %[var]                  mov $-1, %[reg]
>     jle  error-free-or-negative      lock xadd %[reg], %[var]
>                                      test %[reg], %[reg]
>                                      jle  error-zero-or-negative
>
>   dec()                            dec()
>     lock sub $1, %[var]              lock dec %[var]
>     jc   error-to-free               jle  error-zero-or-negative
>     jl   error-from-negative
>
>   dec_and_test()                   dec_and_test()
>     lock sub $1, %[var]              lock dec %[var]
>     jc   do-free                     jl   error-from-negative
>     jl   error-from-negative         je   do-free

That "dec()" case could be just

lock dec %[var]
js error

because an underflow is an underflow - it doesn't matter if it's an "it
went to free" or an "it became some other negative number".
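
In the form the patch uses, that simplification would look something
like this (sketch only, not a posted patch):

    #define arch_atomic_dec_overflow(_v, _label) \
            asm_volatile_goto(LOCK_PREFIX "decl %[var]\n\t" \
                              "js %l1" \
                              : : [var] "m" ((_v)->counter) \
                              : "memory" \
                              : _label)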

That said - it may not matter - I'm not sure a plain "dec" is even a
valid operation on a ref in the first place. How could you ever
validly decrement a ref without checking for it being the last entry?

So I'm not sure "atomic_dec_overflow()" is even worth having as a
primitive, because I can't see any valid use for it. Is it for some
legacy case?

Linus

2021-12-10 17:28:13

by Linus Torvalds

Subject: Re: [PATCH v2 8/9] atomic,x86: Alternative atomic_*_overflow() scheme

On Fri, Dec 10, 2021 at 8:53 AM Linus Torvalds
<[email protected]> wrote:
>
> That said - it may not matter - I'm not sure a plain "dec" is even a
> valid operation on a ref in the first place. How could you ever
> validly decrement a ref without checking for it being the last entry?

I should have checked the users - it seems to be a pattern at least in
networking where people have extra references and do

refcount_dec(&skb->users);
dev_kfree_skb_any(skb);

because there's no way to tell dev_kfree_skb*() to decrement more than once.

So I guess it's all good, but yes, I still think you can just do "lock
dec .. js" for this operation.

Linus

2021-12-13 16:43:47

by Peter Zijlstra

Subject: Re: [PATCH v2 8/9] atomic,x86: Alternative atomic_*_overflow() scheme

On Fri, Dec 10, 2021 at 05:16:26PM +0100, Peter Zijlstra wrote:
> Shift the overflow range from [0,INT_MIN] to [-1,INT_MIN]; this allows
> optimizing atomic_inc_overflow() to use "jle" to detect an increment
> from free-or-negative (with -1 being the new free value and its
> increment being 0, which sets ZF).
>
> This then obviously changes atomic_dec*_overflow() since it must now
> detect the 0->-1 transition rather than the 1->0. Luckily this is
> reflected in the carry flag (since we need to borrow to decrement 0).
> However this means decrement must now use the SUB instruction with a
> literal, since DEC doesn't set CF.
>
> This then gives the following primitives:
>
>       [-1, INT_MIN]                [0, INT_MIN]
>
>   inc()                            inc()
>     lock inc %[var]                  mov $-1, %[reg]
>     jle  error-free-or-negative      lock xadd %[reg], %[var]
>                                      test %[reg], %[reg]
>                                      jle  error-zero-or-negative
>
>   dec()                            dec()
>     lock sub $1, %[var]              lock dec %[var]
>     jc   error-to-free               jle  error-zero-or-negative
>     jl   error-from-negative
>
>   dec_and_test()                   dec_and_test()
>     lock sub $1, %[var]              lock dec %[var]
>     jc   do-free                     jl   error-from-negative
>     jl   error-from-negative         je   do-free
>
> Make sure to set ATOMIC_OVERFLOW_OFFSET to 1 such that other code
> interacting with these primitives can re-center 0.

So Marco was expressing doubt about this exact interface for the
atomic_*_overflow() functions, since it's extremely easy to get the
whole ATOMIC_OVERFLOW_OFFSET thing wrong.

Since the current ops are strictly those that require inline asm, the
interface is fairly incomplete, which forces anybody who's going to use
these to provide whatever is missing, e.g. atomic_inc_not_zero_overflow().

Another proposal had the user supply the offset as a compile-time
constant to the function itself, raising a build-bug for any unsupported
offset. This would ensure the caller is at least aware of any non-zero
offset... still not really going to be dummy-proof either.
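
Roughly, that option would look something like this (the name and shape
are purely illustrative, nothing like it has been posted):

    #define atomic_inc_overflow_offset(_v, _offset, _label) \
    do { \
            /* reject any offset the architecture does not implement */ \
            BUILD_BUG_ON((_offset) != ATOMIC_OVERFLOW_OFFSET); \
            arch_atomic_inc_overflow(_v, _label); \
    } while (0)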

Alternatively we could provide a more complete set of ops and/or a whole
new type, but... I'm not sure about that either.

I suppose I can try and do something like refcount_overflow_t and
implement the whole current refcount API in terms of that. Basically
everywhere we currently do refcount_warn_saturate() would become goto
label.

And then refcount_t could be a thin wrapper on top of that. But urgh...
lots of work, very little gain.
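
Very roughly, and assuming an atomic_inc_overflow() wrapper over the
arch op, that thin wrapper would look something like (sketch only):

    static inline void refcount_inc(refcount_t *r)
    {
            atomic_inc_overflow(&r->refs, overflow);
            return;
    overflow:
            refcount_warn_saturate(r, REFCOUNT_ADD_OVF);
    }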

So what do we do? Keep things as is and think about it again once we
get the first bug in hand, preemptively add a few ops, or go completely
overboard?

Obviously I'm all for keeping things as is (less work for this lazy
bastard etc..)

2021-12-13 17:29:39

by Marco Elver

Subject: Re: [PATCH v2 8/9] atomic,x86: Alternative atomic_*_overflow() scheme

On Mon, Dec 13, 2021 at 05:43PM +0100, Peter Zijlstra wrote:
> On Fri, Dec 10, 2021 at 05:16:26PM +0100, Peter Zijlstra wrote:
> > Shift the overflow range from [0,INT_MIN] to [-1,INT_MIN]; this allows
> > optimizing atomic_inc_overflow() to use "jle" to detect an increment
> > from free-or-negative (with -1 being the new free value and its
> > increment being 0, which sets ZF).
> >
> > This then obviously changes atomic_dec*_overflow() since it must now
> > detect the 0->-1 transition rather than the 1->0. Luckily this is
> > reflected in the carry flag (since we need to borrow to decrement 0).
> > However this means decrement must now use the SUB instruction with a
> > literal, since DEC doesn't set CF.
> >
> > This then gives the following primitives:
> >
> >       [-1, INT_MIN]                [0, INT_MIN]
> >
> >   inc()                            inc()
> >     lock inc %[var]                  mov $-1, %[reg]
> >     jle  error-free-or-negative      lock xadd %[reg], %[var]
> >                                      test %[reg], %[reg]
> >                                      jle  error-zero-or-negative
> >
> >   dec()                            dec()
> >     lock sub $1, %[var]              lock dec %[var]
> >     jc   error-to-free               jle  error-zero-or-negative
> >     jl   error-from-negative
> >
> >   dec_and_test()                   dec_and_test()
> >     lock sub $1, %[var]              lock dec %[var]
> >     jc   do-free                     jl   error-from-negative
> >     jl   error-from-negative         je   do-free
> >
> > Make sure to set ATOMIC_OVERFLOW_OFFSET to 1 such that other code
> > interacting with these primitives can re-center 0.
>
> So Marco was expressing doubt about this exact interface for the
> atomic_*_overflow() functions, since it's extremely easy to get the
> whole ATOMIC_OVERFLOW_OFFSET thing wrong.
>
> Since the current ops are strictly those that require inline asm, the
> interface is fairly incomplete, which forces anybody who's going to use
> these to provide whatever is missing. eg. atomic_inc_not_zero_overflow()
> for example.
>
> Another proposal had the user supply the offset as a compile time
> constant to the function itself, raising a build-bug for any unsupported
> offset. This would ensure the caller is at least aware of any non-zero
> offset... still not going to really be dummy proof either.

In the spirit of making the interface harder to misuse, this would at
least ensure that non-refcount_t code that wants to use
atomic_*overflow() is 100% aware of this. Which is half of the issue I
think.

The other half is code using the actual values, and ensuring it's offset
correctly. This might also be an issue in e.g. refcount_t, if someone
wants to modify or extend it, although it's easy enough to audit and
review in such central data structures as refcount_t.

> Alternatively we could provide a more complete set of ops and/or a whole
> new type, but... I'm not sure about that either.
>
> I suppose I can try and do something like refcount_overflow_t and
> implement the whole current refcount API in terms of that. Basically
> everywhere we currently do refcount_warn_saturate() would become goto
> label.
>
> And then refcount_t could be a thin wrapper on top of that. But urgh...
> lots of work, very little gain.
>
> So what do we do? Keep things as is, and think about it again once we
> got the first bug in hand, preemptively add a few ops or go completely
> overboard?
>
> Obviously I'm all for keeping things as is (less work for this lazy
> bastard etc..)

I think an entirely new type might be overkill, but at the very least
designing the interface such that it's

A. either impossible to not notice the fact atomic_*overflow()
works in terms of offsets, or
B. not even exposing this detail.

#A can be achieved with supplying offsets to atomic_*overflow(). #B can
be achieved with new wrapper types -- however, if we somehow ensure that
refcount_t remains the only user of atomic_*overflow(), I'd consider
refcount_t a wrapper type already, so no need to add more.

Regarding the interface, it'd be nice if it could be made harder to
misuse, but I don't know how much it'll buy over what it is right now,
since we don't even know if there'll be other users of this yet.

But here are some more issues I just thought of:

1. A minor issue is inspecting raw values, like in register
dumps. refcount_t will now look different on x86 vs. other
architectures.

2. Yet another potentially larger issue is if some code
kmalloc()s some structs containing refcount_t, and relies on
GFP_ZERO (kzalloc()) to initialize their data assuming that a
freshly initialized refcount_t contains 0.

I think #1 is a cosmetic issue, which we might be able to live with.

However, I have absolutely no idea how we can audit or even prevent #2
from happening. With #2 in mind, and with C's lack of enforcing any kind
of "constructors", the interface and implementation we end up with is
going to result in near-impossible to debug issues sooner or later.
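
To make #2 concrete (hypothetical struct, not taken from real code):

    struct obj {
            refcount_t ref;
    };

    static struct obj *obj_alloc(void)
    {
            struct obj *o = kzalloc(sizeof(*o), GFP_KERNEL);

            if (!o)
                    return NULL;
            /*
             * With ATOMIC_OVERFLOW_OFFSET == 1 the zeroed ->ref is the
             * stored form of a live count of 1, not of a freed object,
             * so skipping this explicit init is silently wrong.
             */
            refcount_set(&o->ref, 1);
            return o;
    }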

Thanks,
-- Marco

2021-12-13 18:11:57

by Linus Torvalds

Subject: Re: [PATCH v2 8/9] atomic,x86: Alternative atomic_*_overflow() scheme

On Mon, Dec 13, 2021 at 8:43 AM Peter Zijlstra <[email protected]> wrote:
>
> So Marco was expressing doubt about this exact interface for the
> atomic_*_overflow() functions, since it's extremely easy to get the
> whole ATOMIC_OVERFLOW_OFFSET thing wrong.

I missed that discussion (maybe it was on irc? Or maybe I just get too
much email).

Anyway, my preferred solution would simply be to make the ref-counting
atomics use a different type.

Voilà, problem solved. You can't really misuse them by mistake,
because you can't access it by mistake.

Sure, it could be a wrapper around 'atomic_t' on architectures that
end up using the generic fallback, so it might be as simple as

typedef atomic_t atomic_ref_t;

in some asm-generic implementation, although I suspect that you'd want
type safety even there, and do

typedef struct { atomic_t atomic_val; } atomic_ref_t;

But then on x86 - and other architectures that might prefer to use
that offset trick because they have flags - I'm not sure it even makes
sense to have anything to do with 'atomic_t' at all, since there would
basically be zero overlap with the regular atomic operations (partly
due to the offset, but partly simply because the 'ref' operations are
simply different).

(Wrt naming: I do think this is more about the "ref" part than the
"overflow" part - thus I'd suggest the "atomic_ref_t" rather than your
ofl naming).
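
An asm-generic fallback along those lines might be as small as this
(names are guesses, assuming wrappers of the arch_atomic_*_overflow()
ops from Peter's series):

    typedef struct { atomic_t atomic_val; } atomic_ref_t;

    /* generic fallback: delegate to the existing overflow ops */
    #define atomic_ref_inc_overflow(_r, _label) \
            atomic_inc_overflow(&(_r)->atomic_val, _label)

    #define atomic_ref_dec_and_test_overflow(_r, _label) \
            atomic_dec_and_test_overflow(&(_r)->atomic_val, _label)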

Linus

2021-12-13 18:19:05

by Marco Elver

Subject: Re: [PATCH v2 8/9] atomic,x86: Alternative atomic_*_overflow() scheme

On Mon, Dec 13, 2021 at 10:11AM -0800, Linus Torvalds wrote:
> On Mon, Dec 13, 2021 at 8:43 AM Peter Zijlstra <[email protected]> wrote:
> >
> > So Marco was expressing doubt about this exact interface for the
> > atomic_*_overflow() functions, since it's extremely easy to get the
> > whole ATOMIC_OVERFLOW_OFFSET thing wrong.
>
> I missed that discussion (maybe it was on irc? Or maybe I just get too
> much email).
>
> Anyway, my preferred solution would simply be to make the ref-counting
> atomics use a different type.
>
> Voilà, problem solved. You can't really misuse them by mistake,
> because you can't access it by mistake.
>
> Sure, it could be a wrapper around 'atomic_t' on architectures that
> end up using the generic fallback, so it might be as simple as
>
> typedef atomic_t atomic_ref_t;
>
> in some asm-generic implementation, although I suspect that you'd want
> type safety even there, and do
>
> typedef struct { atomic_t atomic_val; } atomic_ref_t;
>
> But then on x86 - and other architectures that might prefer to use
> that offset trick because they have flags - I'm not sure it even makes
> sense to have anything to do with 'atomic_t' at all, since there would
> basically be zero overlap with the regular atomic operations (partly
> due to the offset, but partly simply because the 'ref' operations are
> simply different).
>
> (Wrt naming: I do think this is more about the "ref" part than the
> "overflow" part - thus I'd suggest the "atomic_ref_t" rather than your
> ofl naming).

I'm still genuinely worried about this:

> 2. Yet another potentially larger issue is if some code
> kmalloc()s some structs containing refcount_t, and relies on
> GFP_ZERO (kzalloc()) to initialize their data assuming that a
> freshly initialized refcount_t contains 0.

Even with everything properly wrapped up in atomic_ref_t, it's not going
to prevent mis-initialization via kzalloc() and friends.

I think C won't let us design that misuse out of existence.

Thanks,
-- Marco

2021-12-13 18:21:26

by Linus Torvalds

Subject: Re: [PATCH v2 8/9] atomic,x86: Alternative atomic_*_overflow() scheme

On Mon, Dec 13, 2021 at 10:11 AM Linus Torvalds
<[email protected]> wrote:
>
> But then on x86 - and other architectures that might prefer to use
> that offset trick because they have flags - I'm not sure it even makes
> sense to have anything to do with 'atomic_t' at all [..]

Side note: it might not be just about flags, and not just about things
like "offset-by-one".

We used to have special code for old 32-bit sparc because there you
couldn't actually sanely do 32-bit atomics: you could only really do
24 bits, because the whole first (upper) byte was the atomic lock
byte.

Now, that was too painful because nobody else did that, so these days
32-bit sparc just uses a hashed spinlock instead.

But I think it would be lovely to _design_ the atomic_ref_t to be able
to deal with that odd sparc32 issue too. Not because anybody cares
about 32-bit sparc any more, and not because I think anybody would
ever actually bother to write such code, but because I think it's a
good design goal to kind of aim for: if we're doing an actual
ref-counting data structure, where we do *not* have "int" semantics,
and very much have a concept of overflow, then I think it should
conceptually also work with that odd sparc32 24-bit atomic integer
model.

Was it a broken model? Was it stupid? Yes. Do I ever expect to see it
again? No. But I do think that conceptually we should strive to have
that as a _possible_ model.

In fact, it might be interesting to have something like that as a
debug model, where you have a smaller range for ref-counting, just to
make it easier to test that the code does the right thing for
overflow. 24 bits is a lot easier to overflow, while still being big
enough to work in practice.

So I do think that having a separate type system that simply does not
_work_ with somebody trying to do "atomic_xyz()" on it is the right
way to go.

Linus

2021-12-13 18:25:10

by Linus Torvalds

Subject: Re: [PATCH v2 8/9] atomic,x86: Alternative atomic_*_overflow() scheme

On Mon, Dec 13, 2021 at 10:19 AM Marco Elver <[email protected]> wrote:
>
> I'm still genuinely worried about this:
>
> > 2. Yet another potentially larger issue is if some code
> > kmalloc()s some structs containing refcount_t, and relies on
> > GFP_ZERO (kzalloc()) to initialize their data assuming that a
> > freshly initialized refcount_t contains 0.
>
> Even with everything properly wrapped up in atomic_ref_t, it's not going
> to prevent mis-initialization via kzalloc() and friends.

I agree that it's an issue, but it's not a new issue. We've had the
exact same thing with a lot of other core data structures.

And a ref-count of zero isn't valid _anyway_. When you allocate a
structure, a zero ref-count by definition is wrong. You need to set
the ref-count to the user that allocated it.

So I don't actually think the "implicit zero" is an issue in practice,
because it would be wrong in the first place. Code that relies on
kzalloc() to initialize a refcount cannot work right.

(And by "cannot" I obviously mean "can, if you do wrong things" - it's
not like it's *impossible* to do an "atomic_inc_ref()" to change a 0
refcount to a 1, but it's both wrong *AND* actively stupid, since an
allocation does not need to set the refcount atomically).

Linus

2021-12-13 19:35:50

by Marco Elver

Subject: Re: [PATCH v2 8/9] atomic,x86: Alternative atomic_*_overflow() scheme

On Mon, Dec 13, 2021 at 10:24AM -0800, Linus Torvalds wrote:
> On Mon, Dec 13, 2021 at 10:19 AM Marco Elver <[email protected]> wrote:
> >
> > I'm still genuinely worried about this:
> >
> > > 2. Yet another potentially larger issue is if some code
> > > kmalloc()s some structs containing refcount_t, and relies on
> > > GFP_ZERO (kzalloc()) to initialize their data assuming that a
> > > freshly initialized refcount_t contains 0.
> >
> > Even with everything properly wrapped up in atomic_ref_t, it's not going
> > to prevent mis-initialization via kzalloc() and friends.
>
> I agree that it's an issue, but it's not a new issue. We've had the
> exact same thing with a lot of other core data structures.
>
> And a ref-count of zero isn't valid _anyway_. When you allocate a
> structure, a zero ref-count by definition is wrong. You need to set
> the ref-count to the user that allocated it.
>
> So I don't actually think the "implicit zero" is an issue in practice,
> because it would be wrong in the first place. Code that relies on
> kzalloc() to initialize a refcount cannot work right.
>
> (And by "cannot" I obviously mean "can, if you do wrong things" - it's
> not like it's *impossible* to do an "atomic_inc_ref()" to change a 0
> refcount to a 1, but it's both wrong *AND* actively stupid, since an
> allocation does not need to set the refcount atomically).

I tried to throw syzkaller at this, but just booting results in this
warning:

| WARNING: CPU: 2 PID: 1 at block/blk-mq.c:3080 blk_mq_clear_flush_rq_mapping block/blk-mq.c:3080 [inline]
| WARNING: CPU: 2 PID: 1 at block/blk-mq.c:3080 blk_mq_exit_hctx+0x20e/0x220 block/blk-mq.c:3105
| Modules linked in:
| CPU: 2 PID: 1 Comm: swapper/0 Not tainted 5.16.0-rc5+ #3
| Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.14.0-2 04/01/2014
| RIP: 0010:blk_mq_clear_flush_rq_mapping block/blk-mq.c:3080 [inline]
| RIP: 0010:blk_mq_exit_hctx+0x20e/0x220 block/blk-mq.c:3105
| Code: 00 31 d2 bf 9c 00 00 00 e8 7f 20 a2 ff e9 52 ff ff ff e8 65 22 b1 ff 4c 89 e7 e8 6d 77 00 00 e9 58 fe ff ff e8 53 22 b1 ff 90 <0f> 0b 90 e9 7e fe ff ff 66 2e 0f 1f 84 00 00 00 00 00 41 55 41 54
| RSP: 0000:ffff9c10800139f0 EFLAGS: 00010293
| RAX: 0000000000000000 RBX: ffff970c428990b8 RCX: ffffffff9c662ecd
| RDX: ffff970c40880040 RSI: 0000000000000000 RDI: ffff970c4421e200
| RBP: ffff970c43ff9e80 R08: ffff970c43f14fe0 R09: ffff9c1080013a18
| R10: 0000000000000000 R11: 000000000001403d R12: ffff970c4421e200
| R13: ffff970c44237000 R14: 0000000000000100 R15: ffff970c4421e200
| FS: 0000000000000000(0000) GS:ffff970f6fc80000(0000) knlGS:0000000000000000
| CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
| CR2: 0000000000000000 CR3: 00000002c2c0c001 CR4: 0000000000770ee0
| DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
| DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
| PKRU: 55555554
| Call Trace:
| <TASK>
| blk_mq_exit_hw_queues block/blk-mq.c:3130 [inline]
| blk_mq_exit_queue+0x70/0x1a0 block/blk-mq.c:3780
| blk_cleanup_queue+0xba/0x120 block/blk-core.c:368
| __scsi_remove_device+0x77/0x190 drivers/scsi/scsi_sysfs.c:1458
| scsi_probe_and_add_lun+0xb5c/0xfa0 drivers/scsi/scsi_scan.c:1221
| __scsi_scan_target+0x121/0x660 drivers/scsi/scsi_scan.c:1604
| scsi_scan_channel drivers/scsi/scsi_scan.c:1692 [inline]
| scsi_scan_channel+0x90/0xd0 drivers/scsi/scsi_scan.c:1668
| scsi_scan_host_selected+0x117/0x170 drivers/scsi/scsi_scan.c:1721
| do_scsi_scan_host+0xba/0xd0 drivers/scsi/scsi_scan.c:1860
| scsi_scan_host drivers/scsi/scsi_scan.c:1890 [inline]
| scsi_scan_host+0x1cf/0x1f0 drivers/scsi/scsi_scan.c:1878
| virtscsi_probe+0x3a0/0x3f0 drivers/scsi/virtio_scsi.c:915
| virtio_dev_probe+0x1e4/0x2f0 drivers/virtio/virtio.c:273
| call_driver_probe drivers/base/dd.c:517 [inline]
| really_probe.part.0+0xd3/0x320 drivers/base/dd.c:596
| really_probe drivers/base/dd.c:558 [inline]
| __driver_probe_device+0xbf/0x180 drivers/base/dd.c:751
| driver_probe_device+0x27/0xd0 drivers/base/dd.c:781
| __driver_attach drivers/base/dd.c:1140 [inline]
| __driver_attach+0xb4/0x1a0 drivers/base/dd.c:1092
| bus_for_each_dev+0xab/0x100 drivers/base/bus.c:301
| bus_add_driver+0x1c0/0x220 drivers/base/bus.c:618
| driver_register+0xc4/0x140 drivers/base/driver.c:171
| init+0xa0/0xea drivers/char/virtio_console.c:2257
| do_one_initcall+0x58/0x270 init/main.c:1297
| do_initcall_level init/main.c:1370 [inline]
| do_initcalls init/main.c:1386 [inline]
| do_basic_setup init/main.c:1405 [inline]
| kernel_init_freeable+0x1f4/0x259 init/main.c:1610
| kernel_init+0x19/0x180 init/main.c:1499
| ret_from_fork+0x22/0x30 arch/x86/entry/entry_64.S:295
| </TASK>

which is

WARN_ON_ONCE(refcount_read(&flush_rq->ref) != 0);

.config is attached. Reverting this patch resolves the warning. I
haven't been able to figure out which bit of that code assumes that
"freed" is represented by 0 refcount internally. There is an
alloc_pages() with GFP_ZERO in that code, which seems to allocate a
'struct request' pool that contains ->ref?

This just confirms my suspicion that this is a can of worms. We can try
to debug one issue after another until they are no more, but the
fundamental issue of saying that 0 is not 0 is a bug-magnet (in the
absence of real "constructors", which C doesn't have).

Thanks,
-- Marco


Attachments:
.config (131.27 kB)

2021-12-17 03:38:19

by Herbert Xu

Subject: Re: [PATCH v2 8/9] atomic,x86: Alternative atomic_*_overflow() scheme

Linus Torvalds <[email protected]> wrote:
>
> That said - it may not matter - I'm not sure a plain "dec" is even a
> valid operation on a ref in the first place. How could you ever
> validly decrement a ref without checking for it being the last entry?

There are actual spots in the network stack where we know we're
holding multiple reference counts to a given object and in those
cases an unconditional "dec" could make sense. For example, we may
have an object that we obtained from a hash lookup, giving us a
reference count, which we then try to remove from a linked list,
also containing a reference count to it. While still holding
the reference count from the hash lookup, the linked list
reference count could be dropped with a plain "dec".
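
As a sketch of that situation (hypothetical object, not real networking
code):

    struct obj {
            struct hlist_node hash_node;
            refcount_t ref;
    };

    /* caller still holds the reference obtained from the hash lookup */
    static void obj_unlink(struct obj *o)
    {
            hlist_del(&o->hash_node);
            /*
             * Drop the list's reference; it cannot be the last one while
             * the lookup reference is held, so no zero test is needed.
             */
            refcount_dec(&o->ref);
    }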

Of course we might be better off redesigning things to eliminate
reference counts completely but such code does still exist.

Cheers,
--
Email: Herbert Xu <[email protected]>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt