Restore LKML
On Thu, Oct 12, 2023 at 02:21:10PM +0200, Jan Kara wrote:
> On Wed 11-10-23 11:26:29, Yury Norov wrote:
> > Long story short: KCSAN found some potential issues related to how
> > people use bitmap API. And instead of working through that issues,
> > the following code shuts down KCSAN by applying READ_ONCE() here
> > and there.
>
> I'm sorry but this is not what the patch does. I'm not sure how to get the
> message across so maybe let me start from a different angle:
>
> Bitmaps are perfectly fine to be used without any external locking if
> only atomic bit ops (set_bit, clear_bit, test_and_{set/clear}_bit) are
> used. This is a significant performance gain compared to using a spinlock
> or other locking and people do this for a long time. I hope we agree on
> that.
>
> Now it is also common that you need to find a set / clear bit in a bitmap.
> To maintain lockless protocol and deal with races people employ schemes
> like (the dumbest form):
>
> do {
> bit = find_first_bit(bitmap, n);
> if (bit >= n)
> abort...
> } while (!test_and_clear_bit(bit, bitmap));
>
> So the code loops until it finds a set bit that is successfully cleared by
> it. This is perfectly fine and safe lockless code and such use should be
> supported. Agreed?
Great example. When you're running non-atomic functions concurrently,
the result may easily become incorrect, and this is what you're
demonstrating here.
Regarding find_first_bit() it means that:
- it may erroneously return unset bit;
- it may erroneously return non-first set bit;
- it may erroneously return no bits for non-empty bitmap.
Effectively it means that find_first bit may just return a random number.
Let's take another example:
do {
bit = get_random_number();
if (bit >= n)
abort...
} while (!test_and_clear_bit(bit, bitmap));
When running concurrently, the difference between this and your code
is only in probability of getting set bit somewhere from around the
beginning of bitmap.
The key point is that find_bit() may return undef even if READ_ONCE() is
used. If bitmap gets changed anytime in the process, the result becomes
invalid. It may happen even after returning from find_first_bit().
And if my understanding correct, your code is designed in the
assumption that find_first_bit() may return garbage, so handles it
correctly.
> *Except* that the above actually is not safe due to find_first_bit()
> implementation and KCSAN warns about that. The problem is that:
>
> Assume *addr == 1
> CPU1 CPU2
> find_first_bit(addr, 64)
> val = *addr;
> if (val) -> true
> clear_bit(0, addr)
> val = *addr -> compiler decided to refetch addr contents for whatever
> reason in the generated assembly
> __ffs(val) -> now executed for value 0 which has undefined results.
Yes, __ffs(0) is undef. But the whole function is undef when accessing
bitmap concurrently.
> And the READ_ONCE() this patch adds prevents the compiler from adding the
> refetching of addr into the assembly.
That's true. But it doesn't improve on the situation. It was an undef
before, and it's undef after, but a 2% slower undef.
Now on that KCSAN warning. If I understand things correctly, for the
example above, KCSAN warning is false-positive, because you're
intentionally running lockless.
But for some other people it may be a true error, and now they'll have
no chance to catch it if KCSAN is forced to ignore find_bit() entirely.
We've got the whole class of lockless algorithms that allow safe concurrent
access to the memory. And now that there's a tool that searches for them
(concurrent accesses), we need to have an option to somehow teach it
to suppress irrelevant warnings. Maybe something like this?
lockless_algorithm_begin(bitmap, bitmap_size(nbits));
do {
bit = find_first_bit(bitmap, nbits);
if (bit >= nbits)
break;
} while (!test_and_clear_bit(bit, bitmap));
lockless_algorithm_end(bitmap, bitmap_size(nbits));
And, of course, as I suggested a couple iterations ago, you can invent
a thread-safe version of find_bit(), that would be perfectly correct
for lockless use:
unsigned long _find_and_clear_bit(volatile unsigned long *addr, unsigned long size)
{
unsigned long bit = 0;
while (!test_and_clear_bit(bit, bitmap) {
bit = FIND_FIRST_BIT(addr[idx], /* nop */, size);
if (bit >= size)
return size;
}
return bit;
}
Didn't test that, but I hope 'volatile' specifier should be enough
for compiler to realize that it shouldn't optimize memory access, and
for KCSAN that everything's OK here.
By the way, thank you for respectful and professional communication.
Thanks,
Yury
On 10/14/2023 2:15 AM, Yury Norov wrote:
> Restore LKML
>
> On Thu, Oct 12, 2023 at 02:21:10PM +0200, Jan Kara wrote:
>> On Wed 11-10-23 11:26:29, Yury Norov wrote:
>>> Long story short: KCSAN found some potential issues related to how
>>> people use bitmap API. And instead of working through that issues,
>>> the following code shuts down KCSAN by applying READ_ONCE() here
>>> and there.
>>
>> I'm sorry but this is not what the patch does. I'm not sure how to get the
>> message across so maybe let me start from a different angle:
>>
>> Bitmaps are perfectly fine to be used without any external locking if
>> only atomic bit ops (set_bit, clear_bit, test_and_{set/clear}_bit) are
>> used. This is a significant performance gain compared to using a spinlock
>> or other locking and people do this for a long time. I hope we agree on
>> that.
>>
>> Now it is also common that you need to find a set / clear bit in a bitmap.
>> To maintain lockless protocol and deal with races people employ schemes
>> like (the dumbest form):
>>
>> do {
>> bit = find_first_bit(bitmap, n);
>> if (bit >= n)
>> abort...
>> } while (!test_and_clear_bit(bit, bitmap));
>>
>> So the code loops until it finds a set bit that is successfully cleared by
>> it. This is perfectly fine and safe lockless code and such use should be
>> supported. Agreed?
>
> Great example. When you're running non-atomic functions concurrently,
> the result may easily become incorrect, and this is what you're
> demonstrating here.
>
> Regarding find_first_bit() it means that:
> - it may erroneously return unset bit;
> - it may erroneously return non-first set bit;
> - it may erroneously return no bits for non-empty bitmap.
>
> Effectively it means that find_first bit may just return a random number.
>
> Let's take another example:
>
> do {
> bit = get_random_number();
> if (bit >= n)
> abort...
> } while (!test_and_clear_bit(bit, bitmap));
>
> When running concurrently, the difference between this and your code
> is only in probability of getting set bit somewhere from around the
> beginning of bitmap.
>
> The key point is that find_bit() may return undef even if READ_ONCE() is
> used. If bitmap gets changed anytime in the process, the result becomes
> invalid. It may happen even after returning from find_first_bit().
>
> And if my understanding correct, your code is designed in the
> assumption that find_first_bit() may return garbage, so handles it
> correctly.
>
>> *Except* that the above actually is not safe due to find_first_bit()
>> implementation and KCSAN warns about that. The problem is that:
>>
>> Assume *addr == 1
>> CPU1 CPU2
>> find_first_bit(addr, 64)
>> val = *addr;
>> if (val) -> true
>> clear_bit(0, addr)
>> val = *addr -> compiler decided to refetch addr contents for whatever
>> reason in the generated assembly
>> __ffs(val) -> now executed for value 0 which has undefined results.
>
> Yes, __ffs(0) is undef. But the whole function is undef when accessing
> bitmap concurrently.
>
>> And the READ_ONCE() this patch adds prevents the compiler from adding the
>> refetching of addr into the assembly.
>
> That's true. But it doesn't improve on the situation. It was an undef
> before, and it's undef after, but a 2% slower undef.
>
> Now on that KCSAN warning. If I understand things correctly, for the
> example above, KCSAN warning is false-positive, because you're
> intentionally running lockless.
>
> But for some other people it may be a true error, and now they'll have
> no chance to catch it if KCSAN is forced to ignore find_bit() entirely.
>
> We've got the whole class of lockless algorithms that allow safe concurrent
> access to the memory. And now that there's a tool that searches for them
> (concurrent accesses), we need to have an option to somehow teach it
> to suppress irrelevant warnings. Maybe something like this?
>
> lockless_algorithm_begin(bitmap, bitmap_size(nbits));
> do {
> bit = find_first_bit(bitmap, nbits);
> if (bit >= nbits)
> break;
> } while (!test_and_clear_bit(bit, bitmap));
> lockless_algorithm_end(bitmap, bitmap_size(nbits));
>
> And, of course, as I suggested a couple iterations ago, you can invent
> a thread-safe version of find_bit(), that would be perfectly correct
> for lockless use:
>
> unsigned long _find_and_clear_bit(volatile unsigned long *addr, unsigned long size)
> {
> unsigned long bit = 0;
>
> while (!test_and_clear_bit(bit, bitmap) {
> bit = FIND_FIRST_BIT(addr[idx], /* nop */, size);
> if (bit >= size)
> return size;
> }
>
> return bit;
> }
Hi, Yuri,
But the code above effectively does the same as the READ_ONCE() macro
as defined in rwonce.h:
#ifndef __READ_ONCE
#define __READ_ONCE(x) (*(const volatile __unqual_scalar_typeof(x) *)&(x))
#endif
#define READ_ONCE(x) \
({ \
compiletime_assert_rwonce_type(x); \
__READ_ONCE(x); \
})
Both uses only prevent the funny stuff the compiler might have done to the
read of the addr[idx], there's no black magic in READ_ONCE().
Both examples would probably result in the same assembly and produce the
same 2% slowdown ...
Only you declare volatile in one place, and READ_ONCE() in each read, but
this will only compile a bit slower and generate the same machine code.
Best regards,
Mirsad Todorovac
> Didn't test that, but I hope 'volatile' specifier should be enough
> for compiler to realize that it shouldn't optimize memory access, and
> for KCSAN that everything's OK here.
>
> By the way, thank you for respectful and professional communication.
>
> Thanks,
> Yury
--
Mirsad Todorovac
Sistem inženjer
Grafički fakultet | Akademija likovnih umjetnosti
Sveučilište u Zagrebu
System engineer
Faculty of Graphic Arts | Academy of Fine Arts
University of Zagreb, Republic of Croatia
tel. +385 (0)1 3711 451
mob. +385 91 57 88 355
On Sat, Oct 14, 2023 at 04:21:50AM +0200, Mirsad Goran Todorovac wrote:
> On 10/14/2023 2:15 AM, Yury Norov wrote:
> > Restore LKML
> >
> > On Thu, Oct 12, 2023 at 02:21:10PM +0200, Jan Kara wrote:
> > > On Wed 11-10-23 11:26:29, Yury Norov wrote:
> > > > Long story short: KCSAN found some potential issues related to how
> > > > people use bitmap API. And instead of working through that issues,
> > > > the following code shuts down KCSAN by applying READ_ONCE() here
> > > > and there.
> > >
> > > I'm sorry but this is not what the patch does. I'm not sure how to get the
> > > message across so maybe let me start from a different angle:
> > >
> > > Bitmaps are perfectly fine to be used without any external locking if
> > > only atomic bit ops (set_bit, clear_bit, test_and_{set/clear}_bit) are
> > > used. This is a significant performance gain compared to using a spinlock
> > > or other locking and people do this for a long time. I hope we agree on
> > > that.
> > >
> > > Now it is also common that you need to find a set / clear bit in a bitmap.
> > > To maintain lockless protocol and deal with races people employ schemes
> > > like (the dumbest form):
> > >
> > > do {
> > > bit = find_first_bit(bitmap, n);
> > > if (bit >= n)
> > > abort...
> > > } while (!test_and_clear_bit(bit, bitmap));
> > >
> > > So the code loops until it finds a set bit that is successfully cleared by
> > > it. This is perfectly fine and safe lockless code and such use should be
> > > supported. Agreed?
> >
> > Great example. When you're running non-atomic functions concurrently,
> > the result may easily become incorrect, and this is what you're
> > demonstrating here.
> >
> > Regarding find_first_bit() it means that:
> > - it may erroneously return unset bit;
> > - it may erroneously return non-first set bit;
> > - it may erroneously return no bits for non-empty bitmap.
> >
> > Effectively it means that find_first bit may just return a random number.
> >
> > Let's take another example:
> >
> > do {
> > bit = get_random_number();
> > if (bit >= n)
> > abort...
> > } while (!test_and_clear_bit(bit, bitmap));
> >
> > When running concurrently, the difference between this and your code
> > is only in probability of getting set bit somewhere from around the
> > beginning of bitmap.
> >
> > The key point is that find_bit() may return undef even if READ_ONCE() is
> > used. If bitmap gets changed anytime in the process, the result becomes
> > invalid. It may happen even after returning from find_first_bit().
> >
> > And if my understanding correct, your code is designed in the
> > assumption that find_first_bit() may return garbage, so handles it
> > correctly.
> >
> > > *Except* that the above actually is not safe due to find_first_bit()
> > > implementation and KCSAN warns about that. The problem is that:
> > >
> > > Assume *addr == 1
> > > CPU1 CPU2
> > > find_first_bit(addr, 64)
> > > val = *addr;
> > > if (val) -> true
> > > clear_bit(0, addr)
> > > val = *addr -> compiler decided to refetch addr contents for whatever
> > > reason in the generated assembly
> > > __ffs(val) -> now executed for value 0 which has undefined results.
> >
> > Yes, __ffs(0) is undef. But the whole function is undef when accessing
> > bitmap concurrently.
> >
> > > And the READ_ONCE() this patch adds prevents the compiler from adding the
> > > refetching of addr into the assembly.
> >
> > That's true. But it doesn't improve on the situation. It was an undef
> > before, and it's undef after, but a 2% slower undef.
> >
> > Now on that KCSAN warning. If I understand things correctly, for the
> > example above, KCSAN warning is false-positive, because you're
> > intentionally running lockless.
> >
> > But for some other people it may be a true error, and now they'll have
> > no chance to catch it if KCSAN is forced to ignore find_bit() entirely.
> >
> > We've got the whole class of lockless algorithms that allow safe concurrent
> > access to the memory. And now that there's a tool that searches for them
> > (concurrent accesses), we need to have an option to somehow teach it
> > to suppress irrelevant warnings. Maybe something like this?
> >
> > lockless_algorithm_begin(bitmap, bitmap_size(nbits));
> > do {
> > bit = find_first_bit(bitmap, nbits);
> > if (bit >= nbits)
> > break;
> > } while (!test_and_clear_bit(bit, bitmap));
> > lockless_algorithm_end(bitmap, bitmap_size(nbits));
> >
> > And, of course, as I suggested a couple iterations ago, you can invent
> > a thread-safe version of find_bit(), that would be perfectly correct
> > for lockless use:
> >
> > unsigned long _find_and_clear_bit(volatile unsigned long *addr, unsigned long size)
> > {
> > unsigned long bit = 0;
> > while (!test_and_clear_bit(bit, bitmap) {
> > bit = FIND_FIRST_BIT(addr[idx], /* nop */, size);
> > if (bit >= size)
> > return size;
> > }
> >
> > return bit;
> > }
>
> Hi, Yuri,
>
> But the code above effectively does the same as the READ_ONCE() macro
> as defined in rwonce.h:
>
> #ifndef __READ_ONCE
> #define __READ_ONCE(x) (*(const volatile __unqual_scalar_typeof(x) *)&(x))
> #endif
>
> #define READ_ONCE(x) \
> ({ \
> compiletime_assert_rwonce_type(x); \
> __READ_ONCE(x); \
> })
>
> Both uses only prevent the funny stuff the compiler might have done to the
> read of the addr[idx], there's no black magic in READ_ONCE().
>
> Both examples would probably result in the same assembly and produce the
> same 2% slowdown ...
>
> Only you declare volatile in one place, and READ_ONCE() in each read, but
> this will only compile a bit slower and generate the same machine code.
The difference is that find_and_clear_bit() has a semantics of
atomic operation. Those who will decide to use it will also anticipate
associate downsides. And other hundreds (or thousands) users of
non-atomic find_bit() functions will not have to pay extra buck
for unneeded atomicity.
Check how 'volatile' is used in test_and_clear_bit(), and consider
find_and_clear_bit() as a wrapper around test_and_clear_bit().
In other words, this patch suggests to make find_bit() thread-safe by
using READ_ONCE(), and it doesn't work. find_and_clear_bit(), on the
other hand, is simply a wrapper around test_and_clear_bit(), and
doesn't imply any new restriction that test_and_clear_bit() doesn't.
Think of it as an optimized version of:
while (bit < nbits && !test_and_clear_bit(bit, bitmap)
bit++;
If you think it's worth to try in your code, I can send a patch for
you.
Thanks,
Yury
On 10/14/23 04:53, Yury Norov wrote:
> On Sat, Oct 14, 2023 at 04:21:50AM +0200, Mirsad Goran Todorovac wrote:
>> On 10/14/2023 2:15 AM, Yury Norov wrote:
>>> Restore LKML
>>>
>>> On Thu, Oct 12, 2023 at 02:21:10PM +0200, Jan Kara wrote:
>>>> On Wed 11-10-23 11:26:29, Yury Norov wrote:
>>>>> Long story short: KCSAN found some potential issues related to how
>>>>> people use bitmap API. And instead of working through that issues,
>>>>> the following code shuts down KCSAN by applying READ_ONCE() here
>>>>> and there.
>>>>
>>>> I'm sorry but this is not what the patch does. I'm not sure how to get the
>>>> message across so maybe let me start from a different angle:
>>>>
>>>> Bitmaps are perfectly fine to be used without any external locking if
>>>> only atomic bit ops (set_bit, clear_bit, test_and_{set/clear}_bit) are
>>>> used. This is a significant performance gain compared to using a spinlock
>>>> or other locking and people do this for a long time. I hope we agree on
>>>> that.
>>>>
>>>> Now it is also common that you need to find a set / clear bit in a bitmap.
>>>> To maintain lockless protocol and deal with races people employ schemes
>>>> like (the dumbest form):
>>>>
>>>> do {
>>>> bit = find_first_bit(bitmap, n);
>>>> if (bit >= n)
>>>> abort...
>>>> } while (!test_and_clear_bit(bit, bitmap));
>>>>
>>>> So the code loops until it finds a set bit that is successfully cleared by
>>>> it. This is perfectly fine and safe lockless code and such use should be
>>>> supported. Agreed?
>>>
>>> Great example. When you're running non-atomic functions concurrently,
>>> the result may easily become incorrect, and this is what you're
>>> demonstrating here.
>>>
>>> Regarding find_first_bit() it means that:
>>> - it may erroneously return unset bit;
>>> - it may erroneously return non-first set bit;
>>> - it may erroneously return no bits for non-empty bitmap.
>>>
>>> Effectively it means that find_first bit may just return a random number.
>>>
>>> Let's take another example:
>>>
>>> do {
>>> bit = get_random_number();
>>> if (bit >= n)
>>> abort...
>>> } while (!test_and_clear_bit(bit, bitmap));
>>>
>>> When running concurrently, the difference between this and your code
>>> is only in probability of getting set bit somewhere from around the
>>> beginning of bitmap.
>>>
>>> The key point is that find_bit() may return undef even if READ_ONCE() is
>>> used. If bitmap gets changed anytime in the process, the result becomes
>>> invalid. It may happen even after returning from find_first_bit().
>>>
>>> And if my understanding correct, your code is designed in the
>>> assumption that find_first_bit() may return garbage, so handles it
>>> correctly.
>>>
>>>> *Except* that the above actually is not safe due to find_first_bit()
>>>> implementation and KCSAN warns about that. The problem is that:
>>>>
>>>> Assume *addr == 1
>>>> CPU1 CPU2
>>>> find_first_bit(addr, 64)
>>>> val = *addr;
>>>> if (val) -> true
>>>> clear_bit(0, addr)
>>>> val = *addr -> compiler decided to refetch addr contents for whatever
>>>> reason in the generated assembly
>>>> __ffs(val) -> now executed for value 0 which has undefined results.
>>>
>>> Yes, __ffs(0) is undef. But the whole function is undef when accessing
>>> bitmap concurrently.
>>>
>>>> And the READ_ONCE() this patch adds prevents the compiler from adding the
>>>> refetching of addr into the assembly.
>>>
>>> That's true. But it doesn't improve on the situation. It was an undef
>>> before, and it's undef after, but a 2% slower undef.
>>>
>>> Now on that KCSAN warning. If I understand things correctly, for the
>>> example above, KCSAN warning is false-positive, because you're
>>> intentionally running lockless.
>>>
>>> But for some other people it may be a true error, and now they'll have
>>> no chance to catch it if KCSAN is forced to ignore find_bit() entirely.
>>>
>>> We've got the whole class of lockless algorithms that allow safe concurrent
>>> access to the memory. And now that there's a tool that searches for them
>>> (concurrent accesses), we need to have an option to somehow teach it
>>> to suppress irrelevant warnings. Maybe something like this?
>>>
>>> lockless_algorithm_begin(bitmap, bitmap_size(nbits));
>>> do {
>>> bit = find_first_bit(bitmap, nbits);
>>> if (bit >= nbits)
>>> break;
>>> } while (!test_and_clear_bit(bit, bitmap));
>>> lockless_algorithm_end(bitmap, bitmap_size(nbits));
>>>
>>> And, of course, as I suggested a couple iterations ago, you can invent
>>> a thread-safe version of find_bit(), that would be perfectly correct
>>> for lockless use:
>>>
>>> unsigned long _find_and_clear_bit(volatile unsigned long *addr, unsigned long size)
>>> {
>>> unsigned long bit = 0;
>>> while (!test_and_clear_bit(bit, bitmap) {
>>> bit = FIND_FIRST_BIT(addr[idx], /* nop */, size);
>>> if (bit >= size)
>>> return size;
>>> }
>>>
>>> return bit;
>>> }
>>
>> Hi, Yuri,
>>
>> But the code above effectively does the same as the READ_ONCE() macro
>> as defined in rwonce.h:
>>
>> #ifndef __READ_ONCE
>> #define __READ_ONCE(x) (*(const volatile __unqual_scalar_typeof(x) *)&(x))
>> #endif
>>
>> #define READ_ONCE(x) \
>> ({ \
>> compiletime_assert_rwonce_type(x); \
>> __READ_ONCE(x); \
>> })
>>
>> Both uses only prevent the funny stuff the compiler might have done to the
>> read of the addr[idx], there's no black magic in READ_ONCE().
>>
>> Both examples would probably result in the same assembly and produce the
>> same 2% slowdown ...
>>
>> Only you declare volatile in one place, and READ_ONCE() in each read, but
>> this will only compile a bit slower and generate the same machine code.
>
> The difference is that find_and_clear_bit() has a semantics of
> atomic operation. Those who will decide to use it will also anticipate
> associate downsides. And other hundreds (or thousands) users of
> non-atomic find_bit() functions will not have to pay extra buck
> for unneeded atomicity.
>
> Check how 'volatile' is used in test_and_clear_bit(), and consider
> find_and_clear_bit() as a wrapper around test_and_clear_bit().
>
> In other words, this patch suggests to make find_bit() thread-safe by
> using READ_ONCE(), and it doesn't work. find_and_clear_bit(), on the
> other hand, is simply a wrapper around test_and_clear_bit(), and
> doesn't imply any new restriction that test_and_clear_bit() doesn't.
>
> Think of it as an optimized version of:
> while (bit < nbits && !test_and_clear_bit(bit, bitmap)
> bit++;
>
> If you think it's worth to try in your code, I can send a patch for
> you.
>
> Thanks,
> Yury
After some thinking, your declaration:
>>> unsigned long _find_and_clear_bit(volatile unsigned long *addr, unsigned long size)
OK, this makes "addr" a pointer to a volatile array of unsigned longs.
But to this I have an objection:
> while (bit < nbits && !test_and_clear_bit(bit, bitmap)
> bit++;
Note that there is nothing magical in an atomic test_and_clear_bit():
it has to read entire (quad)word, remember the stat of the bit, clear it,
and write it back.
The problem is that LOCK prefix comes before the assembled instruction
LOCK BTR r/m32, imm8
so it would be executed atomically.
Otherwise there are no guarantees that other core wouldn't write its own
idea of the value.
But atomic test_and_clear_bit() is not a free lunch: you would LOCK the
bus for all cores except yours 32/64 times for each bit, per (quad)word tested.
That is 32/64 times more than it is optimal, and it looks like a real hog.
Do you see the difference?
Needless to say, your atomicity works for one bit, and nothing prevents i.e.
core 5 to modify/set atomically bits you have already tested and found clear ...
Ideally, you would use atomic ffs() which is compiled as a single and atomic BSF/BSR
instruction on x86.
BSR r32, r/m32
(Alternatively you might want BSL in your algorithm, scanning from the least
significant bit.)
Even better would be if we could also atomically clear that bit in memory and
have the instruction return its index.
Test is atomic because it is a single instruction, but it can also be prefixed
with a LOCK.
LOCK BSR reg, mem
LOCK BTR mem, reg
would unfortunately not work, because something unfortunate could change the
memory location in between.
But your proposed algorithm is nothing more atomic than
bit = ffs(bitmap);
test_and_clear_bit(bit, bitmap);
And only less efficient, since you use on average 16/32 bus LOCKs instead of
one assembly instruction BSR/BRL. Those LOCK prefixes would mean that the entire
set of cores is prevented from reading or modifying memory while the bus is
LOCKed, or only writes on smarted architectures with WRITE LOCK in assembly,
as reads won't hurt the process of test_and_clear_bit(), only might give an
out-of-sync value.
Whatever fancy C function or macro, it cannot outsmart the CPU instruction set if
the set doesn't have that one.
For x86/x86_64, I couldn't find an atomic instruction to find the first set bit
and atomically clear it, returning its value ... but it doesn't mean nobody else knows.
Regards,
Mirsad
On Fri 13-10-23 17:15:28, Yury Norov wrote:
> On Thu, Oct 12, 2023 at 02:21:10PM +0200, Jan Kara wrote:
> > On Wed 11-10-23 11:26:29, Yury Norov wrote:
> > > Long story short: KCSAN found some potential issues related to how
> > > people use bitmap API. And instead of working through that issues,
> > > the following code shuts down KCSAN by applying READ_ONCE() here
> > > and there.
> >
> > I'm sorry but this is not what the patch does. I'm not sure how to get the
> > message across so maybe let me start from a different angle:
> >
> > Bitmaps are perfectly fine to be used without any external locking if
> > only atomic bit ops (set_bit, clear_bit, test_and_{set/clear}_bit) are
> > used. This is a significant performance gain compared to using a spinlock
> > or other locking and people do this for a long time. I hope we agree on
> > that.
> >
> > Now it is also common that you need to find a set / clear bit in a bitmap.
> > To maintain lockless protocol and deal with races people employ schemes
> > like (the dumbest form):
> >
> > do {
> > bit = find_first_bit(bitmap, n);
> > if (bit >= n)
> > abort...
> > } while (!test_and_clear_bit(bit, bitmap));
> >
> > So the code loops until it finds a set bit that is successfully cleared by
> > it. This is perfectly fine and safe lockless code and such use should be
> > supported. Agreed?
>
> Great example. When you're running non-atomic functions concurrently,
> the result may easily become incorrect, and this is what you're
> demonstrating here.
>
> Regarding find_first_bit() it means that:
> - it may erroneously return unset bit;
> - it may erroneously return non-first set bit;
> - it may erroneously return no bits for non-empty bitmap.
Correct.
> Effectively it means that find_first bit may just return a random number.
I prefer to think that it can return a result that is no longer valid by
the time we further use it :)
> Let's take another example:
>
> do {
> bit = get_random_number();
> if (bit >= n)
> abort...
> } while (!test_and_clear_bit(bit, bitmap));
>
> When running concurrently, the difference between this and your code
> is only in probability of getting set bit somewhere from around the
> beginning of bitmap.
Well, as you say the difference is in the probability - i.e., average
number of loops taken is higher with using truly random number and that is
the whole point. We bother with complexity of lockless access exactly
because of performance :). As long as find_first_bit() returns set bit in
case there's no collision with other bitmap modification, we are fine with
its results (usually we don't expect the collision to happen, often the
bitmap users also employ schemes to spread different processes modifying
the bitmap to different parts of the bitmap to further reduce likelyhood of
a collision).
> The key point is that find_bit() may return undef even if READ_ONCE() is
> used. If bitmap gets changed anytime in the process, the result becomes
> invalid. It may happen even after returning from find_first_bit().
>
> And if my understanding correct, your code is designed in the
> assumption that find_first_bit() may return garbage, so handles it
> correctly.
Yes, that is true.
> > *Except* that the above actually is not safe due to find_first_bit()
> > implementation and KCSAN warns about that. The problem is that:
> >
> > Assume *addr == 1
> > CPU1 CPU2
> > find_first_bit(addr, 64)
> > val = *addr;
> > if (val) -> true
> > clear_bit(0, addr)
> > val = *addr -> compiler decided to refetch addr contents for whatever
> > reason in the generated assembly
> > __ffs(val) -> now executed for value 0 which has undefined results.
>
> Yes, __ffs(0) is undef. But the whole function is undef when accessing
> bitmap concurrently.
So here I think we get at the core of our misunderstanding :): Yes,
find_first_bit() may return a bit number that is not set any longer. But it
is guaranteed to return some number between 0 and n where n is the bitmap
size. What __ffs() does when passed 0 value is unclear and likely will be
architecture dependent. If we are guaranteed it returns some number between
0 and 8*sizeof(unsigned long), then we are fine. But I'm concerned it may
throw exception (similarly to division by 0) or return number greater than
8*sizeof(unsigned long) for some architecture and that would be a problem.
E.g. reading the x86 bsf instruction documentation, the destination
register is untouched if there is no set bit so the result can indeed be >
8*sizeof(unsigned long). So __ffs(0) can result in returning a number
beyond the end of the bitmap (e.g. 0xffffffff). And that is IMO
unacceptable output for find_first_bit().
> > And the READ_ONCE() this patch adds prevents the compiler from adding the
> > refetching of addr into the assembly.
>
> That's true. But it doesn't improve on the situation. It was an undef
> before, and it's undef after, but a 2% slower undef.
>
> Now on that KCSAN warning. If I understand things correctly, for the
> example above, KCSAN warning is false-positive, because you're
> intentionally running lockless.
As I wrote above, there are different levels of "undefinedness" and that
matters in this case. KCSAN is complaining that the value passed to __ffs()
function may be different one from the one tested in the condition before
it. Depending on exact __ffs() behavior this may be fine or it may be not.
> But for some other people it may be a true error, and now they'll have
> no chance to catch it if KCSAN is forced to ignore find_bit() entirely.
I agree some people may accidentally use bitmap function unlocked without
properly handling the races. However in this case KCSAN does not warn about
unsafe use of the result from find_bit() (which is what should happen for
those unsafe uses). It complains about unsafe internal implementation of
find_bit() when it is used without external synchronization. These two are
different things so I don't think this is a good argument for leaving the
race in find_bit().
Furthermore I'd note that READ_ONCE() does not make KCSAN ignore find_bit()
completely. READ_ONCE() forces the compiler to use the same value for the
test and __ffs() argument (by telling it it cannot assume the standard C
memory model using "volatile" keyword for this fetch). That's all. That
makes it impossible for KCSAN to inject a modification of the bitmap &
refetch from memory inbetween the two uses of the local variable and thus
it doesn't generate the warning anymore.
> We've got the whole class of lockless algorithms that allow safe concurrent
> access to the memory. And now that there's a tool that searches for them
> (concurrent accesses), we need to have an option to somehow teach it
> to suppress irrelevant warnings. Maybe something like this?
>
> lockless_algorithm_begin(bitmap, bitmap_size(nbits));
> do {
> bit = find_first_bit(bitmap, nbits);
> if (bit >= nbits)
> break;
> } while (!test_and_clear_bit(bit, bitmap));
> lockless_algorithm_end(bitmap, bitmap_size(nbits));
>
> And, of course, as I suggested a couple iterations ago, you can invent
> a thread-safe version of find_bit(), that would be perfectly correct
> for lockless use:
>
> unsigned long _find_and_clear_bit(volatile unsigned long *addr, unsigned long size)
> {
> unsigned long bit = 0;
>
> while (!test_and_clear_bit(bit, bitmap) {
> bit = FIND_FIRST_BIT(addr[idx], /* nop */, size);
> if (bit >= size)
> return size;
> }
>
> return bit;
> }
>
> Didn't test that, but I hope 'volatile' specifier should be enough
> for compiler to realize that it shouldn't optimize memory access, and
> for KCSAN that everything's OK here.
Based on my research regarding __ffs() we indeed do need find_*_bit()
functions that are guaranteed to return number in 0-n range even in
presence of concurrent bitmap modifications. Do I get it right you'd
rather prefer cloning all the find_*_bit() implementations to create
such variants? IMO that's worse both in terms of maintainability (more
code) and usability (users have to be aware special functions are needed
for lockless code) so the 2% of performance overhead until gcc is fixed
isn't IMO worth it but you are the maintainer...
Honza
--
Jan Kara <[email protected]>
SUSE Labs, CR