2010-11-11 13:49:58

by Cyberman Wu

[permalink] [raw]
Subject: Kernel rwlock design, Multicore and IGMP

I'm using TILEPro, and its kernel rwlock is a little different from other
platforms'. It gives priority to the write lock: once a write lock is
attempted, it blocks subsequent read locks even if the read lock is
already held by others. The code can be read in Linux kernel 2.6.36 in
arch/tile/lib/spinlock_32.c.

That difference can cause a deadlock in the kernel if we join/leave
multicast groups simultaneously and frequently on multiple cores. The
IGMP message is sent by

igmp_ifc_timer_expire() -> igmpv3_send_cr() -> igmpv3_sendpack()

in the timer interrupt. igmpv3_send_cr() generates the sk_buff for the
IGMP message with mc_list_lock read-locked and then calls
igmpv3_sendpack() with it unlocked.
But if there are so many join/leave messages to generate that they can't
be sent in one sk_buff, then igmpv3_send_cr() -> add_grec() will call
igmpv3_sendpack() to send it and reallocate a new buffer. When the
message is sent:

__mkroute_output() -> ip_check_mc()

will read-lock mc_list_lock again. If another core tries to write-lock
mc_list_lock between the two read locks, a deadlock occurs, as sketched
below.
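
A simplified illustration of the two racing paths (hypothetical code
only, not the actual kernel source; the global lock stands in for
in_dev->mc_list_lock):

#include <linux/spinlock.h>

static DEFINE_RWLOCK(mc_list_lock);	/* stands in for in_dev->mc_list_lock */

/* CPU 0: timer interrupt path */
static void cpu0_send_reports(void)
{
	read_lock(&mc_list_lock);	/* taken in igmpv3_send_cr() */
	/* ... add_grec() fills the skb and has to flush it right now ... */
	read_lock(&mc_list_lock);	/* ip_check_mc(): nested read lock.
					 * Harmless if readers always share,
					 * but blocked on TILEPro once a
					 * writer is already waiting. */
	read_unlock(&mc_list_lock);
	read_unlock(&mc_list_lock);
}

/* CPU 1: a concurrent join/leave */
static void cpu1_join_leave(void)
{
	write_lock_bh(&mc_list_lock);	/* spins on CPU 0's first read lock,
					 * and on TILEPro also blocks CPU 0's
					 * second read_lock() -> deadlock */
	write_unlock_bh(&mc_list_lock);
}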

The rwlock on other platforms I've checked, say PowerPC, x86 and ARM, is
just a shared read lock and an exclusive write lock, so if we hold a read
lock the write lock will simply wait, and taking the read lock again will
succeed.

So, what are the criteria for rwlock design in the Linux kernel? Is the
nested read lock in IGMP a design error in the Linux kernel, or does the
read lock have to be designed like that?

There is another thing: the timer interrupt will start a timer on the
same in_dev; should that be optimized?

BTW: if we have many cores, say 64, are there other things we have to
think about for spinlocks? If contention occurs, should we just read the
shared memory again and again, or is a very small 'delay' better? I've
seen relax() called in the spinlock implementation on the TILEPro
platform.
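
For reference, the kind of spin loop I'm asking about looks roughly like
this (only a sketch of test-and-test-and-set with a relax hint, not
TILEPro's actual code):

#include <asm/atomic.h>
#include <asm/processor.h>	/* cpu_relax() */

static inline void sketch_spin_lock(atomic_t *lock)
{
	while (atomic_cmpxchg(lock, 0, 1) != 0) {	/* try to grab the lock */
		while (atomic_read(lock) != 0)		/* re-read shared memory */
			cpu_relax();			/* or a tiny delay, like relax() */
	}
}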


2010-11-11 15:23:33

by Eric Dumazet

[permalink] [raw]
Subject: Re: Kernel rwlock design, Multicore and IGMP

Le jeudi 11 novembre 2010 à 21:49 +0800, Cypher Wu a écrit :

Hi

CC netdev, since you ask questions about network stuff _and_ rwlock


> I'm using TILEPro and its rwlock in kernel is a liitle different than
> other platforms. It have a priority for write lock that when tried it
> will block the following read lock even if read lock is hold by
> others. Its code can be read in Linux Kernel 2.6.36 in
> arch/tile/lib/spinlock_32.c.

This seems like a bug to me.

read_lock() can be nested. We used such a scheme in the past in iptables
(it can re-enter itself), and we instead used a spinlock(), but only
after many discussions on lkml, with Linus himself if I remember
correctly.


>
> That different could cause a deadlock in kernel if we join/leave
> Multicast Group simultaneous and frequently on mutlicores. IGMP
> message is sent by
>
> igmp_ifc_timer_expire() -> igmpv3_send_cr() -> igmpv3_sendpack()
>
> in timer interrupt, igmpv3_send_cr() will generate the sk_buff for
> IGMP message with mc_list_lock read locked and then call
> igmpv3_sendpack() with it unlocked.
> But if we have so many join/leave messages have to generate and it
> can't be sent in one sk_buff then igmpv3_send_cr() -> add_grec() will
> call igmpv3_sendpack() to send it and reallocate a new buffer. When
> the message is sent:
>
> __mkroute_output() -> ip_check_mc()
>
> will read lock mc_list_lock again. If there is another core is try
> write lock mc_list_lock between the two read lock, then deadlock
> ocurred.
>
> The rwlock on other platforms I've check, say, PowerPC, x86, ARM, is
> just read lock shared and write_lock mutex, so if we've hold read lock
> the write lock will just wait, and if there have a read lock again it
> will success.
>
> So, What's the criteria of rwlock design in the Linux kernel? Is that
> read lock re-hold of IGMP a design error in Linux kernel, or the read
> lock has to be design like that?
>

Well, we try to get rid of all rwlocks in performance-critical sections.

I would say, if you believe this one rwlock can justify the special TILE
behavior you describe, then we should instead migrate this rwlock to an
RCU + spinlock scheme (so that all arches benefit from this work, not
only TILE).

> There is a other thing, that the timer interrupt will start timer on
> the same in_dev, should that be optimized?
>

Not sure I understand what you mean.

> BTW: If we have so many cores, say 64, is there other things we have
> to think about spinlock? If there have collisions ocurred, should we
> just read the shared memory again and again, or just a very little
> 'delay' is better? I've seen relax() is called in the implementation
> of spinlock on TILEPro platform.
> --

Is TILE using ticket spinlocks ?

2010-11-11 15:32:27

by Eric Dumazet

[permalink] [raw]
Subject: Re: Kernel rwlock design, Multicore and IGMP

Le jeudi 11 novembre 2010 à 16:23 +0100, Eric Dumazet a écrit :
> Le jeudi 11 novembre 2010 à 21:49 +0800, Cypher Wu a écrit :
>
> Hi
>
> CC netdev, since you ask questions about network stuff _and_ rwlock
>
>
> > I'm using TILEPro and its rwlock in kernel is a liitle different than
> > other platforms. It have a priority for write lock that when tried it
> > will block the following read lock even if read lock is hold by
> > others. Its code can be read in Linux Kernel 2.6.36 in
> > arch/tile/lib/spinlock_32.c.
>
> This seems a bug to me.
>
> read_lock() can be nested. We used such a schem in the past in iptables
> (it can re-enter itself),
> and we used instead a spinlock(), but with many discussions with lkml
> and Linus himself if I remember well.

I meant a percpu spinlock, with extra logic to spin_lock() it only once,
even if nested:

static inline void xt_info_rdlock_bh(void)
{
	struct xt_info_lock *lock;

	local_bh_disable();
	lock = &__get_cpu_var(xt_info_locks);
	if (likely(!lock->readers++))
		spin_lock(&lock->lock);
}

static inline void xt_info_rdunlock_bh(void)
{
	struct xt_info_lock *lock = &__get_cpu_var(xt_info_locks);

	if (likely(!--lock->readers))
		spin_unlock(&lock->lock);
	local_bh_enable();
}

The write 'rwlock' side has to lock the percpu spinlock of all possible
cpus.

/*
 * The "writer" side needs to get exclusive access to the lock,
 * regardless of readers. This must be called with bottom half
 * processing (and thus also preemption) disabled.
 */
static inline void xt_info_wrlock(unsigned int cpu)
{
	spin_lock(&per_cpu(xt_info_locks, cpu).lock);
}

static inline void xt_info_wrunlock(unsigned int cpu)
{
	spin_unlock(&per_cpu(xt_info_locks, cpu).lock);
}
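
For illustration, the writer side would then walk every possible CPU and
take each per-cpu lock around its update. The loop below is only my
sketch of that idea; update_one_cpu() is a hypothetical placeholder, not
the iptables code itself:

#include <linux/netfilter/x_tables.h>

static void update_one_cpu(struct xt_table_info *newinfo, unsigned int cpu)
{
	/* hypothetical placeholder for the per-cpu update work */
}

static void sketch_update_all_cpus(struct xt_table_info *newinfo)
{
	unsigned int cpu;

	local_bh_disable();
	for_each_possible_cpu(cpu) {
		xt_info_wrlock(cpu);		/* readers on 'cpu' now wait */
		update_one_cpu(newinfo, cpu);
		xt_info_wrunlock(cpu);
	}
	local_bh_enable();
}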


2010-11-12 03:33:04

by Cyberman Wu

[permalink] [raw]
Subject: Re: Kernel rwlock design, Multicore and IGMP

On Thu, Nov 11, 2010 at 11:23 PM, Eric Dumazet <[email protected]> wrote:
> Le jeudi 11 novembre 2010 à 21:49 +0800, Cypher Wu a écrit :
>
> Hi
>
> CC netdev, since you ask questions about network stuff _and_ rwlock
>
>
>> I'm using TILEPro and its rwlock in kernel is a liitle different than
>> other platforms. It have a priority for write lock that when tried it
>> will block the following read lock even if read lock is hold by
>> others. Its code can be read in Linux Kernel 2.6.36 in
>> arch/tile/lib/spinlock_32.c.
>
> This seems a bug to me.
>
> read_lock() can be nested. We used such a schem in the past in iptables
> (it can re-enter itself),
> and we used instead a spinlock(), but with many discussions with lkml
> and Linus himself if I remember well.
>
Whether read_lock() can be nested doesn't seem to be the problem, since
an rwlock has no 'owner'; the question is whether write_lock() should be
given priority over read_lock(), since a stream of read_lock()s would
otherwise starve write_lock().
We should work out a well-defined behavior so that every
platform-dependent raw_rwlock is designed under the same principle.
>
>>
>> That different could cause a deadlock in kernel if we join/leave
>> Multicast Group simultaneous and frequently on mutlicores. IGMP
>> message is sent by
>>
>> igmp_ifc_timer_expire() -> igmpv3_send_cr() -> igmpv3_sendpack()
>>
>> in timer interrupt, igmpv3_send_cr() will generate the sk_buff for
>> IGMP message with mc_list_lock read locked and then call
>> igmpv3_sendpack() with it unlocked.
>> But if we have so many join/leave messages have to generate and it
>> can't be sent in one sk_buff then igmpv3_send_cr() -> add_grec() will
>> call igmpv3_sendpack() to send it and reallocate a new buffer. When
>> the message is sent:
>>
>> __mkroute_output() -> ip_check_mc()
>>
>> will read lock mc_list_lock again. If there is another core is try
>> write lock mc_list_lock between the two read lock, then deadlock
>> ocurred.
>>
>> The rwlock on other platforms I've check, say, PowerPC, x86, ARM, is
>> just read lock shared and write_lock mutex, so if we've hold read lock
>> the write lock will just wait, and if there have a read lock again it
>> will success.
>>
>> So, What's the criteria of rwlock design in the Linux kernel? Is that
>> read lock re-hold of IGMP a design error in Linux kernel, or the read
>> lock has to be design like that?
>>
>
> Well, we try to get rid of all rwlocks in performance critical sections.
>
> I would say, if you believe one rwlock can justify the special TILE
> behavior you tried to make, then we should instead migrate this rwlock
> to a RCU + spinlock schem (so that all arches benefit from this work,
> not only TILE)

IGMP is not very performance-critical, so maybe an rwlock is enough?

>
>> There is a other thing, that the timer interrupt will start timer on
>> the same in_dev, should that be optimized?
>>
>
> Not sure I understand what you mean.

Since mc_list & mc_tomb are lists shared in the kernel, we needn't start
multiple timers on different cores for them; we could synchronize them on
one core.

>
>> BTW: If we have so many cores, say 64, is there other things we have
>> to think about spinlock? If there have collisions ocurred, should we
>> just read the shared memory again and again, or just a very little
>> 'delay' is better? I've seen relax() is called in the implementation
>> of spinlock on TILEPro platform.
>> --
>
> Is TILE using ticket spinlocks ?
>

What does 'ticket spinlock' mean? Could you explain a little, please? :)
I'm not very familiar with the Linux kernel; I'm trying to understand it
better since I'm working on a Linux platform now.

>
>

2010-11-12 06:24:19

by Cong Wang

[permalink] [raw]
Subject: Re: Kernel rwlock design, Multicore and IGMP

On Fri, Nov 12, 2010 at 11:32:59AM +0800, Cypher Wu wrote:
>>
>> Is TILE using ticket spinlocks ?
>>

Eric, yes.


>
>What ticket spinlocks means? Could you explain a little, pls:) I'm not
>familiar with Linux kernel, I'm trying to get more understanding of it
>since I'm working on Linux platform now.
>

You might want to search "ticket spinlock" with google. :)

Briefly speaking, it was introduced to solve the unfairness
of the old spinlock implementation. In the Linux kernel, not every
arch implements it. The x86 implementation was done by Nick in
asm, while tile implements it in C.
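
A minimal userspace sketch of the ticket idea (illustration only, not the
kernel's code; it uses GCC atomic builtins so it stays self-contained):

struct ticket_lock {
	unsigned int next;	/* next ticket number to hand out */
	unsigned int owner;	/* ticket currently being served */
};

static void ticket_lock(struct ticket_lock *lk)
{
	/* atomically take a ticket; arrival order decides lock order */
	unsigned int me = __atomic_fetch_add(&lk->next, 1, __ATOMIC_ACQUIRE);

	while (__atomic_load_n(&lk->owner, __ATOMIC_ACQUIRE) != me)
		;	/* the kernel would call cpu_relax() here */
}

static void ticket_unlock(struct ticket_lock *lk)
{
	/* serve the next waiter in FIFO order, which is the fairness point */
	__atomic_fetch_add(&lk->owner, 1, __ATOMIC_RELEASE);
}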

2010-11-12 07:08:41

by Cong Wang

[permalink] [raw]
Subject: Re: Kernel rwlock design, Multicore and IGMP

On Fri, Nov 12, 2010 at 11:32:59AM +0800, Cypher Wu wrote:
>On Thu, Nov 11, 2010 at 11:23 PM, Eric Dumazet <[email protected]> wrote:
>> Le jeudi 11 novembre 2010 à 21:49 +0800, Cypher Wu a écrit :
>>
>> Hi
>>
>> CC netdev, since you ask questions about network stuff _and_ rwlock
>>
>>
>>> I'm using TILEPro and its rwlock in kernel is a liitle different than
>>> other platforms. It have a priority for write lock that when tried it
>>> will block the following read lock even if read lock is hold by
>>> others. Its code can be read in Linux Kernel 2.6.36 in
>>> arch/tile/lib/spinlock_32.c.
>>
>> This seems a bug to me.
>>
>> read_lock() can be nested. We used such a schem in the past in iptables
>> (it can re-enter itself),
>> and we used instead a spinlock(), but with many discussions with lkml
>> and Linus himself if I remember well.
>>
>It seems not a problem that read_lock() can be nested or not since
>rwlock doesn't have 'owner', it's just that should we give
>write_lock() a priority than read_lock() since if there have a lot
>read_lock()s then they'll starve write_lock().
>We should work out a well defined behavior so all the
>platform-dependent raw_rwlock has to design under that principle.

It is a known weakness of rwlock; it is designed like that. :)

The solution is to use RCU or seqlock, but I don't think seqlock
is appropriate for the case you described. So, try RCU.

2010-11-12 07:28:00

by Eric Dumazet

[permalink] [raw]
Subject: Re: Kernel rwlock design, Multicore and IGMP

Le vendredi 12 novembre 2010 à 15:13 +0800, Américo Wang a écrit :
> On Fri, Nov 12, 2010 at 11:32:59AM +0800, Cypher Wu wrote:
> >On Thu, Nov 11, 2010 at 11:23 PM, Eric Dumazet <[email protected]> wrote:
> >> Le jeudi 11 novembre 2010 à 21:49 +0800, Cypher Wu a écrit :
> >>
> >> Hi
> >>
> >> CC netdev, since you ask questions about network stuff _and_ rwlock
> >>
> >>
> >>> I'm using TILEPro and its rwlock in kernel is a liitle different than
> >>> other platforms. It have a priority for write lock that when tried it
> >>> will block the following read lock even if read lock is hold by
> >>> others. Its code can be read in Linux Kernel 2.6.36 in
> >>> arch/tile/lib/spinlock_32.c.
> >>
> >> This seems a bug to me.
> >>
> >> read_lock() can be nested. We used such a schem in the past in iptables
> >> (it can re-enter itself),
> >> and we used instead a spinlock(), but with many discussions with lkml
> >> and Linus himself if I remember well.
> >>
> >It seems not a problem that read_lock() can be nested or not since
> >rwlock doesn't have 'owner', it's just that should we give
> >write_lock() a priority than read_lock() since if there have a lot
> >read_lock()s then they'll starve write_lock().
> >We should work out a well defined behavior so all the
> >platform-dependent raw_rwlock has to design under that principle.
>

AFAIK, Lockdep allows read_lock() to be nested.

> It is a known weakness of rwlock, it is designed like that. :)
>

Agreed.

> The solution is to use RCU or seqlock, but I don't think seqlock
> is proper for this case you described. So, try RCU lock.

In the IGMP case, it should be easy for the task owning the read_lock()
to pass a parameter to the called function saying 'I already own the
read_lock(), don't try to re-acquire it'.
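
A minimal sketch of that pattern, with hypothetical names (the actual
call path reaches ip_check_mc() only indirectly through the routing code,
so the flag would have to be threaded through several layers):

#include <linux/inetdevice.h>

/* Hypothetical placeholder for the real lookup over in_dev->mc_list. */
static int __check_mc(struct in_device *in_dev, __be32 mc_addr)
{
	return 0;
}

/* Hypothetical illustration: thread a 'locked' flag down the call chain. */
static int check_mc(struct in_device *in_dev, __be32 mc_addr, bool already_locked)
{
	int rv;

	if (!already_locked)
		read_lock(&in_dev->mc_list_lock);
	rv = __check_mc(in_dev, mc_addr);
	if (!already_locked)
		read_unlock(&in_dev->mc_list_lock);
	return rv;
}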

A RCU conversion is far more complex.


2010-11-12 08:15:25

by Cong Wang

[permalink] [raw]
Subject: Re: Kernel rwlock design, Multicore and IGMP

On Fri, Nov 12, 2010 at 08:27:54AM +0100, Eric Dumazet wrote:
>Le vendredi 12 novembre 2010 à 15:13 +0800, Américo Wang a écrit :
>> On Fri, Nov 12, 2010 at 11:32:59AM +0800, Cypher Wu wrote:
>> >On Thu, Nov 11, 2010 at 11:23 PM, Eric Dumazet <[email protected]> wrote:
>> >> Le jeudi 11 novembre 2010 à 21:49 +0800, Cypher Wu a écrit :
>> >>
>> >> Hi
>> >>
>> >> CC netdev, since you ask questions about network stuff _and_ rwlock
>> >>
>> >>
>> >>> I'm using TILEPro and its rwlock in kernel is a liitle different than
>> >>> other platforms. It have a priority for write lock that when tried it
>> >>> will block the following read lock even if read lock is hold by
>> >>> others. Its code can be read in Linux Kernel 2.6.36 in
>> >>> arch/tile/lib/spinlock_32.c.
>> >>
>> >> This seems a bug to me.
>> >>
>> >> read_lock() can be nested. We used such a schem in the past in iptables
>> >> (it can re-enter itself),
>> >> and we used instead a spinlock(), but with many discussions with lkml
>> >> and Linus himself if I remember well.
>> >>
>> >It seems not a problem that read_lock() can be nested or not since
>> >rwlock doesn't have 'owner', it's just that should we give
>> >write_lock() a priority than read_lock() since if there have a lot
>> >read_lock()s then they'll starve write_lock().
>> >We should work out a well defined behavior so all the
>> >platform-dependent raw_rwlock has to design under that principle.
>>
>
>AFAIK, Lockdep allows read_lock() to be nested.
>
>> It is a known weakness of rwlock, it is designed like that. :)
>>
>
>Agreed.
>

Just for the record, both Tile and x86 implement rwlock with a write
bias; this somewhat reduces the write-starvation problem.


>> The solution is to use RCU or seqlock, but I don't think seqlock
>> is proper for this case you described. So, try RCU lock.
>
>In the IGMP case, it should be easy for the task owning a read_lock() to
>pass a parameter to the called function saying 'I already own the
>read_lock(), dont try to re-acquire it'
>
>A RCU conversion is far more complex.
>

Yup.

2010-11-12 09:09:51

by Yong Zhang

[permalink] [raw]
Subject: Re: Kernel rwlock design, Multicore and IGMP

On Fri, Nov 12, 2010 at 4:19 PM, Américo Wang <[email protected]> wrote:
> On Fri, Nov 12, 2010 at 08:27:54AM +0100, Eric Dumazet wrote:
>>Le vendredi 12 novembre 2010 à 15:13 +0800, Américo Wang a écrit :
>>> On Fri, Nov 12, 2010 at 11:32:59AM +0800, Cypher Wu wrote:
>>> >On Thu, Nov 11, 2010 at 11:23 PM, Eric Dumazet <[email protected]> wrote:
>>> >> Le jeudi 11 novembre 2010 à 21:49 +0800, Cypher Wu a écrit :
>>> >>
>>> >> Hi
>>> >>
>>> >> CC netdev, since you ask questions about network stuff _and_ rwlock
>>> >>
>>> >>
>>> >>> I'm using TILEPro and its rwlock in kernel is a liitle different than
>>> >>> other platforms. It have a priority for write lock that when tried it
>>> >>> will block the following read lock even if read lock is hold by
>>> >>> others. Its code can be read in Linux Kernel 2.6.36 in
>>> >>> arch/tile/lib/spinlock_32.c.
>>> >>
>>> >> This seems a bug to me.
>>> >>
>>> >> read_lock() can be nested. We used such a schem in the past in iptables
>>> >> (it can re-enter itself),
>>> >> and we used instead a spinlock(), but with many discussions with lkml
>>> >> and Linus himself if I remember well.
>>> >>
>>> >It seems not a problem that read_lock() can be nested or not since
>>> >rwlock doesn't have 'owner', it's just that should we give
>>> >write_lock() a priority than read_lock() since if there have a lot
>>> >read_lock()s then they'll starve write_lock().
>>> >We should work out a well defined behavior so all the
>>> >platform-dependent raw_rwlock has to design under that principle.
>>>
>>
>>AFAIK, Lockdep allows read_lock() to be nested.
>>
>>> It is a known weakness of rwlock, it is designed like that. :)
>>>
>>
>>Agreed.
>>
>
> Just for record, both Tile and X86 implement rwlock with a write-bias,
> this somewhat reduces the write-starvation problem.

Are you sure (on x86)?

It seems that we never implemented a writer-biased rwlock.

Thanks,
Yong
>
>
>>> The solution is to use RCU or seqlock, but I don't think seqlock
>>> is proper for this case you described. So, try RCU lock.
>>
>>In the IGMP case, it should be easy for the task owning a read_lock() to
>>pass a parameter to the called function saying 'I already own the
>>read_lock(), dont try to re-acquire it'
>>
>>A RCU conversion is far more complex.
>>
>
> Yup.

2010-11-12 09:13:40

by Cong Wang

[permalink] [raw]
Subject: Re: Kernel rwlock design, Multicore and IGMP

On Fri, Nov 12, 2010 at 05:09:45PM +0800, Yong Zhang wrote:
>On Fri, Nov 12, 2010 at 4:19 PM, Américo Wang <[email protected]> wrote:
>> On Fri, Nov 12, 2010 at 08:27:54AM +0100, Eric Dumazet wrote:
>>>Le vendredi 12 novembre 2010 à 15:13 +0800, Américo Wang a écrit :
>>>> On Fri, Nov 12, 2010 at 11:32:59AM +0800, Cypher Wu wrote:
>>>> >On Thu, Nov 11, 2010 at 11:23 PM, Eric Dumazet <[email protected]> wrote:
>>>> >> Le jeudi 11 novembre 2010 à 21:49 +0800, Cypher Wu a écrit :
>>>> >>
>>>> >> Hi
>>>> >>
>>>> >> CC netdev, since you ask questions about network stuff _and_ rwlock
>>>> >>
>>>> >>
>>>> >>> I'm using TILEPro and its rwlock in kernel is a liitle different than
>>>> >>> other platforms. It have a priority for write lock that when tried it
>>>> >>> will block the following read lock even if read lock is hold by
>>>> >>> others. Its code can be read in Linux Kernel 2.6.36 in
>>>> >>> arch/tile/lib/spinlock_32.c.
>>>> >>
>>>> >> This seems a bug to me.
>>>> >>
>>>> >> read_lock() can be nested. We used such a schem in the past in iptables
>>>> >> (it can re-enter itself),
>>>> >> and we used instead a spinlock(), but with many discussions with lkml
>>>> >> and Linus himself if I remember well.
>>>> >>
>>>> >It seems not a problem that read_lock() can be nested or not since
>>>> >rwlock doesn't have 'owner', it's just that should we give
>>>> >write_lock() a priority than read_lock() since if there have a lot
>>>> >read_lock()s then they'll starve write_lock().
>>>> >We should work out a well defined behavior so all the
>>>> >platform-dependent raw_rwlock has to design under that principle.
>>>>
>>>
>>>AFAIK, Lockdep allows read_lock() to be nested.
>>>
>>>> It is a known weakness of rwlock, it is designed like that. :)
>>>>
>>>
>>>Agreed.
>>>
>>
>> Just for record, both Tile and X86 implement rwlock with a write-bias,
>> this somewhat reduces the write-starvation problem.
>
>Are you sure(on x86)?
>
>It seems that we never realize writer-bias rwlock.
>

Try

% grep RW_LOCK_BIAS -nr arch/x86

*And* read the code to see how it works. :)

Note, Tile uses a slightly different algorithm.

2010-11-12 09:22:46

by Eric Dumazet

[permalink] [raw]
Subject: Re: Kernel rwlock design, Multicore and IGMP

Le vendredi 12 novembre 2010 à 16:19 +0800, Américo Wang a écrit :
> On Fri, Nov 12, 2010 at 08:27:54AM +0100, Eric Dumazet wrote:

> >A RCU conversion is far more complex.
> >
>
> Yup.


Well, actually this is easy in this case.

I'll post a patch to do this RCU conversion.


2010-11-12 09:29:03

by Cong Wang

[permalink] [raw]
Subject: Re: Kernel rwlock design, Multicore and IGMP

On Fri, Nov 12, 2010 at 10:22:39AM +0100, Eric Dumazet wrote:
>Le vendredi 12 novembre 2010 à 16:19 +0800, Américo Wang a écrit :
>> On Fri, Nov 12, 2010 at 08:27:54AM +0100, Eric Dumazet wrote:
>
>> >A RCU conversion is far more complex.
>> >
>>
>> Yup.
>
>
>Well, actually this is easy in this case.
>
>I'll post a patch to do this RCU conversion.
>

Cool! Please keep me in Cc.

Some conversions from rwlock to RCU are indeed complex;
I don't know about this case.

Anyway, patch please. :)

2010-11-12 11:06:50

by Cyberman Wu

[permalink] [raw]
Subject: Re: Kernel rwlock design, Multicore and IGMP

2010/11/12 Américo Wang <[email protected]>:
> On Fri, Nov 12, 2010 at 05:09:45PM +0800, Yong Zhang wrote:
>>On Fri, Nov 12, 2010 at 4:19 PM, Américo Wang <[email protected]> wrote:
>>> On Fri, Nov 12, 2010 at 08:27:54AM +0100, Eric Dumazet wrote:
>>>>Le vendredi 12 novembre 2010 à 15:13 +0800, Américo Wang a écrit :
>>>>> On Fri, Nov 12, 2010 at 11:32:59AM +0800, Cypher Wu wrote:
>>>>> >On Thu, Nov 11, 2010 at 11:23 PM, Eric Dumazet <[email protected]> wrote:
>>>>> >> Le jeudi 11 novembre 2010 à 21:49 +0800, Cypher Wu a écrit :
>>>>> >>
>>>>> >> Hi
>>>>> >>
>>>>> >> CC netdev, since you ask questions about network stuff _and_ rwlock
>>>>> >>
>>>>> >>
>>>>> >>> I'm using TILEPro and its rwlock in kernel is a liitle different than
>>>>> >>> other platforms. It have a priority for write lock that when tried it
>>>>> >>> will block the following read lock even if read lock is hold by
>>>>> >>> others. Its code can be read in Linux Kernel 2.6.36 in
>>>>> >>> arch/tile/lib/spinlock_32.c.
>>>>> >>
>>>>> >> This seems a bug to me.
>>>>> >>
>>>>> >> read_lock() can be nested. We used such a schem in the past in iptables
>>>>> >> (it can re-enter itself),
>>>>> >> and we used instead a spinlock(), but with many discussions with lkml
>>>>> >> and Linus himself if I remember well.
>>>>> >>
>>>>> >It seems not a problem that read_lock() can be nested or not since
>>>>> >rwlock doesn't have 'owner', it's just that should we give
>>>>> >write_lock() a priority than read_lock() since if there have a lot
>>>>> >read_lock()s then they'll starve write_lock().
>>>>> >We should work out a well defined behavior so all the
>>>>> >platform-dependent raw_rwlock has to design under that principle.
>>>>>
>>>>
>>>>AFAIK, Lockdep allows read_lock() to be nested.
>>>>
>>>>> It is a known weakness of rwlock, it is designed like that. :)
>>>>>
>>>>
>>>>Agreed.
>>>>
>>>
>>> Just for record, both Tile and X86 implement rwlock with a write-bias,
>>> this somewhat reduces the write-starvation problem.
>>
>>Are you sure(on x86)?
>>
>>It seems that we never realize writer-bias rwlock.
>>
>
> Try
>
> % grep RW_LOCK_BIAS -nr arch/x86
>
> *And* read the code to see how it works. :)
>
> Note, on Tile, it uses a little different algorithm.
>

It seems that rwlock on x86 and tile behave differently. x86 uses
RW_LOCK_BIAS: when read_lock() is tried it tests whether the lock is 0,
and if so the read_lock() has to spin, otherwise it decrements the lock;
when write_lock() is tried it first checks whether the lock equals
RW_LOCK_BIAS, and if so it sets the lock to 0 and continues, otherwise it
spins.
I'm not very familiar with the x86 architecture, but the code seems to
work that way.

2010-11-12 11:11:01

by Cyberman Wu

[permalink] [raw]
Subject: Re: Kernel rwlock design, Multicore and IGMP

On Fri, Nov 12, 2010 at 3:27 PM, Eric Dumazet <[email protected]> wrote:
> Le vendredi 12 novembre 2010 à 15:13 +0800, Américo Wang a écrit :
>> On Fri, Nov 12, 2010 at 11:32:59AM +0800, Cypher Wu wrote:
>> >On Thu, Nov 11, 2010 at 11:23 PM, Eric Dumazet <[email protected]> wrote:
>> >> Le jeudi 11 novembre 2010 à 21:49 +0800, Cypher Wu a écrit :
>> >>
>> >> Hi
>> >>
>> >> CC netdev, since you ask questions about network stuff _and_ rwlock
>> >>
>> >>
>> >>> I'm using TILEPro and its rwlock in kernel is a liitle different than
>> >>> other platforms. It have a priority for write lock that when tried it
>> >>> will block the following read lock even if read lock is hold by
>> >>> others. Its code can be read in Linux Kernel 2.6.36 in
>> >>> arch/tile/lib/spinlock_32.c.
>> >>
>> >> This seems a bug to me.
>> >>
>> >> read_lock() can be nested. We used such a schem in the past in iptables
>> >> (it can re-enter itself),
>> >> and we used instead a spinlock(), but with many discussions with lkml
>> >> and Linus himself if I remember well.
>> >>
>> >It seems not a problem that read_lock() can be nested or not since
>> >rwlock doesn't have 'owner', it's just that should we give
>> >write_lock() a priority than read_lock() since if there have a lot
>> >read_lock()s then they'll starve write_lock().
>> >We should work out a well defined behavior so all the
>> >platform-dependent raw_rwlock has to design under that principle.
>>
>
> AFAIK, Lockdep allows read_lock() to be nested.
>
>> It is a known weakness of rwlock, it is designed like that. :)
>>
>
> Agreed.
>
>> The solution is to use RCU or seqlock, but I don't think seqlock
>> is proper for this case you described. So, try RCU lock.
>
> In the IGMP case, it should be easy for the task owning a read_lock() to
> pass a parameter to the called function saying 'I already own the
> read_lock(), dont try to re-acquire it'

I used to use that approach: separate the function into internal and
external versions, with the external one taking the lock and then calling
the internal one. But in this case ip_check_mc() is called indirectly
from igmpv3_sendpack(), and it's not clear how to pass the different
parameter down.

>
> A RCU conversion is far more complex.
>
>
>
>

2010-11-12 11:25:40

by Eric Dumazet

[permalink] [raw]
Subject: Re: Kernel rwlock design, Multicore and IGMP

Le vendredi 12 novembre 2010 à 19:10 +0800, Cypher Wu a écrit :

> I used to using that way, just seperate the call internal and
> external, with external one hold lock then call the internal one. But
> in that case ip_check_mc() is called indirectly from igmpv3_sendpack()
> and is not very clear how to give the different paramter?

I said I was preparing an RCU patch, don't bother with this ;)

It should be ready in a couple of minutes.

Thanks

2010-11-12 13:00:32

by Yong Zhang

[permalink] [raw]
Subject: Re: Kernel rwlock design, Multicore and IGMP

On Fri, Nov 12, 2010 at 05:18:18PM +0800, Américo Wang wrote:
> On Fri, Nov 12, 2010 at 05:09:45PM +0800, Yong Zhang wrote:
> >On Fri, Nov 12, 2010 at 4:19 PM, Américo Wang <[email protected]> wrote:
> >> On Fri, Nov 12, 2010 at 08:27:54AM +0100, Eric Dumazet wrote:
> >>>Le vendredi 12 novembre 2010 à 15:13 +0800, Américo Wang a écrit :
> >>>> On Fri, Nov 12, 2010 at 11:32:59AM +0800, Cypher Wu wrote:
> >>>> >On Thu, Nov 11, 2010 at 11:23 PM, Eric Dumazet <[email protected]> wrote:
> >>>> >> Le jeudi 11 novembre 2010 à 21:49 +0800, Cypher Wu a écrit :
> >>>> >>
> >>>> >> Hi
> >>>> >>
> >>>> >> CC netdev, since you ask questions about network stuff _and_ rwlock
> >>>> >>
> >>>> >>
> >>>> >>> I'm using TILEPro and its rwlock in kernel is a liitle different than
> >>>> >>> other platforms. It have a priority for write lock that when tried it
> >>>> >>> will block the following read lock even if read lock is hold by
> >>>> >>> others. Its code can be read in Linux Kernel 2.6.36 in
> >>>> >>> arch/tile/lib/spinlock_32.c.
> >>>> >>
> >>>> >> This seems a bug to me.
> >>>> >>
> >>>> >> read_lock() can be nested. We used such a schem in the past in iptables
> >>>> >> (it can re-enter itself),
> >>>> >> and we used instead a spinlock(), but with many discussions with lkml
> >>>> >> and Linus himself if I remember well.
> >>>> >>
> >>>> >It seems not a problem that read_lock() can be nested or not since
> >>>> >rwlock doesn't have 'owner', it's just that should we give
> >>>> >write_lock() a priority than read_lock() since if there have a lot
> >>>> >read_lock()s then they'll starve write_lock().
> >>>> >We should work out a well defined behavior so all the
> >>>> >platform-dependent raw_rwlock has to design under that principle.
> >>>>
> >>>
> >>>AFAIK, Lockdep allows read_lock() to be nested.
> >>>
> >>>> It is a known weakness of rwlock, it is designed like that. :)
> >>>>
> >>>
> >>>Agreed.
> >>>
> >>
> >> Just for record, both Tile and X86 implement rwlock with a write-bias,
> >> this somewhat reduces the write-starvation problem.
> >
> >Are you sure(on x86)?
> >
> >It seems that we never realize writer-bias rwlock.
> >
>
> Try
>
> % grep RW_LOCK_BIAS -nr arch/x86
>
> *And* read the code to see how it works. :)

If read_lock()/write_lock() fails, the subtracted value (1 for
read_lock() and RW_LOCK_BIAS for write_lock()) is added back, so readers
and writers contend on the same lock fairly.

And the RW_LOCK_BIAS-based rwlock is a variant of the signed-test rwlock,
so it works the same way as a highest-bit-set style rwlock.
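
A simplified C sketch of that counting scheme (illustration only; the
real x86 fast path is asm and the kernel's slow path differs from this
busy retry):

#define RW_LOCK_BIAS	0x01000000	/* initial value: no readers, no writer */

static void sketch_read_lock(atomic_t *count)
{
	while (atomic_sub_return(1, count) < 0)		/* negative: a writer holds it */
		atomic_add(1, count);			/* add the 1 back and retry */
}

static void sketch_write_lock(atomic_t *count)
{
	while (atomic_sub_return(RW_LOCK_BIAS, count) != 0)	/* readers or a writer present */
		atomic_add(RW_LOCK_BIAS, count);		/* add the bias back and retry */
}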

It seems you were misled by its name (RW_LOCK_BIAS). :)
Or am I missing something?

Thanks,
Yong

>
> Note, on Tile, it uses a little different algorithm.

2010-11-12 13:34:26

by Eric Dumazet

[permalink] [raw]
Subject: [PATCH net-next-2.6] igmp: RCU conversion of in_dev->mc_list

Le vendredi 12 novembre 2010 à 10:22 +0100, Eric Dumazet a écrit :
> Le vendredi 12 novembre 2010 à 16:19 +0800, Américo Wang a écrit :
> > On Fri, Nov 12, 2010 at 08:27:54AM +0100, Eric Dumazet wrote:
>
> > >A RCU conversion is far more complex.
> > >
> >
> > Yup.
>
>
> Well, actually this is easy in this case.
>
> I'll post a patch to do this RCU conversion.
>
>

Note: compile-tested only, I'd appreciate it if someone could test it ;)

Note: one patch from net-2.6 is not yet included in net-next-2.6, so
please make sure you have it before testing ;)

( http://git.kernel.org/?p=linux/kernel/git/davem/net-2.6.git;a=commitdiff;h=18943d292facbc70e6a36fc62399ae833f64671b )


Thanks

[PATCH net-next-2.6] igmp: RCU conversion of in_dev->mc_list

in_dev->mc_list is protected by one rwlock (in_dev->mc_list_lock).

This can easily be converted to a RCU protection.

Writers hold RTNL, so mc_list_lock is removed, not replaced by a
spinlock.

Signed-off-by: Eric Dumazet <[email protected]>
Cc: Cypher Wu <[email protected]>
Cc: Américo Wang <[email protected]>
---
include/linux/igmp.h | 12 +
include/linux/inetdevice.h | 5
include/net/inet_sock.h | 2
net/ipv4/igmp.c | 223 ++++++++++++++++-------------------
4 files changed, 115 insertions(+), 127 deletions(-)

diff --git a/include/linux/igmp.h b/include/linux/igmp.h
index 93fc244..7d16467 100644
--- a/include/linux/igmp.h
+++ b/include/linux/igmp.h
@@ -167,10 +167,10 @@ struct ip_sf_socklist {
*/

struct ip_mc_socklist {
- struct ip_mc_socklist *next;
+ struct ip_mc_socklist __rcu *next_rcu;
struct ip_mreqn multi;
unsigned int sfmode; /* MCAST_{INCLUDE,EXCLUDE} */
- struct ip_sf_socklist *sflist;
+ struct ip_sf_socklist __rcu *sflist;
struct rcu_head rcu;
};

@@ -186,11 +186,14 @@ struct ip_sf_list {
struct ip_mc_list {
struct in_device *interface;
__be32 multiaddr;
+ unsigned int sfmode;
struct ip_sf_list *sources;
struct ip_sf_list *tomb;
- unsigned int sfmode;
unsigned long sfcount[2];
- struct ip_mc_list *next;
+ union {
+ struct ip_mc_list *next;
+ struct ip_mc_list __rcu *next_rcu;
+ };
struct timer_list timer;
int users;
atomic_t refcnt;
@@ -201,6 +204,7 @@ struct ip_mc_list {
char loaded;
unsigned char gsquery; /* check source marks? */
unsigned char crcount;
+ struct rcu_head rcu;
};

/* V3 exponential field decoding */
diff --git a/include/linux/inetdevice.h b/include/linux/inetdevice.h
index ccd5b07..380ba6b 100644
--- a/include/linux/inetdevice.h
+++ b/include/linux/inetdevice.h
@@ -52,9 +52,8 @@ struct in_device {
atomic_t refcnt;
int dead;
struct in_ifaddr *ifa_list; /* IP ifaddr chain */
- rwlock_t mc_list_lock;
- struct ip_mc_list *mc_list; /* IP multicast filter chain */
- int mc_count; /* Number of installed mcasts */
+ struct ip_mc_list __rcu *mc_list; /* IP multicast filter chain */
+ int mc_count; /* Number of installed mcasts */
spinlock_t mc_tomb_lock;
struct ip_mc_list *mc_tomb;
unsigned long mr_v1_seen;
diff --git a/include/net/inet_sock.h b/include/net/inet_sock.h
index 1989cfd..8945f9f 100644
--- a/include/net/inet_sock.h
+++ b/include/net/inet_sock.h
@@ -141,7 +141,7 @@ struct inet_sock {
nodefrag:1;
int mc_index;
__be32 mc_addr;
- struct ip_mc_socklist *mc_list;
+ struct ip_mc_socklist __rcu *mc_list;
struct {
unsigned int flags;
unsigned int fragsize;
diff --git a/net/ipv4/igmp.c b/net/ipv4/igmp.c
index 08d0d81..ff4e5fd 100644
--- a/net/ipv4/igmp.c
+++ b/net/ipv4/igmp.c
@@ -149,11 +149,17 @@ static void ip_mc_clear_src(struct ip_mc_list *pmc);
static int ip_mc_add_src(struct in_device *in_dev, __be32 *pmca, int sfmode,
int sfcount, __be32 *psfsrc, int delta);

+
+static void ip_mc_list_reclaim(struct rcu_head *head)
+{
+ kfree(container_of(head, struct ip_mc_list, rcu));
+}
+
static void ip_ma_put(struct ip_mc_list *im)
{
if (atomic_dec_and_test(&im->refcnt)) {
in_dev_put(im->interface);
- kfree(im);
+ call_rcu(&im->rcu, ip_mc_list_reclaim);
}
}

@@ -163,7 +169,7 @@ static void ip_ma_put(struct ip_mc_list *im)
* Timer management
*/

-static __inline__ void igmp_stop_timer(struct ip_mc_list *im)
+static void igmp_stop_timer(struct ip_mc_list *im)
{
spin_lock_bh(&im->lock);
if (del_timer(&im->timer))
@@ -496,14 +502,24 @@ empty_source:
return skb;
}

+#define for_each_pmc_rcu(in_dev, pmc) \
+ for (pmc = rcu_dereference(in_dev->mc_list); \
+ pmc != NULL; \
+ pmc = rcu_dereference(pmc->next_rcu))
+
+#define for_each_pmc_rtnl(in_dev, pmc) \
+ for (pmc = rtnl_dereference(in_dev->mc_list); \
+ pmc != NULL; \
+ pmc = rtnl_dereference(pmc->next_rcu))
+
static int igmpv3_send_report(struct in_device *in_dev, struct ip_mc_list *pmc)
{
struct sk_buff *skb = NULL;
int type;

if (!pmc) {
- read_lock(&in_dev->mc_list_lock);
- for (pmc=in_dev->mc_list; pmc; pmc=pmc->next) {
+ rcu_read_lock();
+ for_each_pmc_rcu(in_dev, pmc) {
if (pmc->multiaddr == IGMP_ALL_HOSTS)
continue;
spin_lock_bh(&pmc->lock);
@@ -514,7 +530,7 @@ static int igmpv3_send_report(struct in_device *in_dev, struct ip_mc_list *pmc)
skb = add_grec(skb, pmc, type, 0, 0);
spin_unlock_bh(&pmc->lock);
}
- read_unlock(&in_dev->mc_list_lock);
+ rcu_read_unlock();
} else {
spin_lock_bh(&pmc->lock);
if (pmc->sfcount[MCAST_EXCLUDE])
@@ -556,7 +572,7 @@ static void igmpv3_send_cr(struct in_device *in_dev)
struct sk_buff *skb = NULL;
int type, dtype;

- read_lock(&in_dev->mc_list_lock);
+ rcu_read_lock();
spin_lock_bh(&in_dev->mc_tomb_lock);

/* deleted MCA's */
@@ -593,7 +609,7 @@ static void igmpv3_send_cr(struct in_device *in_dev)
spin_unlock_bh(&in_dev->mc_tomb_lock);

/* change recs */
- for (pmc=in_dev->mc_list; pmc; pmc=pmc->next) {
+ for_each_pmc_rcu(in_dev, pmc) {
spin_lock_bh(&pmc->lock);
if (pmc->sfcount[MCAST_EXCLUDE]) {
type = IGMPV3_BLOCK_OLD_SOURCES;
@@ -616,7 +632,7 @@ static void igmpv3_send_cr(struct in_device *in_dev)
}
spin_unlock_bh(&pmc->lock);
}
- read_unlock(&in_dev->mc_list_lock);
+ rcu_read_unlock();

if (!skb)
return;
@@ -813,14 +829,14 @@ static void igmp_heard_report(struct in_device *in_dev, __be32 group)
if (group == IGMP_ALL_HOSTS)
return;

- read_lock(&in_dev->mc_list_lock);
- for (im=in_dev->mc_list; im!=NULL; im=im->next) {
+ rcu_read_lock();
+ for_each_pmc_rcu(in_dev, im) {
if (im->multiaddr == group) {
igmp_stop_timer(im);
break;
}
}
- read_unlock(&in_dev->mc_list_lock);
+ rcu_read_unlock();
}

static void igmp_heard_query(struct in_device *in_dev, struct sk_buff *skb,
@@ -906,8 +922,8 @@ static void igmp_heard_query(struct in_device *in_dev, struct sk_buff *skb,
* - Use the igmp->igmp_code field as the maximum
* delay possible
*/
- read_lock(&in_dev->mc_list_lock);
- for (im=in_dev->mc_list; im!=NULL; im=im->next) {
+ rcu_read_lock();
+ for_each_pmc_rcu(in_dev, im) {
int changed;

if (group && group != im->multiaddr)
@@ -925,7 +941,7 @@ static void igmp_heard_query(struct in_device *in_dev, struct sk_buff *skb,
if (changed)
igmp_mod_timer(im, max_delay);
}
- read_unlock(&in_dev->mc_list_lock);
+ rcu_read_unlock();
}

/* called in rcu_read_lock() section */
@@ -1110,8 +1126,8 @@ static void igmpv3_clear_delrec(struct in_device *in_dev)
kfree(pmc);
}
/* clear dead sources, too */
- read_lock(&in_dev->mc_list_lock);
- for (pmc=in_dev->mc_list; pmc; pmc=pmc->next) {
+ rcu_read_lock();
+ for_each_pmc_rcu(in_dev, pmc) {
struct ip_sf_list *psf, *psf_next;

spin_lock_bh(&pmc->lock);
@@ -1123,7 +1139,7 @@ static void igmpv3_clear_delrec(struct in_device *in_dev)
kfree(psf);
}
}
- read_unlock(&in_dev->mc_list_lock);
+ rcu_read_unlock();
}
#endif

@@ -1209,7 +1225,7 @@ void ip_mc_inc_group(struct in_device *in_dev, __be32 addr)

ASSERT_RTNL();

- for (im=in_dev->mc_list; im; im=im->next) {
+ for_each_pmc_rtnl(in_dev, im) {
if (im->multiaddr == addr) {
im->users++;
ip_mc_add_src(in_dev, &addr, MCAST_EXCLUDE, 0, NULL, 0);
@@ -1217,7 +1233,7 @@ void ip_mc_inc_group(struct in_device *in_dev, __be32 addr)
}
}

- im = kmalloc(sizeof(*im), GFP_KERNEL);
+ im = kzalloc(sizeof(*im), GFP_KERNEL);
if (!im)
goto out;

@@ -1227,26 +1243,18 @@ void ip_mc_inc_group(struct in_device *in_dev, __be32 addr)
im->multiaddr = addr;
/* initial mode is (EX, empty) */
im->sfmode = MCAST_EXCLUDE;
- im->sfcount[MCAST_INCLUDE] = 0;
im->sfcount[MCAST_EXCLUDE] = 1;
- im->sources = NULL;
- im->tomb = NULL;
- im->crcount = 0;
atomic_set(&im->refcnt, 1);
spin_lock_init(&im->lock);
#ifdef CONFIG_IP_MULTICAST
- im->tm_running = 0;
setup_timer(&im->timer, &igmp_timer_expire, (unsigned long)im);
im->unsolicit_count = IGMP_Unsolicited_Report_Count;
- im->reporter = 0;
- im->gsquery = 0;
#endif
- im->loaded = 0;
- write_lock_bh(&in_dev->mc_list_lock);
- im->next = in_dev->mc_list;
- in_dev->mc_list = im;
+
+ im->next_rcu = in_dev->mc_list;
in_dev->mc_count++;
- write_unlock_bh(&in_dev->mc_list_lock);
+ rcu_assign_pointer(in_dev->mc_list, im);
+
#ifdef CONFIG_IP_MULTICAST
igmpv3_del_delrec(in_dev, im->multiaddr);
#endif
@@ -1287,17 +1295,18 @@ EXPORT_SYMBOL(ip_mc_rejoin_group);

void ip_mc_dec_group(struct in_device *in_dev, __be32 addr)
{
- struct ip_mc_list *i, **ip;
+ struct ip_mc_list *i;
+ struct ip_mc_list __rcu **ip;

ASSERT_RTNL();

- for (ip=&in_dev->mc_list; (i=*ip)!=NULL; ip=&i->next) {
+ for (ip = &in_dev->mc_list;
+ (i = rtnl_dereference(*ip)) != NULL;
+ ip = &i->next_rcu) {
if (i->multiaddr == addr) {
if (--i->users == 0) {
- write_lock_bh(&in_dev->mc_list_lock);
- *ip = i->next;
+ *ip = i->next_rcu;
in_dev->mc_count--;
- write_unlock_bh(&in_dev->mc_list_lock);
igmp_group_dropped(i);

if (!in_dev->dead)
@@ -1316,34 +1325,34 @@ EXPORT_SYMBOL(ip_mc_dec_group);

void ip_mc_unmap(struct in_device *in_dev)
{
- struct ip_mc_list *i;
+ struct ip_mc_list *pmc;

ASSERT_RTNL();

- for (i = in_dev->mc_list; i; i = i->next)
- igmp_group_dropped(i);
+ for_each_pmc_rtnl(in_dev, pmc)
+ igmp_group_dropped(pmc);
}

void ip_mc_remap(struct in_device *in_dev)
{
- struct ip_mc_list *i;
+ struct ip_mc_list *pmc;

ASSERT_RTNL();

- for (i = in_dev->mc_list; i; i = i->next)
- igmp_group_added(i);
+ for_each_pmc_rtnl(in_dev, pmc)
+ igmp_group_added(pmc);
}

/* Device going down */

void ip_mc_down(struct in_device *in_dev)
{
- struct ip_mc_list *i;
+ struct ip_mc_list *pmc;

ASSERT_RTNL();

- for (i=in_dev->mc_list; i; i=i->next)
- igmp_group_dropped(i);
+ for_each_pmc_rtnl(in_dev, pmc)
+ igmp_group_dropped(pmc);

#ifdef CONFIG_IP_MULTICAST
in_dev->mr_ifc_count = 0;
@@ -1374,7 +1383,6 @@ void ip_mc_init_dev(struct in_device *in_dev)
in_dev->mr_qrv = IGMP_Unsolicited_Report_Count;
#endif

- rwlock_init(&in_dev->mc_list_lock);
spin_lock_init(&in_dev->mc_tomb_lock);
}

@@ -1382,14 +1390,14 @@ void ip_mc_init_dev(struct in_device *in_dev)

void ip_mc_up(struct in_device *in_dev)
{
- struct ip_mc_list *i;
+ struct ip_mc_list *pmc;

ASSERT_RTNL();

ip_mc_inc_group(in_dev, IGMP_ALL_HOSTS);

- for (i=in_dev->mc_list; i; i=i->next)
- igmp_group_added(i);
+ for_each_pmc_rtnl(in_dev, pmc);
+ igmp_group_added(pmc);
}

/*
@@ -1405,17 +1413,13 @@ void ip_mc_destroy_dev(struct in_device *in_dev)
/* Deactivate timers */
ip_mc_down(in_dev);

- write_lock_bh(&in_dev->mc_list_lock);
- while ((i = in_dev->mc_list) != NULL) {
- in_dev->mc_list = i->next;
+ while ((i = rtnl_dereference(in_dev->mc_list)) != NULL) {
+ in_dev->mc_list = i->next_rcu;
in_dev->mc_count--;
- write_unlock_bh(&in_dev->mc_list_lock);
+
igmp_group_dropped(i);
ip_ma_put(i);
-
- write_lock_bh(&in_dev->mc_list_lock);
}
- write_unlock_bh(&in_dev->mc_list_lock);
}

/* RTNL is locked */
@@ -1513,18 +1517,18 @@ static int ip_mc_del_src(struct in_device *in_dev, __be32 *pmca, int sfmode,

if (!in_dev)
return -ENODEV;
- read_lock(&in_dev->mc_list_lock);
- for (pmc=in_dev->mc_list; pmc; pmc=pmc->next) {
+ rcu_read_lock();
+ for_each_pmc_rcu(in_dev, pmc) {
if (*pmca == pmc->multiaddr)
break;
}
if (!pmc) {
/* MCA not found?? bug */
- read_unlock(&in_dev->mc_list_lock);
+ rcu_read_unlock();
return -ESRCH;
}
spin_lock_bh(&pmc->lock);
- read_unlock(&in_dev->mc_list_lock);
+ rcu_read_unlock();
#ifdef CONFIG_IP_MULTICAST
sf_markstate(pmc);
#endif
@@ -1685,18 +1689,18 @@ static int ip_mc_add_src(struct in_device *in_dev, __be32 *pmca, int sfmode,

if (!in_dev)
return -ENODEV;
- read_lock(&in_dev->mc_list_lock);
- for (pmc=in_dev->mc_list; pmc; pmc=pmc->next) {
+ rcu_read_lock();
+ for_each_pmc_rcu(in_dev, pmc) {
if (*pmca == pmc->multiaddr)
break;
}
if (!pmc) {
/* MCA not found?? bug */
- read_unlock(&in_dev->mc_list_lock);
+ rcu_read_unlock();
return -ESRCH;
}
spin_lock_bh(&pmc->lock);
- read_unlock(&in_dev->mc_list_lock);
+ rcu_read_unlock();

#ifdef CONFIG_IP_MULTICAST
sf_markstate(pmc);
@@ -1793,7 +1797,7 @@ int ip_mc_join_group(struct sock *sk , struct ip_mreqn *imr)

err = -EADDRINUSE;
ifindex = imr->imr_ifindex;
- for (i = inet->mc_list; i; i = i->next) {
+ for_each_pmc_rtnl(inet, i) {
if (i->multi.imr_multiaddr.s_addr == addr &&
i->multi.imr_ifindex == ifindex)
goto done;
@@ -1807,7 +1811,7 @@ int ip_mc_join_group(struct sock *sk , struct ip_mreqn *imr)
goto done;

memcpy(&iml->multi, imr, sizeof(*imr));
- iml->next = inet->mc_list;
+ iml->next_rcu = inet->mc_list;
iml->sflist = NULL;
iml->sfmode = MCAST_EXCLUDE;
rcu_assign_pointer(inet->mc_list, iml);
@@ -1821,17 +1825,14 @@ EXPORT_SYMBOL(ip_mc_join_group);

static void ip_sf_socklist_reclaim(struct rcu_head *rp)
{
- struct ip_sf_socklist *psf;
-
- psf = container_of(rp, struct ip_sf_socklist, rcu);
+ kfree(container_of(rp, struct ip_sf_socklist, rcu));
/* sk_omem_alloc should have been decreased by the caller*/
- kfree(psf);
}

static int ip_mc_leave_src(struct sock *sk, struct ip_mc_socklist *iml,
struct in_device *in_dev)
{
- struct ip_sf_socklist *psf = iml->sflist;
+ struct ip_sf_socklist *psf = rtnl_dereference(iml->sflist);
int err;

if (psf == NULL) {
@@ -1851,11 +1852,8 @@ static int ip_mc_leave_src(struct sock *sk, struct ip_mc_socklist *iml,

static void ip_mc_socklist_reclaim(struct rcu_head *rp)
{
- struct ip_mc_socklist *iml;
-
- iml = container_of(rp, struct ip_mc_socklist, rcu);
+ kfree(container_of(rp, struct ip_mc_socklist, rcu));
/* sk_omem_alloc should have been decreased by the caller*/
- kfree(iml);
}


@@ -1866,7 +1864,8 @@ static void ip_mc_socklist_reclaim(struct rcu_head *rp)
int ip_mc_leave_group(struct sock *sk, struct ip_mreqn *imr)
{
struct inet_sock *inet = inet_sk(sk);
- struct ip_mc_socklist *iml, **imlp;
+ struct ip_mc_socklist *iml;
+ struct ip_mc_socklist __rcu **imlp;
struct in_device *in_dev;
struct net *net = sock_net(sk);
__be32 group = imr->imr_multiaddr.s_addr;
@@ -1876,7 +1875,9 @@ int ip_mc_leave_group(struct sock *sk, struct ip_mreqn *imr)
rtnl_lock();
in_dev = ip_mc_find_dev(net, imr);
ifindex = imr->imr_ifindex;
- for (imlp = &inet->mc_list; (iml = *imlp) != NULL; imlp = &iml->next) {
+ for (imlp = &inet->mc_list;
+ (iml = rtnl_dereference(*imlp)) != NULL;
+ imlp = &iml->next_rcu) {
if (iml->multi.imr_multiaddr.s_addr != group)
continue;
if (ifindex) {
@@ -1888,7 +1889,7 @@ int ip_mc_leave_group(struct sock *sk, struct ip_mreqn *imr)

(void) ip_mc_leave_src(sk, iml, in_dev);

- rcu_assign_pointer(*imlp, iml->next);
+ *imlp = iml->next_rcu;

if (in_dev)
ip_mc_dec_group(in_dev, group);
@@ -1934,7 +1935,7 @@ int ip_mc_source(int add, int omode, struct sock *sk, struct
}
err = -EADDRNOTAVAIL;

- for (pmc=inet->mc_list; pmc; pmc=pmc->next) {
+ for_each_pmc_rtnl(inet, pmc) {
if ((pmc->multi.imr_multiaddr.s_addr ==
imr.imr_multiaddr.s_addr) &&
(pmc->multi.imr_ifindex == imr.imr_ifindex))
@@ -1958,7 +1959,7 @@ int ip_mc_source(int add, int omode, struct sock *sk, struct
pmc->sfmode = omode;
}

- psl = pmc->sflist;
+ psl = rtnl_dereference(pmc->sflist);
if (!add) {
if (!psl)
goto done; /* err = -EADDRNOTAVAIL */
@@ -2077,7 +2078,7 @@ int ip_mc_msfilter(struct sock *sk, struct ip_msfilter *msf, int ifindex)
goto done;
}

- for (pmc=inet->mc_list; pmc; pmc=pmc->next) {
+ for_each_pmc_rtnl(inet, pmc) {
if (pmc->multi.imr_multiaddr.s_addr == msf->imsf_multiaddr &&
pmc->multi.imr_ifindex == imr.imr_ifindex)
break;
@@ -2107,7 +2108,7 @@ int ip_mc_msfilter(struct sock *sk, struct ip_msfilter *msf, int ifindex)
(void) ip_mc_add_src(in_dev, &msf->imsf_multiaddr,
msf->imsf_fmode, 0, NULL, 0);
}
- psl = pmc->sflist;
+ psl = rtnl_dereference(pmc->sflist);
if (psl) {
(void) ip_mc_del_src(in_dev, &msf->imsf_multiaddr, pmc->sfmode,
psl->sl_count, psl->sl_addr, 0);
@@ -2155,7 +2156,7 @@ int ip_mc_msfget(struct sock *sk, struct ip_msfilter *msf,
}
err = -EADDRNOTAVAIL;

- for (pmc=inet->mc_list; pmc; pmc=pmc->next) {
+ for_each_pmc_rtnl(inet, pmc) {
if (pmc->multi.imr_multiaddr.s_addr == msf->imsf_multiaddr &&
pmc->multi.imr_ifindex == imr.imr_ifindex)
break;
@@ -2163,7 +2164,7 @@ int ip_mc_msfget(struct sock *sk, struct ip_msfilter *msf,
if (!pmc) /* must have a prior join */
goto done;
msf->imsf_fmode = pmc->sfmode;
- psl = pmc->sflist;
+ psl = rtnl_dereference(pmc->sflist);
rtnl_unlock();
if (!psl) {
len = 0;
@@ -2208,7 +2209,7 @@ int ip_mc_gsfget(struct sock *sk, struct group_filter *gsf,

err = -EADDRNOTAVAIL;

- for (pmc=inet->mc_list; pmc; pmc=pmc->next) {
+ for_each_pmc_rtnl(inet, pmc) {
if (pmc->multi.imr_multiaddr.s_addr == addr &&
pmc->multi.imr_ifindex == gsf->gf_interface)
break;
@@ -2216,7 +2217,7 @@ int ip_mc_gsfget(struct sock *sk, struct group_filter *gsf,
if (!pmc) /* must have a prior join */
goto done;
gsf->gf_fmode = pmc->sfmode;
- psl = pmc->sflist;
+ psl = rtnl_dereference(pmc->sflist);
rtnl_unlock();
count = psl ? psl->sl_count : 0;
copycount = count < gsf->gf_numsrc ? count : gsf->gf_numsrc;
@@ -2257,7 +2258,7 @@ int ip_mc_sf_allow(struct sock *sk, __be32 loc_addr, __be32 rmt_addr, int dif)
goto out;

rcu_read_lock();
- for (pmc=rcu_dereference(inet->mc_list); pmc; pmc=rcu_dereference(pmc->next)) {
+ for_each_pmc_rcu(inet, pmc) {
if (pmc->multi.imr_multiaddr.s_addr == loc_addr &&
pmc->multi.imr_ifindex == dif)
break;
@@ -2265,7 +2266,7 @@ int ip_mc_sf_allow(struct sock *sk, __be32 loc_addr, __be32 rmt_addr, int dif)
ret = inet->mc_all;
if (!pmc)
goto unlock;
- psl = pmc->sflist;
+ psl = rcu_dereference(pmc->sflist);
ret = (pmc->sfmode == MCAST_EXCLUDE);
if (!psl)
goto unlock;
@@ -2300,10 +2301,10 @@ void ip_mc_drop_socket(struct sock *sk)
return;

rtnl_lock();
- while ((iml = inet->mc_list) != NULL) {
+ while ((iml = rtnl_dereference(inet->mc_list)) != NULL) {
struct in_device *in_dev;
- rcu_assign_pointer(inet->mc_list, iml->next);

+ inet->mc_list = iml->next_rcu;
in_dev = inetdev_by_index(net, iml->multi.imr_ifindex);
(void) ip_mc_leave_src(sk, iml, in_dev);
if (in_dev != NULL) {
@@ -2323,8 +2324,8 @@ int ip_check_mc(struct in_device *in_dev, __be32 mc_addr, __be32 src_addr, u16 p
struct ip_sf_list *psf;
int rv = 0;

- read_lock(&in_dev->mc_list_lock);
- for (im=in_dev->mc_list; im; im=im->next) {
+ rcu_read_lock();
+ for_each_pmc_rcu(in_dev, im) {
if (im->multiaddr == mc_addr)
break;
}
@@ -2345,7 +2346,7 @@ int ip_check_mc(struct in_device *in_dev, __be32 mc_addr, __be32 src_addr, u16 p
} else
rv = 1; /* unspecified source; tentatively allow */
}
- read_unlock(&in_dev->mc_list_lock);
+ rcu_read_unlock();
return rv;
}

@@ -2371,13 +2372,11 @@ static inline struct ip_mc_list *igmp_mc_get_first(struct seq_file *seq)
in_dev = __in_dev_get_rcu(state->dev);
if (!in_dev)
continue;
- read_lock(&in_dev->mc_list_lock);
- im = in_dev->mc_list;
+ im = rcu_dereference(in_dev->mc_list);
if (im) {
state->in_dev = in_dev;
break;
}
- read_unlock(&in_dev->mc_list_lock);
}
return im;
}
@@ -2385,11 +2384,9 @@ static inline struct ip_mc_list *igmp_mc_get_first(struct seq_file *seq)
static struct ip_mc_list *igmp_mc_get_next(struct seq_file *seq, struct ip_mc_list *im)
{
struct igmp_mc_iter_state *state = igmp_mc_seq_private(seq);
- im = im->next;
- while (!im) {
- if (likely(state->in_dev != NULL))
- read_unlock(&state->in_dev->mc_list_lock);

+ im = rcu_dereference(im->next_rcu);
+ while (!im) {
state->dev = next_net_device_rcu(state->dev);
if (!state->dev) {
state->in_dev = NULL;
@@ -2398,8 +2395,7 @@ static struct ip_mc_list *igmp_mc_get_next(struct seq_file *seq, struct ip_mc_li
state->in_dev = __in_dev_get_rcu(state->dev);
if (!state->in_dev)
continue;
- read_lock(&state->in_dev->mc_list_lock);
- im = state->in_dev->mc_list;
+ im = rcu_dereference(state->in_dev->mc_list);
}
return im;
}
@@ -2435,10 +2431,8 @@ static void igmp_mc_seq_stop(struct seq_file *seq, void *v)
__releases(rcu)
{
struct igmp_mc_iter_state *state = igmp_mc_seq_private(seq);
- if (likely(state->in_dev != NULL)) {
- read_unlock(&state->in_dev->mc_list_lock);
- state->in_dev = NULL;
- }
+
+ state->in_dev = NULL;
state->dev = NULL;
rcu_read_unlock();
}
@@ -2460,7 +2454,7 @@ static int igmp_mc_seq_show(struct seq_file *seq, void *v)
querier = "NONE";
#endif

- if (state->in_dev->mc_list == im) {
+ if (rcu_dereference(state->in_dev->mc_list) == im) {
seq_printf(seq, "%d\t%-10s: %5d %7s\n",
state->dev->ifindex, state->dev->name, state->in_dev->mc_count, querier);
}
@@ -2519,8 +2513,7 @@ static inline struct ip_sf_list *igmp_mcf_get_first(struct seq_file *seq)
idev = __in_dev_get_rcu(state->dev);
if (unlikely(idev == NULL))
continue;
- read_lock(&idev->mc_list_lock);
- im = idev->mc_list;
+ im = rcu_dereference(idev->mc_list);
if (likely(im != NULL)) {
spin_lock_bh(&im->lock);
psf = im->sources;
@@ -2531,7 +2524,6 @@ static inline struct ip_sf_list *igmp_mcf_get_first(struct seq_file *seq)
}
spin_unlock_bh(&im->lock);
}
- read_unlock(&idev->mc_list_lock);
}
return psf;
}
@@ -2545,9 +2537,6 @@ static struct ip_sf_list *igmp_mcf_get_next(struct seq_file *seq, struct ip_sf_l
spin_unlock_bh(&state->im->lock);
state->im = state->im->next;
while (!state->im) {
- if (likely(state->idev != NULL))
- read_unlock(&state->idev->mc_list_lock);
-
state->dev = next_net_device_rcu(state->dev);
if (!state->dev) {
state->idev = NULL;
@@ -2556,8 +2545,7 @@ static struct ip_sf_list *igmp_mcf_get_next(struct seq_file *seq, struct ip_sf_l
state->idev = __in_dev_get_rcu(state->dev);
if (!state->idev)
continue;
- read_lock(&state->idev->mc_list_lock);
- state->im = state->idev->mc_list;
+ state->im = rcu_dereference(state->idev->mc_list);
}
if (!state->im)
break;
@@ -2603,10 +2591,7 @@ static void igmp_mcf_seq_stop(struct seq_file *seq, void *v)
spin_unlock_bh(&state->im->lock);
state->im = NULL;
}
- if (likely(state->idev != NULL)) {
- read_unlock(&state->idev->mc_list_lock);
- state->idev = NULL;
- }
+ state->idev = NULL;
state->dev = NULL;
rcu_read_unlock();
}

2010-11-12 14:26:26

by Eric Dumazet

[permalink] [raw]
Subject: Re: [PATCH net-next-2.6] igmp: RCU conversion of in_dev->mc_list

Le vendredi 12 novembre 2010 à 14:34 +0100, Eric Dumazet a écrit :
> Le vendredi 12 novembre 2010 à 10:22 +0100, Eric Dumazet a écrit :
> > Le vendredi 12 novembre 2010 à 16:19 +0800, Américo Wang a écrit :
> > > On Fri, Nov 12, 2010 at 08:27:54AM +0100, Eric Dumazet wrote:
> >
> > > >A RCU conversion is far more complex.
> > > >
> > >
> > > Yup.
> >
> >
> > Well, actually this is easy in this case.
> >
> > I'll post a patch to do this RCU conversion.
> >
> >
>
> Note : compile tested only, I'll appreciate if someone can test it ;)
>
> Note: one patch from net-2.6 is not yet included in net-next-2.6, so
> please make sure you have it before testing ;)
>
> ( http://git.kernel.org/?p=linux/kernel/git/davem/net-2.6.git;a=commitdiff;h=18943d292facbc70e6a36fc62399ae833f64671b )
>
>
> Thanks
>
> [PATCH net-next-2.6] igmp: RCU conversion of in_dev->mc_list
>
> in_dev->mc_list is protected by one rwlock (in_dev->mc_list_lock).
>
> This can easily be converted to a RCU protection.
>
> Writers hold RTNL, so mc_list_lock is removed, not replaced by a
> spinlock.
>
> Signed-off-by: Eric Dumazet <[email protected]>
> Cc: Cypher Wu <[email protected]>
> Cc: Américo Wang <[email protected]>
> ---

...

> void ip_mc_up(struct in_device *in_dev)
> {
> - struct ip_mc_list *i;
> + struct ip_mc_list *pmc;
>
> ASSERT_RTNL();
>
> ip_mc_inc_group(in_dev, IGMP_ALL_HOSTS);
>
> - for (i=in_dev->mc_list; i; i=i->next)
> - igmp_group_added(i);
> + for_each_pmc_rtnl(in_dev, pmc);
> + igmp_group_added(pmc);
> }


Oops there is an extra ; after the for_each_pmc_rtnl(in_dev, pmc)

should be

for_each_pmc_rtnl(in_dev, pmc)
igmp_group_added(pmc);


2010-11-12 15:46:58

by Eric Dumazet

[permalink] [raw]
Subject: [PATCH net-next-2.6 V2] igmp: RCU conversion of in_dev->mc_list

Le vendredi 12 novembre 2010 à 15:26 +0100, Eric Dumazet a écrit :
> Le vendredi 12 novembre 2010 à 14:34 +0100, Eric Dumazet a écrit :
> > Le vendredi 12 novembre 2010 à 10:22 +0100, Eric Dumazet a écrit :
> > > Le vendredi 12 novembre 2010 à 16:19 +0800, Américo Wang a écrit :
> > > > On Fri, Nov 12, 2010 at 08:27:54AM +0100, Eric Dumazet wrote:
> > >
> > > > >A RCU conversion is far more complex.
> > > > >
> > > >
> > > > Yup.
> > >
> > >
> > > Well, actually this is easy in this case.
> > >
> > > I'll post a patch to do this RCU conversion.
> > >
> > >
> >
> > Note : compile tested only, I'll appreciate if someone can test it ;)
> >
> > Note: one patch from net-2.6 is not yet included in net-next-2.6, so
> > please make sure you have it before testing ;)
> >
> > ( http://git.kernel.org/?p=linux/kernel/git/davem/net-2.6.git;a=commitdiff;h=18943d292facbc70e6a36fc62399ae833f64671b )
> >
> >
> > Thanks
> >
> > [PATCH net-next-2.6] igmp: RCU conversion of in_dev->mc_list
> >
> > in_dev->mc_list is protected by one rwlock (in_dev->mc_list_lock).
> >
> > This can easily be converted to a RCU protection.
> >
> > Writers hold RTNL, so mc_list_lock is removed, not replaced by a
> > spinlock.
> >
> > Signed-off-by: Eric Dumazet <[email protected]>
> > Cc: Cypher Wu <[email protected]>
> > Cc: Américo Wang <[email protected]>
> > ---
>
> ...
>
> > void ip_mc_up(struct in_device *in_dev)
> > {
> > - struct ip_mc_list *i;
> > + struct ip_mc_list *pmc;
> >
> > ASSERT_RTNL();
> >
> > ip_mc_inc_group(in_dev, IGMP_ALL_HOSTS);
> >
> > - for (i=in_dev->mc_list; i; i=i->next)
> > - igmp_group_added(i);
> > + for_each_pmc_rtnl(in_dev, pmc);
> > + igmp_group_added(pmc);
> > }
>
>
> Oops there is an extra ; after the for_each_pmc_rtnl(in_dev, pmc)
>
> should be
>
> for_each_pmc_rtnl(in_dev, pmc)
> igmp_group_added(pmc);
>
>

I confirm that with the above fix, the patch works for me.
(Tested with CONFIG_PROVE_RCU=y)

Here is an updated version.

[PATCH net-next-2.6 V2] igmp: RCU conversion of in_dev->mc_list

in_dev->mc_list is protected by one rwlock (in_dev->mc_list_lock).

This can easily be converted to a RCU protection.

Writers hold RTNL, so mc_list_lock is removed, not replaced by a
spinlock.

Signed-off-by: Eric Dumazet <[email protected]>
Cc: Cypher Wu <[email protected]>
Cc: Américo Wang <[email protected]>
---
include/linux/igmp.h | 12 +
include/linux/inetdevice.h | 5
include/net/inet_sock.h | 2
net/ipv4/igmp.c | 223 ++++++++++++++++-------------------
4 files changed, 115 insertions(+), 127 deletions(-)

diff --git a/include/linux/igmp.h b/include/linux/igmp.h
index 93fc244..7d16467 100644
--- a/include/linux/igmp.h
+++ b/include/linux/igmp.h
@@ -167,10 +167,10 @@ struct ip_sf_socklist {
*/

struct ip_mc_socklist {
- struct ip_mc_socklist *next;
+ struct ip_mc_socklist __rcu *next_rcu;
struct ip_mreqn multi;
unsigned int sfmode; /* MCAST_{INCLUDE,EXCLUDE} */
- struct ip_sf_socklist *sflist;
+ struct ip_sf_socklist __rcu *sflist;
struct rcu_head rcu;
};

@@ -186,11 +186,14 @@ struct ip_sf_list {
struct ip_mc_list {
struct in_device *interface;
__be32 multiaddr;
+ unsigned int sfmode;
struct ip_sf_list *sources;
struct ip_sf_list *tomb;
- unsigned int sfmode;
unsigned long sfcount[2];
- struct ip_mc_list *next;
+ union {
+ struct ip_mc_list *next;
+ struct ip_mc_list __rcu *next_rcu;
+ };
struct timer_list timer;
int users;
atomic_t refcnt;
@@ -201,6 +204,7 @@ struct ip_mc_list {
char loaded;
unsigned char gsquery; /* check source marks? */
unsigned char crcount;
+ struct rcu_head rcu;
};

/* V3 exponential field decoding */
diff --git a/include/linux/inetdevice.h b/include/linux/inetdevice.h
index ccd5b07..380ba6b 100644
--- a/include/linux/inetdevice.h
+++ b/include/linux/inetdevice.h
@@ -52,9 +52,8 @@ struct in_device {
atomic_t refcnt;
int dead;
struct in_ifaddr *ifa_list; /* IP ifaddr chain */
- rwlock_t mc_list_lock;
- struct ip_mc_list *mc_list; /* IP multicast filter chain */
- int mc_count; /* Number of installed mcasts */
+ struct ip_mc_list __rcu *mc_list; /* IP multicast filter chain */
+ int mc_count; /* Number of installed mcasts */
spinlock_t mc_tomb_lock;
struct ip_mc_list *mc_tomb;
unsigned long mr_v1_seen;
diff --git a/include/net/inet_sock.h b/include/net/inet_sock.h
index 1989cfd..8945f9f 100644
--- a/include/net/inet_sock.h
+++ b/include/net/inet_sock.h
@@ -141,7 +141,7 @@ struct inet_sock {
nodefrag:1;
int mc_index;
__be32 mc_addr;
- struct ip_mc_socklist *mc_list;
+ struct ip_mc_socklist __rcu *mc_list;
struct {
unsigned int flags;
unsigned int fragsize;
diff --git a/net/ipv4/igmp.c b/net/ipv4/igmp.c
index 08d0d81..6f49d6c 100644
--- a/net/ipv4/igmp.c
+++ b/net/ipv4/igmp.c
@@ -149,11 +149,17 @@ static void ip_mc_clear_src(struct ip_mc_list *pmc);
static int ip_mc_add_src(struct in_device *in_dev, __be32 *pmca, int sfmode,
int sfcount, __be32 *psfsrc, int delta);

+
+static void ip_mc_list_reclaim(struct rcu_head *head)
+{
+ kfree(container_of(head, struct ip_mc_list, rcu));
+}
+
static void ip_ma_put(struct ip_mc_list *im)
{
if (atomic_dec_and_test(&im->refcnt)) {
in_dev_put(im->interface);
- kfree(im);
+ call_rcu(&im->rcu, ip_mc_list_reclaim);
}
}

@@ -163,7 +169,7 @@ static void ip_ma_put(struct ip_mc_list *im)
* Timer management
*/

-static __inline__ void igmp_stop_timer(struct ip_mc_list *im)
+static void igmp_stop_timer(struct ip_mc_list *im)
{
spin_lock_bh(&im->lock);
if (del_timer(&im->timer))
@@ -496,14 +502,24 @@ empty_source:
return skb;
}

+#define for_each_pmc_rcu(in_dev, pmc) \
+ for (pmc = rcu_dereference(in_dev->mc_list); \
+ pmc != NULL; \
+ pmc = rcu_dereference(pmc->next_rcu))
+
+#define for_each_pmc_rtnl(in_dev, pmc) \
+ for (pmc = rtnl_dereference(in_dev->mc_list); \
+ pmc != NULL; \
+ pmc = rtnl_dereference(pmc->next_rcu))
+
static int igmpv3_send_report(struct in_device *in_dev, struct ip_mc_list *pmc)
{
struct sk_buff *skb = NULL;
int type;

if (!pmc) {
- read_lock(&in_dev->mc_list_lock);
- for (pmc=in_dev->mc_list; pmc; pmc=pmc->next) {
+ rcu_read_lock();
+ for_each_pmc_rcu(in_dev, pmc) {
if (pmc->multiaddr == IGMP_ALL_HOSTS)
continue;
spin_lock_bh(&pmc->lock);
@@ -514,7 +530,7 @@ static int igmpv3_send_report(struct in_device *in_dev, struct ip_mc_list *pmc)
skb = add_grec(skb, pmc, type, 0, 0);
spin_unlock_bh(&pmc->lock);
}
- read_unlock(&in_dev->mc_list_lock);
+ rcu_read_unlock();
} else {
spin_lock_bh(&pmc->lock);
if (pmc->sfcount[MCAST_EXCLUDE])
@@ -556,7 +572,7 @@ static void igmpv3_send_cr(struct in_device *in_dev)
struct sk_buff *skb = NULL;
int type, dtype;

- read_lock(&in_dev->mc_list_lock);
+ rcu_read_lock();
spin_lock_bh(&in_dev->mc_tomb_lock);

/* deleted MCA's */
@@ -593,7 +609,7 @@ static void igmpv3_send_cr(struct in_device *in_dev)
spin_unlock_bh(&in_dev->mc_tomb_lock);

/* change recs */
- for (pmc=in_dev->mc_list; pmc; pmc=pmc->next) {
+ for_each_pmc_rcu(in_dev, pmc) {
spin_lock_bh(&pmc->lock);
if (pmc->sfcount[MCAST_EXCLUDE]) {
type = IGMPV3_BLOCK_OLD_SOURCES;
@@ -616,7 +632,7 @@ static void igmpv3_send_cr(struct in_device *in_dev)
}
spin_unlock_bh(&pmc->lock);
}
- read_unlock(&in_dev->mc_list_lock);
+ rcu_read_unlock();

if (!skb)
return;
@@ -813,14 +829,14 @@ static void igmp_heard_report(struct in_device *in_dev, __be32 group)
if (group == IGMP_ALL_HOSTS)
return;

- read_lock(&in_dev->mc_list_lock);
- for (im=in_dev->mc_list; im!=NULL; im=im->next) {
+ rcu_read_lock();
+ for_each_pmc_rcu(in_dev, im) {
if (im->multiaddr == group) {
igmp_stop_timer(im);
break;
}
}
- read_unlock(&in_dev->mc_list_lock);
+ rcu_read_unlock();
}

static void igmp_heard_query(struct in_device *in_dev, struct sk_buff *skb,
@@ -906,8 +922,8 @@ static void igmp_heard_query(struct in_device *in_dev, struct sk_buff *skb,
* - Use the igmp->igmp_code field as the maximum
* delay possible
*/
- read_lock(&in_dev->mc_list_lock);
- for (im=in_dev->mc_list; im!=NULL; im=im->next) {
+ rcu_read_lock();
+ for_each_pmc_rcu(in_dev, im) {
int changed;

if (group && group != im->multiaddr)
@@ -925,7 +941,7 @@ static void igmp_heard_query(struct in_device *in_dev, struct sk_buff *skb,
if (changed)
igmp_mod_timer(im, max_delay);
}
- read_unlock(&in_dev->mc_list_lock);
+ rcu_read_unlock();
}

/* called in rcu_read_lock() section */
@@ -1110,8 +1126,8 @@ static void igmpv3_clear_delrec(struct in_device *in_dev)
kfree(pmc);
}
/* clear dead sources, too */
- read_lock(&in_dev->mc_list_lock);
- for (pmc=in_dev->mc_list; pmc; pmc=pmc->next) {
+ rcu_read_lock();
+ for_each_pmc_rcu(in_dev, pmc) {
struct ip_sf_list *psf, *psf_next;

spin_lock_bh(&pmc->lock);
@@ -1123,7 +1139,7 @@ static void igmpv3_clear_delrec(struct in_device *in_dev)
kfree(psf);
}
}
- read_unlock(&in_dev->mc_list_lock);
+ rcu_read_unlock();
}
#endif

@@ -1209,7 +1225,7 @@ void ip_mc_inc_group(struct in_device *in_dev, __be32 addr)

ASSERT_RTNL();

- for (im=in_dev->mc_list; im; im=im->next) {
+ for_each_pmc_rtnl(in_dev, im) {
if (im->multiaddr == addr) {
im->users++;
ip_mc_add_src(in_dev, &addr, MCAST_EXCLUDE, 0, NULL, 0);
@@ -1217,7 +1233,7 @@ void ip_mc_inc_group(struct in_device *in_dev, __be32 addr)
}
}

- im = kmalloc(sizeof(*im), GFP_KERNEL);
+ im = kzalloc(sizeof(*im), GFP_KERNEL);
if (!im)
goto out;

@@ -1227,26 +1243,18 @@ void ip_mc_inc_group(struct in_device *in_dev, __be32 addr)
im->multiaddr = addr;
/* initial mode is (EX, empty) */
im->sfmode = MCAST_EXCLUDE;
- im->sfcount[MCAST_INCLUDE] = 0;
im->sfcount[MCAST_EXCLUDE] = 1;
- im->sources = NULL;
- im->tomb = NULL;
- im->crcount = 0;
atomic_set(&im->refcnt, 1);
spin_lock_init(&im->lock);
#ifdef CONFIG_IP_MULTICAST
- im->tm_running = 0;
setup_timer(&im->timer, &igmp_timer_expire, (unsigned long)im);
im->unsolicit_count = IGMP_Unsolicited_Report_Count;
- im->reporter = 0;
- im->gsquery = 0;
#endif
- im->loaded = 0;
- write_lock_bh(&in_dev->mc_list_lock);
- im->next = in_dev->mc_list;
- in_dev->mc_list = im;
+
+ im->next_rcu = in_dev->mc_list;
in_dev->mc_count++;
- write_unlock_bh(&in_dev->mc_list_lock);
+ rcu_assign_pointer(in_dev->mc_list, im);
+
#ifdef CONFIG_IP_MULTICAST
igmpv3_del_delrec(in_dev, im->multiaddr);
#endif
@@ -1287,17 +1295,18 @@ EXPORT_SYMBOL(ip_mc_rejoin_group);

void ip_mc_dec_group(struct in_device *in_dev, __be32 addr)
{
- struct ip_mc_list *i, **ip;
+ struct ip_mc_list *i;
+ struct ip_mc_list __rcu **ip;

ASSERT_RTNL();

- for (ip=&in_dev->mc_list; (i=*ip)!=NULL; ip=&i->next) {
+ for (ip = &in_dev->mc_list;
+ (i = rtnl_dereference(*ip)) != NULL;
+ ip = &i->next_rcu) {
if (i->multiaddr == addr) {
if (--i->users == 0) {
- write_lock_bh(&in_dev->mc_list_lock);
- *ip = i->next;
+ *ip = i->next_rcu;
in_dev->mc_count--;
- write_unlock_bh(&in_dev->mc_list_lock);
igmp_group_dropped(i);

if (!in_dev->dead)
@@ -1316,34 +1325,34 @@ EXPORT_SYMBOL(ip_mc_dec_group);

void ip_mc_unmap(struct in_device *in_dev)
{
- struct ip_mc_list *i;
+ struct ip_mc_list *pmc;

ASSERT_RTNL();

- for (i = in_dev->mc_list; i; i = i->next)
- igmp_group_dropped(i);
+ for_each_pmc_rtnl(in_dev, pmc)
+ igmp_group_dropped(pmc);
}

void ip_mc_remap(struct in_device *in_dev)
{
- struct ip_mc_list *i;
+ struct ip_mc_list *pmc;

ASSERT_RTNL();

- for (i = in_dev->mc_list; i; i = i->next)
- igmp_group_added(i);
+ for_each_pmc_rtnl(in_dev, pmc)
+ igmp_group_added(pmc);
}

/* Device going down */

void ip_mc_down(struct in_device *in_dev)
{
- struct ip_mc_list *i;
+ struct ip_mc_list *pmc;

ASSERT_RTNL();

- for (i=in_dev->mc_list; i; i=i->next)
- igmp_group_dropped(i);
+ for_each_pmc_rtnl(in_dev, pmc)
+ igmp_group_dropped(pmc);

#ifdef CONFIG_IP_MULTICAST
in_dev->mr_ifc_count = 0;
@@ -1374,7 +1383,6 @@ void ip_mc_init_dev(struct in_device *in_dev)
in_dev->mr_qrv = IGMP_Unsolicited_Report_Count;
#endif

- rwlock_init(&in_dev->mc_list_lock);
spin_lock_init(&in_dev->mc_tomb_lock);
}

@@ -1382,14 +1390,14 @@ void ip_mc_init_dev(struct in_device *in_dev)

void ip_mc_up(struct in_device *in_dev)
{
- struct ip_mc_list *i;
+ struct ip_mc_list *pmc;

ASSERT_RTNL();

ip_mc_inc_group(in_dev, IGMP_ALL_HOSTS);

- for (i=in_dev->mc_list; i; i=i->next)
- igmp_group_added(i);
+ for_each_pmc_rtnl(in_dev, pmc)
+ igmp_group_added(pmc);
}

/*
@@ -1405,17 +1413,13 @@ void ip_mc_destroy_dev(struct in_device *in_dev)
/* Deactivate timers */
ip_mc_down(in_dev);

- write_lock_bh(&in_dev->mc_list_lock);
- while ((i = in_dev->mc_list) != NULL) {
- in_dev->mc_list = i->next;
+ while ((i = rtnl_dereference(in_dev->mc_list)) != NULL) {
+ in_dev->mc_list = i->next_rcu;
in_dev->mc_count--;
- write_unlock_bh(&in_dev->mc_list_lock);
+
igmp_group_dropped(i);
ip_ma_put(i);
-
- write_lock_bh(&in_dev->mc_list_lock);
}
- write_unlock_bh(&in_dev->mc_list_lock);
}

/* RTNL is locked */
@@ -1513,18 +1517,18 @@ static int ip_mc_del_src(struct in_device *in_dev, __be32 *pmca, int sfmode,

if (!in_dev)
return -ENODEV;
- read_lock(&in_dev->mc_list_lock);
- for (pmc=in_dev->mc_list; pmc; pmc=pmc->next) {
+ rcu_read_lock();
+ for_each_pmc_rcu(in_dev, pmc) {
if (*pmca == pmc->multiaddr)
break;
}
if (!pmc) {
/* MCA not found?? bug */
- read_unlock(&in_dev->mc_list_lock);
+ rcu_read_unlock();
return -ESRCH;
}
spin_lock_bh(&pmc->lock);
- read_unlock(&in_dev->mc_list_lock);
+ rcu_read_unlock();
#ifdef CONFIG_IP_MULTICAST
sf_markstate(pmc);
#endif
@@ -1685,18 +1689,18 @@ static int ip_mc_add_src(struct in_device *in_dev, __be32 *pmca, int sfmode,

if (!in_dev)
return -ENODEV;
- read_lock(&in_dev->mc_list_lock);
- for (pmc=in_dev->mc_list; pmc; pmc=pmc->next) {
+ rcu_read_lock();
+ for_each_pmc_rcu(in_dev, pmc) {
if (*pmca == pmc->multiaddr)
break;
}
if (!pmc) {
/* MCA not found?? bug */
- read_unlock(&in_dev->mc_list_lock);
+ rcu_read_unlock();
return -ESRCH;
}
spin_lock_bh(&pmc->lock);
- read_unlock(&in_dev->mc_list_lock);
+ rcu_read_unlock();

#ifdef CONFIG_IP_MULTICAST
sf_markstate(pmc);
@@ -1793,7 +1797,7 @@ int ip_mc_join_group(struct sock *sk , struct ip_mreqn *imr)

err = -EADDRINUSE;
ifindex = imr->imr_ifindex;
- for (i = inet->mc_list; i; i = i->next) {
+ for_each_pmc_rtnl(inet, i) {
if (i->multi.imr_multiaddr.s_addr == addr &&
i->multi.imr_ifindex == ifindex)
goto done;
@@ -1807,7 +1811,7 @@ int ip_mc_join_group(struct sock *sk , struct ip_mreqn *imr)
goto done;

memcpy(&iml->multi, imr, sizeof(*imr));
- iml->next = inet->mc_list;
+ iml->next_rcu = inet->mc_list;
iml->sflist = NULL;
iml->sfmode = MCAST_EXCLUDE;
rcu_assign_pointer(inet->mc_list, iml);
@@ -1821,17 +1825,14 @@ EXPORT_SYMBOL(ip_mc_join_group);

static void ip_sf_socklist_reclaim(struct rcu_head *rp)
{
- struct ip_sf_socklist *psf;
-
- psf = container_of(rp, struct ip_sf_socklist, rcu);
+ kfree(container_of(rp, struct ip_sf_socklist, rcu));
/* sk_omem_alloc should have been decreased by the caller*/
- kfree(psf);
}

static int ip_mc_leave_src(struct sock *sk, struct ip_mc_socklist *iml,
struct in_device *in_dev)
{
- struct ip_sf_socklist *psf = iml->sflist;
+ struct ip_sf_socklist *psf = rtnl_dereference(iml->sflist);
int err;

if (psf == NULL) {
@@ -1851,11 +1852,8 @@ static int ip_mc_leave_src(struct sock *sk, struct ip_mc_socklist *iml,

static void ip_mc_socklist_reclaim(struct rcu_head *rp)
{
- struct ip_mc_socklist *iml;
-
- iml = container_of(rp, struct ip_mc_socklist, rcu);
+ kfree(container_of(rp, struct ip_mc_socklist, rcu));
/* sk_omem_alloc should have been decreased by the caller*/
- kfree(iml);
}


@@ -1866,7 +1864,8 @@ static void ip_mc_socklist_reclaim(struct rcu_head *rp)
int ip_mc_leave_group(struct sock *sk, struct ip_mreqn *imr)
{
struct inet_sock *inet = inet_sk(sk);
- struct ip_mc_socklist *iml, **imlp;
+ struct ip_mc_socklist *iml;
+ struct ip_mc_socklist __rcu **imlp;
struct in_device *in_dev;
struct net *net = sock_net(sk);
__be32 group = imr->imr_multiaddr.s_addr;
@@ -1876,7 +1875,9 @@ int ip_mc_leave_group(struct sock *sk, struct ip_mreqn *imr)
rtnl_lock();
in_dev = ip_mc_find_dev(net, imr);
ifindex = imr->imr_ifindex;
- for (imlp = &inet->mc_list; (iml = *imlp) != NULL; imlp = &iml->next) {
+ for (imlp = &inet->mc_list;
+ (iml = rtnl_dereference(*imlp)) != NULL;
+ imlp = &iml->next_rcu) {
if (iml->multi.imr_multiaddr.s_addr != group)
continue;
if (ifindex) {
@@ -1888,7 +1889,7 @@ int ip_mc_leave_group(struct sock *sk, struct ip_mreqn *imr)

(void) ip_mc_leave_src(sk, iml, in_dev);

- rcu_assign_pointer(*imlp, iml->next);
+ *imlp = iml->next_rcu;

if (in_dev)
ip_mc_dec_group(in_dev, group);
@@ -1934,7 +1935,7 @@ int ip_mc_source(int add, int omode, struct sock *sk, struct
}
err = -EADDRNOTAVAIL;

- for (pmc=inet->mc_list; pmc; pmc=pmc->next) {
+ for_each_pmc_rtnl(inet, pmc) {
if ((pmc->multi.imr_multiaddr.s_addr ==
imr.imr_multiaddr.s_addr) &&
(pmc->multi.imr_ifindex == imr.imr_ifindex))
@@ -1958,7 +1959,7 @@ int ip_mc_source(int add, int omode, struct sock *sk, struct
pmc->sfmode = omode;
}

- psl = pmc->sflist;
+ psl = rtnl_dereference(pmc->sflist);
if (!add) {
if (!psl)
goto done; /* err = -EADDRNOTAVAIL */
@@ -2077,7 +2078,7 @@ int ip_mc_msfilter(struct sock *sk, struct ip_msfilter *msf, int ifindex)
goto done;
}

- for (pmc=inet->mc_list; pmc; pmc=pmc->next) {
+ for_each_pmc_rtnl(inet, pmc) {
if (pmc->multi.imr_multiaddr.s_addr == msf->imsf_multiaddr &&
pmc->multi.imr_ifindex == imr.imr_ifindex)
break;
@@ -2107,7 +2108,7 @@ int ip_mc_msfilter(struct sock *sk, struct ip_msfilter *msf, int ifindex)
(void) ip_mc_add_src(in_dev, &msf->imsf_multiaddr,
msf->imsf_fmode, 0, NULL, 0);
}
- psl = pmc->sflist;
+ psl = rtnl_dereference(pmc->sflist);
if (psl) {
(void) ip_mc_del_src(in_dev, &msf->imsf_multiaddr, pmc->sfmode,
psl->sl_count, psl->sl_addr, 0);
@@ -2155,7 +2156,7 @@ int ip_mc_msfget(struct sock *sk, struct ip_msfilter *msf,
}
err = -EADDRNOTAVAIL;

- for (pmc=inet->mc_list; pmc; pmc=pmc->next) {
+ for_each_pmc_rtnl(inet, pmc) {
if (pmc->multi.imr_multiaddr.s_addr == msf->imsf_multiaddr &&
pmc->multi.imr_ifindex == imr.imr_ifindex)
break;
@@ -2163,7 +2164,7 @@ int ip_mc_msfget(struct sock *sk, struct ip_msfilter *msf,
if (!pmc) /* must have a prior join */
goto done;
msf->imsf_fmode = pmc->sfmode;
- psl = pmc->sflist;
+ psl = rtnl_dereference(pmc->sflist);
rtnl_unlock();
if (!psl) {
len = 0;
@@ -2208,7 +2209,7 @@ int ip_mc_gsfget(struct sock *sk, struct group_filter *gsf,

err = -EADDRNOTAVAIL;

- for (pmc=inet->mc_list; pmc; pmc=pmc->next) {
+ for_each_pmc_rtnl(inet, pmc) {
if (pmc->multi.imr_multiaddr.s_addr == addr &&
pmc->multi.imr_ifindex == gsf->gf_interface)
break;
@@ -2216,7 +2217,7 @@ int ip_mc_gsfget(struct sock *sk, struct group_filter *gsf,
if (!pmc) /* must have a prior join */
goto done;
gsf->gf_fmode = pmc->sfmode;
- psl = pmc->sflist;
+ psl = rtnl_dereference(pmc->sflist);
rtnl_unlock();
count = psl ? psl->sl_count : 0;
copycount = count < gsf->gf_numsrc ? count : gsf->gf_numsrc;
@@ -2257,7 +2258,7 @@ int ip_mc_sf_allow(struct sock *sk, __be32 loc_addr, __be32 rmt_addr, int dif)
goto out;

rcu_read_lock();
- for (pmc=rcu_dereference(inet->mc_list); pmc; pmc=rcu_dereference(pmc->next)) {
+ for_each_pmc_rcu(inet, pmc) {
if (pmc->multi.imr_multiaddr.s_addr == loc_addr &&
pmc->multi.imr_ifindex == dif)
break;
@@ -2265,7 +2266,7 @@ int ip_mc_sf_allow(struct sock *sk, __be32 loc_addr, __be32 rmt_addr, int dif)
ret = inet->mc_all;
if (!pmc)
goto unlock;
- psl = pmc->sflist;
+ psl = rcu_dereference(pmc->sflist);
ret = (pmc->sfmode == MCAST_EXCLUDE);
if (!psl)
goto unlock;
@@ -2300,10 +2301,10 @@ void ip_mc_drop_socket(struct sock *sk)
return;

rtnl_lock();
- while ((iml = inet->mc_list) != NULL) {
+ while ((iml = rtnl_dereference(inet->mc_list)) != NULL) {
struct in_device *in_dev;
- rcu_assign_pointer(inet->mc_list, iml->next);

+ inet->mc_list = iml->next_rcu;
in_dev = inetdev_by_index(net, iml->multi.imr_ifindex);
(void) ip_mc_leave_src(sk, iml, in_dev);
if (in_dev != NULL) {
@@ -2323,8 +2324,8 @@ int ip_check_mc(struct in_device *in_dev, __be32 mc_addr, __be32 src_addr, u16 p
struct ip_sf_list *psf;
int rv = 0;

- read_lock(&in_dev->mc_list_lock);
- for (im=in_dev->mc_list; im; im=im->next) {
+ rcu_read_lock();
+ for_each_pmc_rcu(in_dev, im) {
if (im->multiaddr == mc_addr)
break;
}
@@ -2345,7 +2346,7 @@ int ip_check_mc(struct in_device *in_dev, __be32 mc_addr, __be32 src_addr, u16 p
} else
rv = 1; /* unspecified source; tentatively allow */
}
- read_unlock(&in_dev->mc_list_lock);
+ rcu_read_unlock();
return rv;
}

@@ -2371,13 +2372,11 @@ static inline struct ip_mc_list *igmp_mc_get_first(struct seq_file *seq)
in_dev = __in_dev_get_rcu(state->dev);
if (!in_dev)
continue;
- read_lock(&in_dev->mc_list_lock);
- im = in_dev->mc_list;
+ im = rcu_dereference(in_dev->mc_list);
if (im) {
state->in_dev = in_dev;
break;
}
- read_unlock(&in_dev->mc_list_lock);
}
return im;
}
@@ -2385,11 +2384,9 @@ static inline struct ip_mc_list *igmp_mc_get_first(struct seq_file *seq)
static struct ip_mc_list *igmp_mc_get_next(struct seq_file *seq, struct ip_mc_list *im)
{
struct igmp_mc_iter_state *state = igmp_mc_seq_private(seq);
- im = im->next;
- while (!im) {
- if (likely(state->in_dev != NULL))
- read_unlock(&state->in_dev->mc_list_lock);

+ im = rcu_dereference(im->next_rcu);
+ while (!im) {
state->dev = next_net_device_rcu(state->dev);
if (!state->dev) {
state->in_dev = NULL;
@@ -2398,8 +2395,7 @@ static struct ip_mc_list *igmp_mc_get_next(struct seq_file *seq, struct ip_mc_li
state->in_dev = __in_dev_get_rcu(state->dev);
if (!state->in_dev)
continue;
- read_lock(&state->in_dev->mc_list_lock);
- im = state->in_dev->mc_list;
+ im = rcu_dereference(state->in_dev->mc_list);
}
return im;
}
@@ -2435,10 +2431,8 @@ static void igmp_mc_seq_stop(struct seq_file *seq, void *v)
__releases(rcu)
{
struct igmp_mc_iter_state *state = igmp_mc_seq_private(seq);
- if (likely(state->in_dev != NULL)) {
- read_unlock(&state->in_dev->mc_list_lock);
- state->in_dev = NULL;
- }
+
+ state->in_dev = NULL;
state->dev = NULL;
rcu_read_unlock();
}
@@ -2460,7 +2454,7 @@ static int igmp_mc_seq_show(struct seq_file *seq, void *v)
querier = "NONE";
#endif

- if (state->in_dev->mc_list == im) {
+ if (rcu_dereference(state->in_dev->mc_list) == im) {
seq_printf(seq, "%d\t%-10s: %5d %7s\n",
state->dev->ifindex, state->dev->name, state->in_dev->mc_count, querier);
}
@@ -2519,8 +2513,7 @@ static inline struct ip_sf_list *igmp_mcf_get_first(struct seq_file *seq)
idev = __in_dev_get_rcu(state->dev);
if (unlikely(idev == NULL))
continue;
- read_lock(&idev->mc_list_lock);
- im = idev->mc_list;
+ im = rcu_dereference(idev->mc_list);
if (likely(im != NULL)) {
spin_lock_bh(&im->lock);
psf = im->sources;
@@ -2531,7 +2524,6 @@ static inline struct ip_sf_list *igmp_mcf_get_first(struct seq_file *seq)
}
spin_unlock_bh(&im->lock);
}
- read_unlock(&idev->mc_list_lock);
}
return psf;
}
@@ -2545,9 +2537,6 @@ static struct ip_sf_list *igmp_mcf_get_next(struct seq_file *seq, struct ip_sf_l
spin_unlock_bh(&state->im->lock);
state->im = state->im->next;
while (!state->im) {
- if (likely(state->idev != NULL))
- read_unlock(&state->idev->mc_list_lock);
-
state->dev = next_net_device_rcu(state->dev);
if (!state->dev) {
state->idev = NULL;
@@ -2556,8 +2545,7 @@ static struct ip_sf_list *igmp_mcf_get_next(struct seq_file *seq, struct ip_sf_l
state->idev = __in_dev_get_rcu(state->dev);
if (!state->idev)
continue;
- read_lock(&state->idev->mc_list_lock);
- state->im = state->idev->mc_list;
+ state->im = rcu_dereference(state->idev->mc_list);
}
if (!state->im)
break;
@@ -2603,10 +2591,7 @@ static void igmp_mcf_seq_stop(struct seq_file *seq, void *v)
spin_unlock_bh(&state->im->lock);
state->im = NULL;
}
- if (likely(state->idev != NULL)) {
- read_unlock(&state->idev->mc_list_lock);
- state->idev = NULL;
- }
+ state->idev = NULL;
state->dev = NULL;
rcu_read_unlock();
}

2010-11-12 21:19:14

by David Miller

[permalink] [raw]
Subject: Re: [PATCH net-next-2.6 V2] igmp: RCU conversion of in_dev->mc_list

From: Eric Dumazet <[email protected]>
Date: Fri, 12 Nov 2010 16:46:50 +0100

> [PATCH net-next-2.6 V2] igmp: RCU conversion of in_dev->mc_list
>
> in_dev->mc_list is protected by one rwlock (in_dev->mc_list_lock).
>
> This can easily be converted to a RCU protection.
>
> Writers hold RTNL, so mc_list_lock is removed, not replaced by a
> spinlock.
>
> Signed-off-by: Eric Dumazet <[email protected]>

This looks good to me, so I've applied it to net-next-2.6

We have enough time there to fix any fallout or revert if
necessary.

2010-11-13 06:25:26

by Cong Wang

[permalink] [raw]
Subject: Re: Kernel rwlock design, Multicore and IGMP

On Fri, Nov 12, 2010 at 09:00:17PM +0800, Yong Zhang wrote:
>On Fri, Nov 12, 2010 at 05:18:18PM +0800, Américo Wang wrote:
>> On Fri, Nov 12, 2010 at 05:09:45PM +0800, Yong Zhang wrote:
>> >On Fri, Nov 12, 2010 at 4:19 PM, Américo Wang <[email protected]> wrote:
>> >> On Fri, Nov 12, 2010 at 08:27:54AM +0100, Eric Dumazet wrote:
>> >>>Le vendredi 12 novembre 2010 à 15:13 +0800, Américo Wang a écrit :
>> >>>> On Fri, Nov 12, 2010 at 11:32:59AM +0800, Cypher Wu wrote:
>> >>>> >On Thu, Nov 11, 2010 at 11:23 PM, Eric Dumazet <[email protected]> wrote:
>> >>>> >> Le jeudi 11 novembre 2010 à 21:49 +0800, Cypher Wu a écrit :
>> >>>> >>
>> >>>> >> Hi
>> >>>> >>
>> >>>> >> CC netdev, since you ask questions about network stuff _and_ rwlock
>> >>>> >>
>> >>>> >>
>> >>>> >>> I'm using TILEPro and its rwlock in kernel is a liitle different than
>> >>>> >>> other platforms. It have a priority for write lock that when tried it
>> >>>> >>> will block the following read lock even if read lock is hold by
>> >>>> >>> others. Its code can be read in Linux Kernel 2.6.36 in
>> >>>> >>> arch/tile/lib/spinlock_32.c.
>> >>>> >>
>> >>>> >> This seems a bug to me.
>> >>>> >>
>> >>>> >> read_lock() can be nested. We used such a schem in the past in iptables
>> >>>> >> (it can re-enter itself),
>> >>>> >> and we used instead a spinlock(), but with many discussions with lkml
>> >>>> >> and Linus himself if I remember well.
>> >>>> >>
>> >>>> >It seems not a problem that read_lock() can be nested or not since
>> >>>> >rwlock doesn't have 'owner', it's just that should we give
>> >>>> >write_lock() a priority than read_lock() since if there have a lot
>> >>>> >read_lock()s then they'll starve write_lock().
>> >>>> >We should work out a well defined behavior so all the
>> >>>> >platform-dependent raw_rwlock has to design under that principle.
>> >>>>
>> >>>
>> >>>AFAIK, Lockdep allows read_lock() to be nested.
>> >>>
>> >>>> It is a known weakness of rwlock, it is designed like that. :)
>> >>>>
>> >>>
>> >>>Agreed.
>> >>>
>> >>
>> >> Just for record, both Tile and X86 implement rwlock with a write-bias,
>> >> this somewhat reduces the write-starvation problem.
>> >
>> >Are you sure(on x86)?
>> >
>> >It seems that we never realize writer-bias rwlock.
>> >
>>
>> Try
>>
>> % grep RW_LOCK_BIAS -nr arch/x86
>>
>> *And* read the code to see how it works. :)
>
>If read_lock()/write_lock() fails, the subtracted value(1 for
>read_lock() and RW_LOCK_BIAS for write_lock()) is added back.
>So reader and writer will contend on the same lock fairly.
>
>And a RW_LOCK_BIAS based rwlock is a variant of a signed-test
>rwlock, so it works in the same way as a highest-bit-set mode
>rwlock.
>
>Seem you're cheated by it's name(RW_LOCK_BIAS). :)

Ah, no, I made a mistake: I thought the initial value
of the rwlock was something like 0, but clearly it is RW_LOCK_BIAS.
Yeah, then there is certainly no bias toward writers, and x86
must be using almost the same algorithm as Tile.
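
For illustration, here is a rough C sketch of the counting scheme
described above, using kernel-style atomics (the names are made up;
this is not the actual x86 assembly):

#include <asm/atomic.h>
#include <asm/processor.h>	/* cpu_relax() */

#define SKETCH_RW_BIAS 0x01000000	/* plays the role of RW_LOCK_BIAS */

/* Readers subtract 1, a writer subtracts the whole bias; a failed
 * attempt adds its value back and waits, as described above.
 * The lock word starts at SKETCH_RW_BIAS.
 */
static void sketch_read_lock(atomic_t *lock)
{
	while (atomic_sub_return(1, lock) < 0) {
		atomic_add(1, lock);			/* undo and wait */
		while (atomic_read(lock) <= 0)
			cpu_relax();
	}
}

static void sketch_write_lock(atomic_t *lock)
{
	while (atomic_sub_return(SKETCH_RW_BIAS, lock) != 0) {
		atomic_add(SKETCH_RW_BIAS, lock);	/* undo and wait */
		while (atomic_read(lock) != SKETCH_RW_BIAS)
			cpu_relax();
	}
}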

--
Live like a child, think like the god.

2010-11-13 06:32:50

by Cong Wang

[permalink] [raw]
Subject: Re: Kernel rwlock design, Multicore and IGMP

On Fri, Nov 12, 2010 at 07:06:47PM +0800, Cypher Wu wrote:
>>
>> Note, on Tile, it uses a little different algorithm.
>>
>
>It seems that rwlock on x86 and tile have different behavior. x86 uses
>RW_LOCK_BIAS: when read_lock() is tried it tests whether the lock is 0,
>and if so the read_lock() has to spin, otherwise it decrements the lock;
>when write_lock() is tried it first checks whether the lock is
>RW_LOCK_BIAS, and if so it sets the lock to 0 and continues, otherwise
>it spins.
>I'm not very familiar with the x86 architecture, but the code seems to
>work that way.

No, they should be the same, sorry I made a mistake in the above reply.

Although Tile uses shifts in its implementation while x86 uses inc/dec,
the idea is the same: either writers use the higher bits and readers use
the lower bits, or vice versa.

--
Live like a child, think like the god.

2010-11-13 06:41:21

by Cong Wang

[permalink] [raw]
Subject: Re: [PATCH net-next-2.6 V2] igmp: RCU conversion of in_dev->mc_list

>
>Here is an updated version.
>
>[PATCH net-next-2.6 V2] igmp: RCU conversion of in_dev->mc_list
>
>in_dev->mc_list is protected by one rwlock (in_dev->mc_list_lock).
>
>This can easily be converted to a RCU protection.
>
>Writers hold RTNL, so mc_list_lock is removed, not replaced by a
>spinlock.

Ah, this saves much work.

>
>Signed-off-by: Eric Dumazet <[email protected]>
>Cc: Cypher Wu <[email protected]>
>Cc: Américo Wang <[email protected]>

I just did a quick look, it looks good to me.

Thanks!

2010-11-13 22:51:50

by Peter Zijlstra

[permalink] [raw]
Subject: Re: Kernel rwlock design, Multicore and IGMP

On Thu, 2010-11-11 at 21:49 +0800, Cypher Wu wrote:
> I'm using TILEPro and its rwlock in kernel is a liitle different than
> other platforms. It have a priority for write lock that when tried it
> will block the following read lock even if read lock is hold by
> others.

That's a bug, rwlock_t must most definitely be a read preference lock,
there's tons of code depending on that.

2010-11-13 22:53:33

by Peter Zijlstra

[permalink] [raw]
Subject: Re: Kernel rwlock design, Multicore and IGMP

On Fri, 2010-11-12 at 11:32 +0800, Cypher Wu wrote:
> It seems not a problem that read_lock() can be nested or not since
> rwlock doesn't have 'owner',

You're mistaken.

> it's just that should we give
> write_lock() a priority than read_lock() since if there have a lot
> read_lock()s then they'll starve write_lock().

We rely on that behaviour. FWIW write preference locks will starve
readers.

> We should work out a well defined behavior so all the
> platform-dependent raw_rwlock has to design under that principle.

We have that, all archs have read preference rwlock_t, they have to,
code relies on it.

2010-11-13 22:54:06

by Peter Zijlstra

[permalink] [raw]
Subject: Re: Kernel rwlock design, Multicore and IGMP

On Fri, 2010-11-12 at 16:19 +0800, Américo Wang wrote:
>
> Just for record, both Tile and X86 implement rwlock with a write-bias,
> this somewhat reduces the write-starvation problem.

x86 does no such thing.

2010-11-13 23:09:27

by Chris Metcalf

[permalink] [raw]
Subject: Re: Kernel rwlock design, Multicore and IGMP

On 11/12/2010 2:13 AM, Américo Wang wrote:
> On Fri, Nov 12, 2010 at 11:32:59AM +0800, Cypher Wu wrote:
>> On Thu, Nov 11, 2010 at 11:23 PM, Eric Dumazet <[email protected]> wrote:
>>> Le jeudi 11 novembre 2010 à 21:49 +0800, Cypher Wu a écrit :
>>>> I'm using TILEPro and its rwlock in kernel is a liitle different than
>>>> other platforms. It have a priority for write lock that when tried it
>>>> will block the following read lock even if read lock is hold by
>>>> others. Its code can be read in Linux Kernel 2.6.36 in
>>>> arch/tile/lib/spinlock_32.c.
>>>
>>> This seems a bug to me.
>>> [...]
>>>
>> It seems not a problem that read_lock() can be nested or not since
>> rwlock doesn't have 'owner', it's just that should we give
>> write_lock() a priority than read_lock() since if there have a lot
>> read_lock()s then they'll starve write_lock().
>> We should work out a well defined behavior so all the
>> platform-dependent raw_rwlock has to design under that principle.
>
> It is a known weakness of rwlock, it is designed like that. :)

Exactly. The tile rwlock correctly allows recursively reacquiring the read
lock. But it does give priority to writers, for the (unfortunately
incorrect) reasons Cypher Wu outlined above, e.g.:

- Core A takes a read lock
- Core B tries for a write lock and blocks new read locks
- Core A tries for a (recursive) read lock and blocks

Core A and B are now deadlocked.
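
In code, the problematic pattern looks roughly like this minimal sketch
(function names are illustrative, not the actual IGMP call chain):

#include <linux/spinlock.h>

static DEFINE_RWLOCK(mc_lock);		/* stands in for in_dev->mc_list_lock */

/* Core A: takes the read lock, then re-enters it while still holding it. */
static void core_a(void)
{
	read_lock(&mc_lock);
	/* ... builds a packet, which ends up back in code that does ... */
	read_lock(&mc_lock);	/* blocks if a writer queued up in between */
	read_unlock(&mc_lock);
	read_unlock(&mc_lock);
}

/* Core B: queues a write lock while core A holds the read lock. */
static void core_b(void)
{
	write_lock(&mc_lock);	/* with writer priority, this blocks NEW readers */
	/* ... modify the list ... */
	write_unlock(&mc_lock);
}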

The solution is actually to simplify the tile rwlock implementation so that
both readers and writers contend fairly for the lock.

I'll post a patch in the next day or two for tile.

--
Chris Metcalf, Tilera Corp.
http://www.tilera.com

2010-11-15 07:22:07

by Cyberman Wu

[permalink] [raw]
Subject: Re: Kernel rwlock design, Multicore and IGMP

On Sun, Nov 14, 2010 at 7:03 AM, Chris Metcalf <[email protected]> wrote:
> On 11/12/2010 2:13 AM, Américo Wang wrote:
>> On Fri, Nov 12, 2010 at 11:32:59AM +0800, Cypher Wu wrote:
>>> On Thu, Nov 11, 2010 at 11:23 PM, Eric Dumazet <[email protected]> wrote:
>>> Le jeudi 11 novembre 2010 à 21:49 +0800, Cypher Wu a écrit :
>>>>> I'm using TILEPro and its rwlock in kernel is a liitle different than
>>>>> other platforms. It have a priority for write lock that when tried it
>>>>> will block the following read lock even if read lock is hold by
>>>>> others. Its code can be read in Linux Kernel 2.6.36 in
>>>>> arch/tile/lib/spinlock_32.c.
>>>>
>>>> This seems a bug to me.
>>>> [...]
>>>>
>>> It seems not a problem that read_lock() can be nested or not since
>>> rwlock doesn't have 'owner', it's just that should we give
>>> write_lock() a priority than read_lock() since if there have a lot
>>> read_lock()s then they'll starve write_lock().
>>> We should work out a well defined behavior so all the
>>> platform-dependent raw_rwlock has to design under that principle.
>>
>> It is a known weakness of rwlock, it is designed like that. :)
>
> Exactly. The tile rwlock correctly allows recursively reacquiring the read
> lock. But it does give priority to writers, for the (unfortunately
> incorrect) reasons Cypher Wu outlined above, e.g.:
>
> - Core A takes a read lock
> - Core B tries for a write lock and blocks new read locks
> - Core A tries for a (recursive) read lock and blocks
>
> Core A and B are now deadlocked.
>
> The solution is actually to simplify the tile rwlock implementation so that
> both readers and writers contend fairly for the lock.
>
> I'll post a patch in the next day or two for tile.
>
> --
> Chris Metcalf, Tilera Corp.
> http://www.tilera.com
>

We're looking forward to your patch. BTW: could you fix it up in
Linux 2.6.26.7, which is not released in the normal kernel?

--
Cyberman Wu
http://www.meganovo.com

2010-11-15 11:18:41

by Cyberman Wu

[permalink] [raw]
Subject: Re: Kernel rwlock design, Multicore and IGMP

In that post I wanted to confirm another thing: if we join/leave on
different cores, every call will start the timer for the IGMP message
using the same in_dev->mc_list; could that be optimized?

--
Cyberman Wu
http://www.meganovo.com

2010-11-15 11:31:18

by Eric Dumazet

[permalink] [raw]
Subject: Re: Kernel rwlock design, Multicore and IGMP

Le lundi 15 novembre 2010 à 19:18 +0800, Cypher Wu a écrit :
> In that post I wanted to confirm another thing: if we join/leave on
> different cores, every call will start the timer for the IGMP message
> using the same in_dev->mc_list; could that be optimized?
>

Which timer exactly ? Is it a real scalability problem ?

I believe RTNL would be the blocking point actually...


2010-11-15 14:25:33

by Chris Metcalf

[permalink] [raw]
Subject: [PATCH] arch/tile: fix rwlock so would-be write lockers don't block new readers

This avoids a deadlock in the IGMP code where one core gets a read
lock, another core starts trying to get a write lock (thus blocking
new readers), and then the first core tries to recursively re-acquire
the read lock.

We still try to preserve some degree of balance by giving priority
to additional write lockers that come along while the lock is held
for write, so they can all complete quickly and return the lock to
the readers.

Signed-off-by: Chris Metcalf <[email protected]>
---
This should apply relatively cleanly to 2.6.26.7 source code too.

arch/tile/lib/spinlock_32.c | 29 ++++++++++++++++++-----------
1 files changed, 18 insertions(+), 11 deletions(-)

diff --git a/arch/tile/lib/spinlock_32.c b/arch/tile/lib/spinlock_32.c
index 485e24d..5cd1c40 100644
--- a/arch/tile/lib/spinlock_32.c
+++ b/arch/tile/lib/spinlock_32.c
@@ -167,23 +167,30 @@ void arch_write_lock_slow(arch_rwlock_t *rwlock, u32 val)
* when we compare them.
*/
u32 my_ticket_;
+ u32 iterations = 0;

- /* Take out the next ticket; this will also stop would-be readers. */
- if (val & 1)
- val = get_rwlock(rwlock);
- rwlock->lock = __insn_addb(val, 1 << WR_NEXT_SHIFT);
+ /*
+ * Wait until there are no readers, then bump up the next
+ * field and capture the ticket value.
+ */
+ for (;;) {
+ if (!(val & 1)) {
+ if ((val >> RD_COUNT_SHIFT) == 0)
+ break;
+ rwlock->lock = val;
+ }
+ delay_backoff(iterations++);
+ val = __insn_tns((int *)&rwlock->lock);
+ }

- /* Extract my ticket value from the original word. */
+ /* Take out the next ticket and extract my ticket value. */
+ rwlock->lock = __insn_addb(val, 1 << WR_NEXT_SHIFT);
my_ticket_ = val >> WR_NEXT_SHIFT;

- /*
- * Wait until the "current" field matches our ticket, and
- * there are no remaining readers.
- */
+ /* Wait until the "current" field matches our ticket. */
for (;;) {
u32 curr_ = val >> WR_CURR_SHIFT;
- u32 readers = val >> RD_COUNT_SHIFT;
- u32 delta = ((my_ticket_ - curr_) & WR_MASK) + !!readers;
+ u32 delta = ((my_ticket_ - curr_) & WR_MASK);
if (likely(delta == 0))
break;

--
1.6.5.2

2010-11-15 14:52:18

by Eric Dumazet

[permalink] [raw]
Subject: Re: [PATCH] arch/tile: fix rwlock so would-be write lockers don't block new readers

Le lundi 15 novembre 2010 à 09:18 -0500, Chris Metcalf a écrit :
> This avoids a deadlock in the IGMP code where one core gets a read
> lock, another core starts trying to get a write lock (thus blocking
> new readers), and then the first core tries to recursively re-acquire
> the read lock.
>
> We still try to preserve some degree of balance by giving priority
> to additional write lockers that come along while the lock is held
> for write, so they can all complete quickly and return the lock to
> the readers.
>
> Signed-off-by: Chris Metcalf <[email protected]>
> ---
> This should apply relatively cleanly to 2.6.26.7 source code too.
>
> arch/tile/lib/spinlock_32.c | 29 ++++++++++++++++++-----------
> 1 files changed, 18 insertions(+), 11 deletions(-)
>
> diff --git a/arch/tile/lib/spinlock_32.c b/arch/tile/lib/spinlock_32.c
> index 485e24d..5cd1c40 100644
> --- a/arch/tile/lib/spinlock_32.c
> +++ b/arch/tile/lib/spinlock_32.c
> @@ -167,23 +167,30 @@ void arch_write_lock_slow(arch_rwlock_t *rwlock, u32 val)
> * when we compare them.
> */
> u32 my_ticket_;
> + u32 iterations = 0;
>
> - /* Take out the next ticket; this will also stop would-be readers. */
> - if (val & 1)
> - val = get_rwlock(rwlock);
> - rwlock->lock = __insn_addb(val, 1 << WR_NEXT_SHIFT);
> + /*
> + * Wait until there are no readers, then bump up the next
> + * field and capture the ticket value.
> + */
> + for (;;) {
> + if (!(val & 1)) {
> + if ((val >> RD_COUNT_SHIFT) == 0)
> + break;
> + rwlock->lock = val;
> + }
> + delay_backoff(iterations++);

Are you sure a writer should have a growing delay_backoff() ?

It seems to me this only allows new readers to come in (so adding more
unfairness to the rwlock, which already favors readers against writer[s])

Maybe allow one cpu to spin, and eventually other 'writers' be queued ?

> + val = __insn_tns((int *)&rwlock->lock);
> + }
>
> - /* Extract my ticket value from the original word. */
> + /* Take out the next ticket and extract my ticket value. */
> + rwlock->lock = __insn_addb(val, 1 << WR_NEXT_SHIFT);
> my_ticket_ = val >> WR_NEXT_SHIFT;
>
> - /*
> - * Wait until the "current" field matches our ticket, and
> - * there are no remaining readers.
> - */
> + /* Wait until the "current" field matches our ticket. */
> for (;;) {
> u32 curr_ = val >> WR_CURR_SHIFT;
> - u32 readers = val >> RD_COUNT_SHIFT;
> - u32 delta = ((my_ticket_ - curr_) & WR_MASK) + !!readers;
> + u32 delta = ((my_ticket_ - curr_) & WR_MASK);
> if (likely(delta == 0))
> break;
>

2010-11-15 15:10:59

by Chris Metcalf

[permalink] [raw]
Subject: Re: [PATCH] arch/tile: fix rwlock so would-be write lockers don't block new readers

On 11/15/2010 9:52 AM, Eric Dumazet wrote:
> Le lundi 15 novembre 2010 à 09:18 -0500, Chris Metcalf a écrit :
>> This avoids a deadlock in the IGMP code where one core gets a read
>> lock, another core starts trying to get a write lock (thus blocking
>> new readers), and then the first core tries to recursively re-acquire
>> the read lock.
>>
>> We still try to preserve some degree of balance by giving priority
>> to additional write lockers that come along while the lock is held
>> for write, so they can all complete quickly and return the lock to
>> the readers.
>>
>> Signed-off-by: Chris Metcalf <[email protected]>
>> ---
>> This should apply relatively cleanly to 2.6.26.7 source code too.
>>
>> arch/tile/lib/spinlock_32.c | 29 ++++++++++++++++++-----------
>> 1 files changed, 18 insertions(+), 11 deletions(-)
>>
>> diff --git a/arch/tile/lib/spinlock_32.c b/arch/tile/lib/spinlock_32.c
>> index 485e24d..5cd1c40 100644
>> --- a/arch/tile/lib/spinlock_32.c
>> +++ b/arch/tile/lib/spinlock_32.c
>> @@ -167,23 +167,30 @@ void arch_write_lock_slow(arch_rwlock_t *rwlock, u32 val)
>> * when we compare them.
>> */
>> u32 my_ticket_;
>> + u32 iterations = 0;
>>
>> - /* Take out the next ticket; this will also stop would-be readers. */
>> - if (val & 1)
>> - val = get_rwlock(rwlock);
>> - rwlock->lock = __insn_addb(val, 1 << WR_NEXT_SHIFT);
>> + /*
>> + * Wait until there are no readers, then bump up the next
>> + * field and capture the ticket value.
>> + */
>> + for (;;) {
>> + if (!(val & 1)) {
>> + if ((val >> RD_COUNT_SHIFT) == 0)
>> + break;
>> + rwlock->lock = val;
>> + }
>> + delay_backoff(iterations++);
> Are you sure a writer should have a growing delay_backoff() ?

We always do this bounded exponential backoff on all locking operations
that require memory-network traffic. With 64 cores, it's possible
otherwise to get into a situation where the cores are attempting to acquire
the lock sufficiently aggressively that lock acquisition performance is
worse than it would be with backoff. In any case this path is unlikely to
run many times, since it only triggers if two cores both try to pull a
ticket at the same time; it doesn't correspond to writers actually waiting
once they have their ticket, which is handled later in this function.
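
As a generic sketch of what a bounded exponential backoff like this
looks like (not the actual tile delay_backoff() implementation):

#include <asm/processor.h>	/* cpu_relax() */

/* Spin locally for roughly 2^iterations iterations, capped, so that
 * contending cores back off instead of hammering the lock's cache line.
 */
static inline void sketch_delay_backoff(unsigned int iterations)
{
	unsigned int shift = iterations < 12 ? iterations : 12;
	unsigned int loops = 1u << shift;

	while (loops--)
		cpu_relax();
}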

> It seems to me this only allows new readers to come in (so adding more
> unfairness to the rwlock, which already favors readers against writer[s])

Well, that is apparently the required semantic. Once there is one reader,
you must allow new readers, to handle the case of recursive
re-acquisition. In principle you could imagine doing something like having
a bitmask of cores that held the readlock (instead of a count), and only
allowing recursive re-acquisition when a write lock request is pending, but
this would make the lock structure bigger, and I'm not sure it's worth it.
x86 certainly doesn't bother.
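
For illustration only, the admission rule of such a bitmask scheme might
look like the following sketch (hypothetical names, not a proposal for
the actual lock):

#include <linux/types.h>

/* Admit a new read lock unless a writer is active, or a writer is
 * waiting and this cpu does not already hold the read lock (the
 * recursive re-acquisition case).
 */
static bool may_take_read_lock(u64 reader_cpus, bool writer_active,
			       bool writer_waiting, int cpu)
{
	if (writer_active)
		return false;
	if (writer_waiting && !(reader_cpus & (1ULL << cpu)))
		return false;
	return true;
}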

> Maybe allow one cpu to spin, and eventually other 'writers' be queued ?

Other than the brief spin to acquire the ticket, writers don't actually
spin on the lock. They just wait for their ticket to come up. This does
require spinning on memory reads, but those are satisfied out of the local
core's cache, with the exception that each time a writer completes, the
cache line is invalidated on the readers, and they have to re-fetch it from
the home cache.

>> + val = __insn_tns((int *)&rwlock->lock);
>> + }
>>
>> - /* Extract my ticket value from the original word. */
>> + /* Take out the next ticket and extract my ticket value. */
>> + rwlock->lock = __insn_addb(val, 1 << WR_NEXT_SHIFT);
>> my_ticket_ = val >> WR_NEXT_SHIFT;
>>
>> - /*
>> - * Wait until the "current" field matches our ticket, and
>> - * there are no remaining readers.
>> - */
>> + /* Wait until the "current" field matches our ticket. */
>> for (;;) {
>> u32 curr_ = val >> WR_CURR_SHIFT;
>> - u32 readers = val >> RD_COUNT_SHIFT;
>> - u32 delta = ((my_ticket_ - curr_) & WR_MASK) + !!readers;
>> + u32 delta = ((my_ticket_ - curr_) & WR_MASK);
>> if (likely(delta == 0))
>> break;
>>
>

--
Chris Metcalf, Tilera Corp.
http://www.tilera.com

2010-11-17 01:30:48

by Cyberman Wu

[permalink] [raw]
Subject: Re: Kernel rwlock design, Multicore and IGMP

On Mon, Nov 15, 2010 at 7:31 PM, Eric Dumazet <[email protected]> wrote:
> Le lundi 15 novembre 2010 à 19:18 +0800, Cypher Wu a écrit :
>> In that post I wanted to confirm another thing: if we join/leave on
>> different cores, every call will start the timer for the IGMP message
>> using the same in_dev->mc_list; could that be optimized?
>>
>
> Which timer exactly ? Is it a real scalability problem ?
>
> I believe RTNL would be the blocking point actually...
>
>
>
>

struct in_device::mr_ifc_timer. Every time a process joins/leaves an
MC group, igmp_ifc_event() -> igmp_ifc_start_timer() will start the
timer on the core that issued the system call, and
igmp_ifc_timer_expire() will be called on that core in the bottom half
of the timer interrupt.
If we call join/leave on multiple cores, timers will run on all these
cores, but it seems only one or two will generate an IGMP message; the
others will only lock the list and loop through it with nothing generated.


--
Cyberman Wu
http://www.meganovo.com

2010-11-17 04:43:45

by Eric Dumazet

[permalink] [raw]
Subject: Re: Kernel rwlock design, Multicore and IGMP

Le mercredi 17 novembre 2010 à 09:30 +0800, Cypher Wu a écrit :
> O
> struct in_device::mr_ifc_timer. Every time a process joins/leaves an
> MC group, igmp_ifc_event() -> igmp_ifc_start_timer() will start the
> timer on the core that issued the system call, and
> igmp_ifc_timer_expire() will be called on that core in the bottom half
> of the timer interrupt.
> If we call join/leave on multiple cores, timers will run on all these
> cores, but it seems only one or two will generate an IGMP message; the
> others will only lock the list and loop through it with nothing generated.
>
>

The problem would not be the timer being restarted on different cores
(very small impact), but scanning a list with many items in
igmpv3_send_cr() and doing expensive things under the timer handler
(softirq), thus adding latency spikes.

IGMP_Unsolicited_Report_Interval is 10 seconds, so we start the timer
with a 5 second average delay.

I am not sure there is a need to join/leave thousands of groups per
second anyway...


2010-11-22 05:39:53

by Cyberman Wu

[permalink] [raw]
Subject: Re: [PATCH] arch/tile: fix rwlock so would-be write lockers don't block new readers

2010/11/15 Chris Metcalf <[email protected]>:
> This avoids a deadlock in the IGMP code where one core gets a read
> lock, another core starts trying to get a write lock (thus blocking
> new readers), and then the first core tries to recursively re-acquire
> the read lock.
>
> We still try to preserve some degree of balance by giving priority
> to additional write lockers that come along while the lock is held
> for write, so they can all complete quickly and return the lock to
> the readers.
>
> Signed-off-by: Chris Metcalf <[email protected]>
> ---
> This should apply relatively cleanly to 2.6.26.7 source code too.
>
> arch/tile/lib/spinlock_32.c | 29 ++++++++++++++++++-----------
> 1 files changed, 18 insertions(+), 11 deletions(-)
>
> diff --git a/arch/tile/lib/spinlock_32.c b/arch/tile/lib/spinlock_32.c
> index 485e24d..5cd1c40 100644
> --- a/arch/tile/lib/spinlock_32.c
> +++ b/arch/tile/lib/spinlock_32.c
> @@ -167,23 +167,30 @@ void arch_write_lock_slow(arch_rwlock_t *rwlock, u32 val)
> * when we compare them.
> */
> u32 my_ticket_;
> + u32 iterations = 0;
>
> - /* Take out the next ticket; this will also stop would-be readers. */
> - if (val & 1)
> - val = get_rwlock(rwlock);
> - rwlock->lock = __insn_addb(val, 1 << WR_NEXT_SHIFT);
> + /*
> + * Wait until there are no readers, then bump up the next
> + * field and capture the ticket value.
> + */
> + for (;;) {
> + if (!(val & 1)) {
> + if ((val >> RD_COUNT_SHIFT) == 0)
> + break;
> + rwlock->lock = val;
> + }
> + delay_backoff(iterations++);
> + val = __insn_tns((int *)&rwlock->lock);
> + }
>
> - /* Extract my ticket value from the original word. */
> + /* Take out the next ticket and extract my ticket value. */
> + rwlock->lock = __insn_addb(val, 1 << WR_NEXT_SHIFT);
> my_ticket_ = val >> WR_NEXT_SHIFT;
>
> - /*
> - * Wait until the "current" field matches our ticket, and
> - * there are no remaining readers.
> - */
> + /* Wait until the "current" field matches our ticket. */
> for (;;) {
> u32 curr_ = val >> WR_CURR_SHIFT;
> - u32 readers = val >> RD_COUNT_SHIFT;
> - u32 delta = ((my_ticket_ - curr_) & WR_MASK) + !!readers;
> + u32 delta = ((my_ticket_ - curr_) & WR_MASK);
> if (likely(delta == 0))
> break;
>
> --
> 1.6.5.2
>
>


I've finished my business trip and tested that patch for more than an
hour and it works. The test is still running now.

But it seems there is still a potential problem: we use a ticket lock
for write_lock(), and if many write_lock()s occur, are 256 tickets
enough for 64 or even more cores to avoid overflow?
Since if we try to write_unlock() and there's already a write_lock()
waiting, we'll only advance the current ticket.


--
Cyberman Wu
http://www.meganovo.com

2010-11-22 13:41:04

by Chris Metcalf

[permalink] [raw]
Subject: Re: [PATCH] arch/tile: fix rwlock so would-be write lockers don't block new readers

On 11/22/2010 12:39 AM, Cypher Wu wrote:
> 2010/11/15 Chris Metcalf <[email protected]>:
>> This avoids a deadlock in the IGMP code where one core gets a read
>> lock, another core starts trying to get a write lock (thus blocking
>> new readers), and then the first core tries to recursively re-acquire
>> the read lock.
>>
>> We still try to preserve some degree of balance by giving priority
>> to additional write lockers that come along while the lock is held
>> for write, so they can all complete quickly and return the lock to
>> the readers.
>>
>> Signed-off-by: Chris Metcalf <[email protected]>
>> ---
>> This should apply relatively cleanly to 2.6.26.7 source code too.
>> [...]
>
> I've finished my business trip and tested that patch for more than an
> hour and it works. The test is still running now.
>
> But it seems there is still a potential problem: we use a ticket lock
> for write_lock(), and if many write_lock()s occur, are 256 tickets
> enough for 64 or even more cores to avoid overflow?
> Since if we try to write_unlock() and there's already a write_lock()
> waiting, we'll only advance the current ticket.

This is OK, since each core can issue at most one (blocking) write_lock(),
and we have only 64 cores. Future >256 core machines will be based on
TILE-Gx anyway, which doesn't have the 256-core limit since it doesn't use
the spinlock_32.c implementation.

--
Chris Metcalf, Tilera Corp.
http://www.tilera.com

2010-11-23 01:36:34

by Cyberman Wu

[permalink] [raw]
Subject: Re: [PATCH] arch/tile: fix rwlock so would-be write lockers don't block new readers

2010/11/22 Chris Metcalf <[email protected]>:
> On 11/22/2010 12:39 AM, Cypher Wu wrote:
>> 2010/11/15 Chris Metcalf <[email protected]>:
>>> This avoids a deadlock in the IGMP code where one core gets a read
>>> lock, another core starts trying to get a write lock (thus blocking
>>> new readers), and then the first core tries to recursively re-acquire
>>> the read lock.
>>>
>>> We still try to preserve some degree of balance by giving priority
>>> to additional write lockers that come along while the lock is held
>>> for write, so they can all complete quickly and return the lock to
>>> the readers.
>>>
>>> Signed-off-by: Chris Metcalf <[email protected]>
>>> ---
>>> This should apply relatively cleanly to 2.6.26.7 source code too.
>>> [...]
>>
>> I've finished my business trip and tested that patch for more than an
>> hour and it works. The test is still running now.
>>
>> But it seems there is still a potential problem: we use a ticket lock
>> for write_lock(), and if many write_lock()s occur, are 256 tickets
>> enough for 64 or even more cores to avoid overflow?
>> Since if we try to write_unlock() and there's already a write_lock()
>> waiting, we'll only advance the current ticket.
>
> This is OK, since each core can issue at most one (blocking) write_lock(),
> and we have only 64 cores. Future >256 core machines will be based on
> TILE-Gx anyway, which doesn't have the 256-core limit since it doesn't use
> the spinlock_32.c implementation.
>
> --
> Chris Metcalf, Tilera Corp.
> http://www.tilera.com
>
>

Say core A tries to write_lock() the rwlock when current_ticket_ is 0,
so it writes next_ticket_ to 1; while it is processing under the lock,
core B tries write_lock() and writes next_ticket_ to 2. Then when A
calls write_unlock() it sees that (current_ticket_+1) is not equal to
next_ticket_, so it increments current_ticket_, and core B gets the
lock. If core A tries write_lock() again before core B's write_unlock(),
it will increment next_ticket_ to 3. And so on.
This may rarely happen; I tested it yesterday for several hours and
it runs very well under pressure.


--
Cyberman Wu
http://www.meganovo.com

2010-11-23 21:02:34

by Chris Metcalf

[permalink] [raw]
Subject: Re: [PATCH] arch/tile: fix rwlock so would-be write lockers don't block new readers

On 11/22/2010 8:36 PM, Cypher Wu wrote:
> Say core A tries to write_lock() the rwlock when current_ticket_ is 0,
> so it writes next_ticket_ to 1; while it is processing under the lock,
> core B tries write_lock() and writes next_ticket_ to 2. Then when A
> calls write_unlock() it sees that (current_ticket_+1) is not equal to
> next_ticket_, so it increments current_ticket_, and core B gets the
> lock. If core A tries write_lock() again before core B's write_unlock(),
> it will increment next_ticket_ to 3. And so on.
> This may rarely happen; I tested it yesterday for several hours and
> it runs very well under pressure.

This should be OK when it happens (other than starving out the readers, but
that was the decision made by doing a ticket lock in the first place).
Even if we wrap around 255 back to zero on the tickets, the ticket queue
will work correctly. The key is not to need more than 256 concurrent write
lock waiters, which we don't.

--
Chris Metcalf, Tilera Corp.
http://www.tilera.com

2010-11-24 02:53:20

by Cyberman Wu

[permalink] [raw]
Subject: Re: [PATCH] arch/tile: fix rwlock so would-be write lockers don't block new readers

2010/11/24 Chris Metcalf <[email protected]>:
> On 11/22/2010 8:36 PM, Cypher Wu wrote:
>> Say core A tries to write_lock() the rwlock when current_ticket_ is 0,
>> so it writes next_ticket_ to 1; while it is processing under the lock,
>> core B tries write_lock() and writes next_ticket_ to 2. Then when A
>> calls write_unlock() it sees that (current_ticket_+1) is not equal to
>> next_ticket_, so it increments current_ticket_, and core B gets the
>> lock. If core A tries write_lock() again before core B's write_unlock(),
>> it will increment next_ticket_ to 3. And so on.
>> This may rarely happen; I tested it yesterday for several hours and
>> it runs very well under pressure.
>
> This should be OK when it happens (other than starving out the readers, but
> that was the decision made by doing a ticket lock in the first place).
> Even if we wrap around 255 back to zero on the tickets, the ticket queue
> will work correctly. The key is not to need more than 256 concurrent write
> lock waiters, which we don't.
>
> --
> Chris Metcalf, Tilera Corp.
> http://www.tilera.com
>
>

If we count on that, should we make 'my_ticket_ = (val >>
WR_NEXT_SHIFT) & WR_MASK;'?

--
Cyberman Wu
http://www.meganovo.com

2010-11-24 14:09:14

by Chris Metcalf

[permalink] [raw]
Subject: Re: [PATCH] arch/tile: fix rwlock so would-be write lockers don't block new readers

On 11/23/2010 9:53 PM, Cypher Wu wrote:
> 2010/11/24 Chris Metcalf <[email protected]>:
>> On 11/22/2010 8:36 PM, Cypher Wu wrote:
>>> Say core A tries to write_lock() the rwlock when current_ticket_ is 0,
>>> so it writes next_ticket_ to 1; while it is processing under the lock,
>>> core B tries write_lock() and writes next_ticket_ to 2. Then when A
>>> calls write_unlock() it sees that (current_ticket_+1) is not equal to
>>> next_ticket_, so it increments current_ticket_, and core B gets the
>>> lock. If core A tries write_lock() again before core B's write_unlock(),
>>> it will increment next_ticket_ to 3. And so on.
>>> This may rarely happen; I tested it yesterday for several hours and
>>> it runs very well under pressure.
>> This should be OK when it happens (other than starving out the readers, but
>> that was the decision made by doing a ticket lock in the first place).
>> Even if we wrap around 255 back to zero on the tickets, the ticket queue
>> will work correctly. The key is not to need more than 256 concurrent write
>> lock waiters, which we don't.
> If we count on that, should we make 'my_ticket_ = (val >>
> WR_NEXT_SHIFT) & WR_MASK;'

No, it's OK. As the comment for the declaration of "my_ticket_" says, the
trailing underscore reminds us that the high bits are garbage, and when we
use the value, we do the mask: "((my_ticket_ - curr_) & WR_MASK)". It
turned out doing the mask here made the most sense from a code-generation
point of view, partly just because of the possibility of the counter wrapping.
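
A quick worked example of why the mask handles wrap, assuming an 8-bit
ticket field (WR_MASK == 0xff, consistent with the 255 -> 0 wrap
mentioned above):

/* Userspace demo of the modular ticket arithmetic. */
#include <stdio.h>

int main(void)
{
	unsigned int wr_mask = 0xff;
	unsigned int curr = 254;	/* ticket currently being served */
	unsigned int mine = 2;		/* our ticket, taken after the wrap */
	unsigned int delta = (mine - curr) & wr_mask;

	printf("delta = %u\n", delta);	/* prints 4: four unlocks until our turn */
	return 0;
}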

--
Chris Metcalf, Tilera Corp.
http://www.tilera.com

2010-11-24 16:37:51

by Cyberman Wu

[permalink] [raw]
Subject: Re: [PATCH] arch/tile: fix rwlock so would-be write lockers don't block new readers

2010/11/24 Chris Metcalf <[email protected]>:
> On 11/23/2010 9:53 PM, Cypher Wu wrote:
>> 2010/11/24 Chris Metcalf <[email protected]>:
>>> On 11/22/2010 8:36 PM, Cypher Wu wrote:
>>>> Say core A tries to write_lock() the rwlock when current_ticket_ is 0,
>>>> so it writes next_ticket_ to 1; while it is processing under the lock,
>>>> core B tries write_lock() and writes next_ticket_ to 2. Then when A
>>>> calls write_unlock() it sees that (current_ticket_+1) is not equal to
>>>> next_ticket_, so it increments current_ticket_, and core B gets the
>>>> lock. If core A tries write_lock() again before core B's write_unlock(),
>>>> it will increment next_ticket_ to 3. And so on.
>>>> This may rarely happen; I tested it yesterday for several hours and
>>>> it runs very well under pressure.
>>> This should be OK when it happens (other than starving out the readers, but
>>> that was the decision made by doing a ticket lock in the first place).
>>> Even if we wrap around 255 back to zero on the tickets, the ticket queue
>>> will work correctly. The key is not to need more than 256 concurrent write
>>> lock waiters, which we don't.
>> If we count on that, should we make 'my_ticket_ = (val >>
>> WR_NEXT_SHIFT) & WR_MASK;'
>
> No, it's OK. As the comment for the declaration of "my_ticket_" says, the
> trailing underscore reminds us that the high bits are garbage, and when we
> use the value, we do the mask: "((my_ticket_ - curr_) & WR_MASK)". It
> turned out doing the mask here made the most sense from a code-generation
> point of view, partly just because of the possibility of the counter wrapping.
>
> --
> Chris Metcalf, Tilera Corp.
> http://www.tilera.com
>
>

If wrap does occur, direct subtraction may cause a problem, but
write_lock() usually only protects very little code, and since two
issues of that call on the same core take so many cycles, the time
elapsed will be enough that wrap will never occur.


--
Cyberman Wu
http://www.meganovo.com