Running a webserver benchmark on 2.5.57, I saw some things in the
profiles that I wasn't familliar with. Here's readprofile | sort | tail:
717 handle_IRQ_event 7.7935
732 do_gettimeofday 5.4627
755 __wake_up 10.7857
1089 del_timer 11.8370
1112 add_timer 7.8310
1500 system_call 34.0909
1847 schedule 2.4759
1873 kmap_atomic 12.8288
267475 __write_lock_failed 8358.5938
379837 __read_lock_failed 18991.8500
1044613 poll_idle 18010.5690
1704762 total 10.9631
I added a section at the ends of _raw_write_lock and _raw_read_lock
that look like this:
{
static int count = 0;
if( ++count%50000 == 0 ) {
printk("%s:%s() %d\n", __stringify(KBUILD_BASENAME),
__FUNCTION__, count );
dump_stack();
}
}
file_table:_raw_read_lock() 3300000
Call Trace:
[<c0152469>] fget+0x9d/0xa0
[<c0152b27>] sys_fsync+0x21/0xbe
[<c0151b53>] sys_writev+0x47/0x56
[<c010931f>] syscall_call+0x7/0xb
filemap:_raw_read_lock() 1450000
Call Trace:
[<c0136937>] do_generic_mapping_read+0x411/0x43e
[<c0136d98>] file_send_actor+0x0/0x74
[<c0136e74>] generic_file_sendfile+0x68/0x76
[<c0136d98>] file_send_actor+0x0/0x74
[<c0151d48>] do_sendfile+0x1e6/0x28a
[<c0136d98>] file_send_actor+0x0/0x74
[<c0151e50>] sys_sendfile+0x64/0xcc
[<c010931f>] syscall_call+0x7/0xb
ip_output:_raw_read_lock() 2000000
Call Trace:
[<c02c90b2>] ip_finish_output2+0x154/0x226
[<c02c7466>] ip_queue_xmit+0x3dc/0x4ce
[<c011c71a>] default_wake_function+0x32/0x3e
[<c011c75e>] __wake_up_common+0x38/0x58
[<c02ddf24>] tcp_v4_send_check+0x54/0xe2
[<c02d81b6>] tcp_transmit_skb+0x2be/0x448
[<c02d54ca>] tcp_data_queue+0x23a/0x830
[<c02da693>] tcp_send_ack+0x81/0xb2
[<c02d68d1>] tcp_rcv_established+0x249/0x70e
[<c02defd1>] tcp_v4_do_rcv+0x12d/0x132
[<c02df452>] tcp_v4_rcv+0x47c/0x50c
[<c02c4363>] ip_local_deliver_finish+0x9f/0x19e
[<c02c4674>] ip_rcv_finish+0x212/0x29f
[<c02b410e>] netif_receive_skb+0xc2/0x17c
[<c02b4245>] process_backlog+0x7d/0x10c
[<c02b4395>] net_rx_action+0xc1/0x178
[<c01248e7>] do_softirq+0xb7/0xba
[<c010b390>] do_IRQ+0xec/0xf8
[<c0106eca>] default_idle+0x0/0x2e
time:_raw_write_lock() 1350000
Call Trace:
[<c010f321>] timer_interrupt+0x99/0x9c
[<c010b150>] handle_IRQ_event+0x38/0x5c
[<c010b330>] do_IRQ+0x8c/0xf8
[<c0106eca>] default_idle+0x0/0x2e
[<c0106eca>] default_idle+0x0/0x2e
[<c0109c8c>] common_interrupt+0x18/0x20
[<c0106eca>] default_idle+0x0/0x2e
[<c0106eca>] default_idle+0x0/0x2e
[<c0106ef4>] default_idle+0x2a/0x2e
[<c0106f6b>] cpu_idle+0x39/0x42
[<c01212a5>] printk+0x15d/0x190
--
Dave Hansen
[email protected]
On Wed, Jan 15, 2003 at 08:18:13PM -0800, Dave Hansen wrote:
> file_table:_raw_read_lock() 3300000
> Call Trace:
> [<c0152469>] fget+0x9d/0xa0
> [<c0152b27>] sys_fsync+0x21/0xbe
> [<c0151b53>] sys_writev+0x47/0x56
> [<c010931f>] syscall_call+0x7/0xb
read_lock(&file->files_lock);
On Wed, Jan 15, 2003 at 08:18:13PM -0800, Dave Hansen wrote:
> filemap:_raw_read_lock() 1450000
> Call Trace:
> [<c0136937>] do_generic_mapping_read+0x411/0x43e
> [<c0136d98>] file_send_actor+0x0/0x74
> [<c0136e74>] generic_file_sendfile+0x68/0x76
> [<c0136d98>] file_send_actor+0x0/0x74
> [<c0151d48>] do_sendfile+0x1e6/0x28a
> [<c0136d98>] file_send_actor+0x0/0x74
> [<c0151e50>] sys_sendfile+0x64/0xcc
> [<c010931f>] syscall_call+0x7/0xb
read_lock(&mapping->page_lock);
On Wed, Jan 15, 2003 at 08:18:13PM -0800, Dave Hansen wrote:
> ip_output:_raw_read_lock() 2000000
> Call Trace:
> [<c02c90b2>] ip_finish_output2+0x154/0x226
> [<c02c7466>] ip_queue_xmit+0x3dc/0x4ce
> [<c011c71a>] default_wake_function+0x32/0x3e
> [<c011c75e>] __wake_up_common+0x38/0x58
> [<c02ddf24>] tcp_v4_send_check+0x54/0xe2
> [<c02d81b6>] tcp_transmit_skb+0x2be/0x448
> [<c02d54ca>] tcp_data_queue+0x23a/0x830
> [<c02da693>] tcp_send_ack+0x81/0xb2
> [<c02d68d1>] tcp_rcv_established+0x249/0x70e
> [<c02defd1>] tcp_v4_do_rcv+0x12d/0x132
> [<c02df452>] tcp_v4_rcv+0x47c/0x50c
> [<c02c4363>] ip_local_deliver_finish+0x9f/0x19e
> [<c02c4674>] ip_rcv_finish+0x212/0x29f
> [<c02b410e>] netif_receive_skb+0xc2/0x17c
> [<c02b4245>] process_backlog+0x7d/0x10c
> [<c02b4395>] net_rx_action+0xc1/0x178
> [<c01248e7>] do_softirq+0xb7/0xba
> [<c010b390>] do_IRQ+0xec/0xf8
> [<c0106eca>] default_idle+0x0/0x2e
read_lock_bh(&hh->hh_lock);
On Wed, Jan 15, 2003 at 08:18:13PM -0800, Dave Hansen wrote:
> time:_raw_write_lock() 1350000
> Call Trace:
> [<c010f321>] timer_interrupt+0x99/0x9c
> [<c010b150>] handle_IRQ_event+0x38/0x5c
> [<c010b330>] do_IRQ+0x8c/0xf8
> [<c0106eca>] default_idle+0x0/0x2e
> [<c0106eca>] default_idle+0x0/0x2e
> [<c0109c8c>] common_interrupt+0x18/0x20
> [<c0106eca>] default_idle+0x0/0x2e
> [<c0106eca>] default_idle+0x0/0x2e
> [<c0106ef4>] default_idle+0x2a/0x2e
> [<c0106f6b>] cpu_idle+0x39/0x42
> [<c01212a5>] printk+0x15d/0x190
read_lock_irqsave(&xtime_lock, flags)
or
write_lock_irq(&xtime_lock);
Bill
On Thu, Jan 16, 2003 at 04:47:30AM +0000, William Lee Irwin III wrote:
> On Wed, Jan 15, 2003 at 08:18:13PM -0800, Dave Hansen wrote:
> > file_table:_raw_read_lock() 3300000
> > Call Trace:
> > [<c0152469>] fget+0x9d/0xa0
> > [<c0152b27>] sys_fsync+0x21/0xbe
> > [<c0151b53>] sys_writev+0x47/0x56
> > [<c010931f>] syscall_call+0x7/0xb
>
> read_lock(&file->files_lock);
You mean read_lock(&files->file_lock); :)
Dave, does your webserver benchmark clone() tasks with CLONE_FILES ? Unless
the fd table is shared, can't see why there would be contention on this.
If it is indeed necessary to share fd table, then there is a somewhat
unmaintained lockfree fget() patch (files_struct_rcu) that you might want
to try.
>
> On Wed, Jan 15, 2003 at 08:18:13PM -0800, Dave Hansen wrote:
> > time:_raw_write_lock() 1350000
> > Call Trace:
> > [<c010f321>] timer_interrupt+0x99/0x9c
> > [<c010b150>] handle_IRQ_event+0x38/0x5c
>
> read_lock_irqsave(&xtime_lock, flags)
> or
> write_lock_irq(&xtime_lock);
ISTR a patch from Stephen Hemminger at OSDL that used Andrea's
sequence number trick based rwlock (frlock) to implement do_gettimeofday.
It might be relevant here.
Thanks
Dipankar
>> On Wed, Jan 15, 2003 at 08:18:13PM -0800, Dave Hansen wrote:
>> > time:_raw_write_lock() 1350000
>> > Call Trace:
>> > [<c010f321>] timer_interrupt+0x99/0x9c
>> > [<c010b150>] handle_IRQ_event+0x38/0x5c
>>
>> read_lock_irqsave(&xtime_lock, flags)
>> or
>> write_lock_irq(&xtime_lock);
>
> ISTR a patch from Stephen Hemminger at OSDL that used Andrea's
> sequence number trick based rwlock (frlock) to implement do_gettimeofday.
> It might be relevant here.
I thought that was Andi. If it's the patch I'm thinking of, it's sitting
in the -mm tree.
M.
Hello,
On Thu, Jan 16, 2003 at 12:29:41PM +0530, Dipankar Sarma wrote:
> On Thu, Jan 16, 2003 at 04:47:30AM +0000, William Lee Irwin III wrote:
> > On Wed, Jan 15, 2003 at 08:18:13PM -0800, Dave Hansen wrote:
> > > file_table:_raw_read_lock() 3300000
> > > Call Trace:
> > > [<c0152469>] fget+0x9d/0xa0
> > > [<c0152b27>] sys_fsync+0x21/0xbe
> > > [<c0151b53>] sys_writev+0x47/0x56
> > > [<c010931f>] syscall_call+0x7/0xb
> >
> > read_lock(&file->files_lock);
>
> You mean read_lock(&files->file_lock); :)
>
> Dave, does your webserver benchmark clone() tasks with CLONE_FILES ? Unless
> the fd table is shared, can't see why there would be contention on this.
> If it is indeed necessary to share fd table, then there is a somewhat
> unmaintained lockfree fget() patch (files_struct_rcu) that you might want
> to try.
>
> >
> > On Wed, Jan 15, 2003 at 08:18:13PM -0800, Dave Hansen wrote:
> > > time:_raw_write_lock() 1350000
> > > Call Trace:
> > > [<c010f321>] timer_interrupt+0x99/0x9c
> > > [<c010b150>] handle_IRQ_event+0x38/0x5c
> >
> > read_lock_irqsave(&xtime_lock, flags)
> > or
> > write_lock_irq(&xtime_lock);
>
> ISTR a patch from Stephen Hemminger at OSDL that used Andrea's
> sequence number trick based rwlock (frlock) to implement do_gettimeofday.
I'm merging a version of Stephen's patch right now (the starvation of
the read locks that we rely when we don't clear irqs in read_locks that
can run from irqs too seems to hurt too much on some hardware/workload
combination to a point that it even lose ticks with the irq stuck, and
the frlock is the most efficient and scalable possible locking design
for gettimeofday and it will solve the starvation too, very good patch
Stephen).
While merging it I found a problem, not sure if it helps for your crash
but while checking it I found a quite fatal bug here:
+static inline unsigned fr_read_begin(frlock_t *rw)
+{
+ rmb();
+ return rw->post_sequence;
+
+}
the rmb() must be placed after (not before) reading the post_sequence.
The above bug could trigger on x86 too because it should even allow the
compiler to reorder stuff and the x86 can read speculative even if the
compiler doesn't reorder. This at the very least can explain screwed
timing results with such patch applied.
Andrea
Your analysis looks right, also, look at the ordering of the updates of
pre/post. They should match the vsyscall version on x86.
On Fri, 2003-01-17 at 08:17, Andrea Arcangeli wrote:
> Hello,
>
> On Thu, Jan 16, 2003 at 12:29:41PM +0530, Dipankar Sarma wrote:
> > On Thu, Jan 16, 2003 at 04:47:30AM +0000, William Lee Irwin III wrote:
> > > On Wed, Jan 15, 2003 at 08:18:13PM -0800, Dave Hansen wrote:
> > > > file_table:_raw_read_lock() 3300000
> > > > Call Trace:
> > > > [<c0152469>] fget+0x9d/0xa0
> > > > [<c0152b27>] sys_fsync+0x21/0xbe
> > > > [<c0151b53>] sys_writev+0x47/0x56
> > > > [<c010931f>] syscall_call+0x7/0xb
> > >
> > > read_lock(&file->files_lock);
> >
> > You mean read_lock(&files->file_lock); :)
> >
> > Dave, does your webserver benchmark clone() tasks with CLONE_FILES ? Unless
> > the fd table is shared, can't see why there would be contention on this.
> > If it is indeed necessary to share fd table, then there is a somewhat
> > unmaintained lockfree fget() patch (files_struct_rcu) that you might want
> > to try.
> >
> > >
> > > On Wed, Jan 15, 2003 at 08:18:13PM -0800, Dave Hansen wrote:
> > > > time:_raw_write_lock() 1350000
> > > > Call Trace:
> > > > [<c010f321>] timer_interrupt+0x99/0x9c
> > > > [<c010b150>] handle_IRQ_event+0x38/0x5c
> > >
> > > read_lock_irqsave(&xtime_lock, flags)
> > > or
> > > write_lock_irq(&xtime_lock);
> >
> > ISTR a patch from Stephen Hemminger at OSDL that used Andrea's
> > sequence number trick based rwlock (frlock) to implement do_gettimeofday.
>
> I'm merging a version of Stephen's patch right now (the starvation of
> the read locks that we rely when we don't clear irqs in read_locks that
> can run from irqs too seems to hurt too much on some hardware/workload
> combination to a point that it even lose ticks with the irq stuck, and
> the frlock is the most efficient and scalable possible locking design
> for gettimeofday and it will solve the starvation too, very good patch
> Stephen).
>
> While merging it I found a problem, not sure if it helps for your crash
> but while checking it I found a quite fatal bug here:
>
> +static inline unsigned fr_read_begin(frlock_t *rw)
> +{
> + rmb();
> + return rw->post_sequence;
> +
> +}
>
> the rmb() must be placed after (not before) reading the post_sequence.
> The above bug could trigger on x86 too because it should even allow the
> compiler to reorder stuff and the x86 can read speculative even if the
> compiler doesn't reorder. This at the very least can explain screwed
> timing results with such patch applied.
>
> Andrea
--
Stephen Hemminger <[email protected]>
Open Source Devlopment Lab