Hello list,
I have a P3, 256MB RAM system with 3 IDE disks attached, 2 identical
ones as hda and hdc (primary and secondary master), and the disc with
the OS partitions as primary slave hdb. For more info please refer to
the attached dmesg.txt. I attach several oprofile outputs that describe
various circumstances referenced later. The script I used to get them is
the attached script.sh.
The problem was encountered when I started two processes doing heavy I/O
on hda and hdc, "badblocks -v -w /dev/hda" and "badblocks -v -w
/dev/hdc". At the beginning (two_discs.txt) everything was fine and
vmstat reported more than 90% iowait CPU load. However, after a while,
when some cron jobs kicked in (two_discs_bad.txt), the picture changed
completely: the CPU load was now about 60% system, and the rest was user
CPU load, presumably going to the simple cron jobs.
Even though under normal circumstances (for example when running
badblocks on only one disc; see one_disc.txt) the cron jobs finish almost
instantaneously, this time they simply never finished, and every 10
minutes or so more and more jobs were being added to the process table.
One day later, vmstat still reported about 60/40 system/user CPU load,
all the processes were still running (hundreds of them), and the load
average was huge! Another day later the OOM killer kicked in and killed
various processes, yet it never touched any badblocks process. Indeed,
manually suspending one badblocks process remedies the situation: within
a few seconds the process table is cleared of cron jobs, CPU usage is
back to 2-3% user and ~90% iowait, and the system is normally responsive
again. This happens no matter which badblocks process I suspend, whether
the one on hda or the one on hdc.
Any ideas about what could be wrong? I should note that the kernel is my
distro's default. As the problem seems to be scheduler-specific I didn't
bother to compile a vanilla kernel, since the applied patches seem
irrelevant:
http://archlinux.org/packages/4197/
http://cvs.archlinux.org/cgi-bin/viewcvs.cgi/kernels/kernel26/?cvsroot=Current&only_with_tag=CURRENT
Thanks in advance,
Dimitris
P.S.1: Please CC me directly as I'm not subscribed
P.S.2: Keep in mind that the problematic oprofile outputs probably cover
a much longer time span than 5 seconds, since due to the high load the
script was taking long to complete.
P.S.3: I couldn't find anywhere in the kernel documentation that setting
nmi_watchdog=0 is necessary for oprofile to work correctly. However,
Documentation/nmi_watchdog.txt mentions that oprofile should disable the
nmi_watchdog automatically, which doesn't happen with the latest kernel.
Hello again,
was my report so complicated? Perhaps I shouldn't have included so many
oprofile outputs. Anyway, if anyone wants to have a look, the most
important one is the two_discs_bad.txt oprofile output, attached to my
original message. The problem is 100% reproducible for me, so I would
appreciate hearing from anyone with similar experiences.
Thanks,
Dimitris
On Friday 03 August 2007 19:03:09 Dimitrios Apostolou wrote:
> Hello list,
>
> I have a P3, 256MB RAM system with 3 IDE disks attached, 2 identical
> ones as hda and hdc (primary and secondary master), and the disc with
> the OS partitions as primary slave hdb. For more info please refer to
> the attached dmesg.txt. I attach several oprofile outputs that describe
> various circumstances referenced later. The script I used to get them is
> the attached script.sh.
>
> The problem was encountered when I started two processes doing heavy I/O
> on hda and hdc, "badblocks -v -w /dev/hda" and "badblocks -v -w
> /dev/hdc". At the beginning (two_discs.txt) everything was fine and
> vmstat reported more than 90% iowait CPU load. However, after a while
> (two_discs_bad.txt) that some cron jobs kicked in, the image changed
> completely: the cpu load was now about 60% system, and the rest was user
> cpu load possibly going to the simple cron jobs.
>
> Even though under normal circumstances (for example when running
> badblocks on only one disc (one_disc.txt)) the cron jobs finish almost
> instantaneously, this time they were simply never ending and every 10
> minutes or so more and more jobs were being added to the process table.
> One day later, the vmstat still reports about 60/40 system/user cpu load,
> all processes still run (hundreds of them), and the load average is huge!
>
> Another day later the OOM killer kicks in and kills various processes,
> however never touches any badblocks process. Indeed, manually suspending
> one badblocks process remedies the situation: within some seconds the
> process table is cleared from cron jobs, cpu usage is back to 2-3% user
> and ~90% iowait and the system is normally responsive again. This
> happens no matter which badblocks process I suspend, being hda or hdc.
>
> Any ideas about what could be wrong? I should note that the kernel is my
> distro's default. As the problem seems to be scheduler specific I didn't
> bother to compile a vanilla kernel, since the applied patches seem
> irrelevant:
>
> http://archlinux.org/packages/4197/
> http://cvs.archlinux.org/cgi-bin/viewcvs.cgi/kernels/kernel26/?cvsroot=Current&only_with_tag=CURRENT
>
>
> Thank in advance,
> Dimitris
>
>
> P.S.1: Please CC me directly as I'm not subscribed
>
> P.S.2: Keep in mind that the problematic oprofile outputs probably refer
> to much longer time than 5 sec, since due to high load the script was
> taking long to complete.
>
> P.S.3: I couldn't find anywhere in kernel documentation that setting
> nmi_watchdog=0 is neccessary for oprofile to work correctly. However,
> Documentation/nmi_watchdog.txt mentions that oprofile should disable the
> nmi_watchdog automatically, which doesn't happen with the latest kernel.
> Hello again,
Hi!
> was my report so complicated? Perhaps I shouldn't have included so many
> oprofile outputs. Anyway, if anyone wants to have a look, the most important
> is two_discs_bad.txt oprofile output, attached on my original message. The
> problem is 100% reproducible for me so I would appreciate if anyone told me
> he has similar experiences.
Probably nobody replied to Your message because people on this list
think that Your problem isn't kernel-related. At the moment I'm using
"Arch Linux" too, so I checked the /etc/cron directory. The simple jobs
You are talking about are not so simple:
- update the "locate" database,
- update the "whatis" database.
Both jobs scan the "/" partition. I don't know how dcron works, but I
can imagine a situation in which it polls cron.daily, says "hey, it
wasn't done today yet", and starts the same jobs over and over again.
More and more tasks scan the "/" partition, and as a result access gets
slower and slower.
>
> Thanks,
> Dimitris
Let me know if I'm wrong
Rafał
On Sunday 05 August 2007 20:58:15 Rafał Bilski wrote:
> > Hello again,
>
> Hi!
>
> > was my report so complicated? Perhaps I shouldn't have included so many
> > oprofile outputs. Anyway, if anyone wants to have a look, the most
> > important is two_discs_bad.txt oprofile output, attached on my original
> > message. The problem is 100% reproducible for me so I would appreciate if
> > anyone told me he has similar experiences.
>
> Probably nobody replied to Your message because people at this list think
> that Your problem isn't kernel related. In this moment I'm using "Arch
> Linux" too, so I checked /etc/cron directory. There simple jobs You are
> talking about are not so simple:
> - update the "locate" database,
> - update the "whatis" database.
> Both jobs are scaning "/" partition. I don't know how dcron works, but I
> can imagine situation in which it is polling cron.daily and says: "hey it
> wasn't done today yet" and it is starting same jobs over and over again.
> More and more tasks scans the "/" partition and in result access is slower
> and slower.
Hello and thanks for your reply.
The cron job that runs every 10 min on my system is mpop (a
fetchmail-like program), and another that runs every 5 min is mrtg. Both
normally finish within 1-2 seconds.
The fact that these simple cron jobs never finish is certainly due to
the high system CPU load. If you look at the two_discs_bad.txt I attached
to my original message, you'll see that *vmlinux*, and specifically the
*scheduler*, takes up most of the time.
And the fact that this happens only when running two I/O processes, while
with only one everything is absolutely snappy (not at all slow; see
one_disc.txt), makes me believe this is a kernel bug. I'd be happy to
help, but I need some guidance to pinpoint the problem.
Thanks,
Dimitris
> Hello and thanks for your reply.
Hi,
> The cron job that is running every 10 min on my system is mpop (a
> fetchmail-like program) and another running every 5 min is mrtg. Both
> normally finish within 1-2 seconds.
>
> The fact that these simple cron jobs don't finish ever is certainly because of
> the high system CPU load. If you see the two_discs_bad.txt which I attached
> on my original message, you'll see that *vmlinux*, and specifically the
> *scheduler*, take up most time.
>
> And the fact that this happens only when running two i/o processes but when
> running only one everything is absolutely snappy (not at all slow, see
> one_disc.txt), makes me sure that this is a kernel bug. I'd be happy to help
> but I need some guidance to pinpoint the problem.
OK, but first can You try to fix Your cron daemon? Just make sure that if mpop
is already started it won't be started again; maybe something like "pgrep mpop"
followed by an "if [ $? ..." test.
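A minimal sketch of that guard, as a wrapper the cron entry could call instead of the job directly (the job name and the messages are illustrative, not taken from the actual crontab):

```shell
#!/bin/sh
# Hypothetical run-once guard for a cron entry: if an instance of the job
# is already running, exit quietly instead of starting another one, so
# stalled runs cannot pile up in the process table.
job=mpop
if pgrep -x "$job" > /dev/null 2>&1; then
    echo "$job still running, skipping this cron cycle"
    exit 0
fi
echo "no running $job found, starting it"
# exec "$job"    # a real wrapper would start the job here
```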
I don't remember exactly, but some time ago somebody had a problem with
too-large disk buffers and sync(). Check the LKML archives; mpop does an
fsync().
You have a VIA chipset. Me too. It isn't very reliable. Don't You have
something like "error { d0 BUSY }" in dmesg? That would explain the high
CPU load: DMA simply isn't used after such an error and the disk drops to
PIO mode. On a two-disk system the load is about 4.0 in this case, and a
simple program takes hours to complete if heavy I/O is in progress. Btw.
SLUB seems to behave better in this situation (at least up to a load of
8.0).
> Thanks,
> Dimitris
Regards
Rafał
On Sun, 5 Aug 2007 19:03:12 +0300 Dimitrios Apostolou <[email protected]> wrote:
> was my report so complicated?
We're bad.
Seems that your context switch rate when running two instances of
badblocks against two different disks went batshit insane. It doesn't
happen here.
Please capture the `vmstat 1' output while running the problematic
workload.
The oom-killing could have been unrelated to the CPU load problem. iirc
badblocks uses a lot of memory, so it might have been genuine. Keep an eye
on the /proc/meminfo output and send the kernel dmesg output from the
oom-killing event.
procs -----------memory---------- ---swap-- -----io---- -system-- ----cpu----
r b swpd free buff cache si so bi bo in cs us sy id wa
2 2 0 25428 123288 69092 0 0 7 27 44 19 3 0 96 0
2 2 0 25420 123288 69096 0 0 0 15744 273 421 0 7 0 93
2 2 0 25420 123288 69096 0 0 0 15872 276 429 0 4 0 96
2 2 0 25420 123288 69096 0 0 0 15872 273 394 0 2 0 98
2 2 0 25420 123288 69096 0 0 0 15872 277 430 0 2 0 98
1 2 0 25420 123288 69096 0 0 0 16000 273 496 2 10 0 88
2 2 0 25360 123292 69096 0 0 0 15996 288 508 0 4 0 96
2 2 0 25360 123292 69096 0 0 0 16000 283 487 0 3 0 97
2 2 0 25360 123292 69096 0 0 0 15872 279 452 0 2 0 98
2 2 0 25360 123292 69096 0 0 0 15872 283 442 0 2 0 98
Mon Aug 6 17:32:15 EEST 2007
Using default event: CPU_CLK_UNHALTED:100000:0:1:1
Daemon started.
Profiler running.
Stopping profiling.
Killing daemon.
CPU: PIII, speed 798.031 MHz (estimated)
Counted CPU_CLK_UNHALTED events (clocks processor is not halted) with a unit mask of 0x00 (No unit mask) count 100000
CPU_CLK_UNHALT...|
samples| %|
------------------
1893 43.9415 processor
1409 32.7066 vmlinux
346 8.0316 libc-2.6.so
217 5.0371 bash
CPU_CLK_UNHALT...|
samples| %|
------------------
216 99.5392 bash
1 0.4608 anon (tgid:29407 range:0xb7f90000-0xb7f91000)
141 3.2730 ld-2.6.so
105 2.4373 ide_core
43 0.9981 ext3
42 0.9749 ISO8859-1.so
24 0.5571 jbd
23 0.5339 oprofiled
CPU_CLK_UNHALT...|
samples| %|
------------------
22 95.6522 oprofiled
1 4.3478 anon (tgid:29402 range:0xb7f1d000-0xb7f1e000)
18 0.4178 oprofile
9 0.2089 reiserfs
6 0.1393 libcrypto.so.0.9.8
5 0.1161 grep
4 0.0929 locale-archive
3 0.0696 gawk
3 0.0696 badblocks
3 0.0696 screen-4.0.3
3 0.0696 imap-login
CPU_CLK_UNHALT...|
samples| %|
------------------
1 33.3333 imap-login
1 33.3333 anon (tgid:28108 range:0xb7eea000-0xb7eeb000)
1 33.3333 anon (tgid:3944 range:0xb7fc8000-0xb7fc9000)
2 0.0464 libpthread-2.6.so
1 0.0232 less
1 0.0232 ls
1 0.0232 rm
1 0.0232 sleep
1 0.0232 libext2fs.so.2.4
1 0.0232 libncurses.so.5.6
1 0.0232 init
1 0.0232 sshd
1 0.0232 syslog-ng
CPU: PIII, speed 798.031 MHz (estimated)
Counted CPU_CLK_UNHALTED events (clocks processor is not halted) with a unit mask of 0x00 (No unit mask) count 100000
samples % symbol name
162 11.4975 do_wp_page
147 10.4329 native_safe_halt
49 3.4776 __blockdev_direct_IO
45 3.1938 __handle_mm_fault
34 2.4131 put_page
33 2.3421 get_page_from_freelist
27 1.9163 page_fault
24 1.7033 unmap_vmas
22 1.5614 __bio_add_page
20 1.4194 acpi_pm_read
20 1.4194 follow_page
19 1.3485 copy_process
19 1.3485 find_get_page
18 1.2775 __link_path_walk
17 1.2065 irq_entries_start
16 1.1356 delay_tsc
15 1.0646 __d_lookup
15 1.0646 mask_and_ack_8259A
14 0.9936 copy_page_range
14 0.9936 kmem_cache_alloc
14 0.9936 page_address
13 0.9226 dio_send_cur_page
13 0.9226 find_vma
13 0.9226 strnlen_user
12 0.8517 do_page_fault
12 0.8517 radix_tree_lookup
12 0.8517 submit_page_section
11 0.7807 acpi_os_read_port
11 0.7807 schedule
10 0.7097 __copy_to_user_ll
10 0.7097 __find_get_block
10 0.7097 dio_get_page
10 0.7097 do_mmap_pgoff
10 0.7097 native_io_delay
9 0.6388 blk_recount_segments
9 0.6388 filemap_nopage
8 0.5678 bio_add_page
8 0.5678 dio_bio_add_page
8 0.5678 load_elf_binary
8 0.5678 memcpy
7 0.4968 pit_next_event
6 0.4258 acpi_os_write_port
6 0.4258 bio_alloc_bioset
6 0.4258 free_pgtables
5 0.3549 kmem_cache_free
5 0.3549 mark_page_accessed
5 0.3549 restore_nocheck
5 0.3549 strncpy_from_user
5 0.3549 sysenter_past_esp
4 0.2839 __copy_from_user_ll
4 0.2839 __inc_zone_state
4 0.2839 __mutex_lock_slowpath
4 0.2839 _spin_lock_irqsave
4 0.2839 blk_rq_map_sg
4 0.2839 current_fs_time
4 0.2839 end_that_request_last
4 0.2839 error_code
4 0.2839 generic_make_request
4 0.2839 generic_permission
4 0.2839 get_user_pages
4 0.2839 getnstimeofday
4 0.2839 hweight32
4 0.2839 max_block
4 0.2839 sched_clock
4 0.2839 try_to_wake_up
4 0.2839 vfs_read
4 0.2839 vma_merge
3 0.2129 __alloc_pages
3 0.2129 __find_get_block_slow
3 0.2129 __fput
3 0.2129 bio_init
3 0.2129 cache_alloc_refill
3 0.2129 cfq_insert_request
3 0.2129 cond_resched
3 0.2129 do_sys_open
3 0.2129 dput
3 0.2129 flush_tlb_mm
3 0.2129 free_page_and_swap_cache
3 0.2129 generic_file_aio_read
3 0.2129 get_next_timer_interrupt
3 0.2129 hrtimer_get_next_event
3 0.2129 init_request_from_bio
3 0.2129 internal_add_timer
3 0.2129 kmem_cache_zalloc
3 0.2129 ktime_get_ts
3 0.2129 page_remove_rmap
3 0.2129 permission
3 0.2129 set_page_dirty_lock
3 0.2129 sys_close
3 0.2129 sys_mprotect
3 0.2129 touch_atime
3 0.2129 unlock_page
3 0.2129 zone_watermark_ok
2 0.1419 __add_entropy_words
2 0.1419 __atomic_notifier_call_chain
2 0.1419 __dec_zone_state
2 0.1419 __do_page_cache_readahead
2 0.1419 __end_that_request_first
2 0.1419 __make_request
2 0.1419 __switch_to
2 0.1419 __wake_up_bit
2 0.1419 _atomic_dec_and_lock
2 0.1419 anon_vma_prepare
2 0.1419 bio_get_nr_vecs
2 0.1419 copy_strings
2 0.1419 copy_to_user
2 0.1419 d_alloc
2 0.1419 dio_bio_complete
2 0.1419 dnotify_parent
2 0.1419 do_notify_resume
2 0.1419 do_sigaction
2 0.1419 do_sys_poll
2 0.1419 down_read_trylock
2 0.1419 elv_dispatch_sort
2 0.1419 enqueue_hrtimer
2 0.1419 find_busiest_group
2 0.1419 find_vma_prepare
2 0.1419 flush_tlb_page
2 0.1419 free_uid
2 0.1419 generic_fillattr
2 0.1419 kmap_atomic_prot
2 0.1419 native_flush_tlb_single
2 0.1419 new_inode
2 0.1419 path_release
2 0.1419 pipe_write
2 0.1419 prio_tree_insert
2 0.1419 proc_lookup
2 0.1419 quicklist_trim
2 0.1419 rb_erase
2 0.1419 rb_insert_color
2 0.1419 recalc_task_prio
2 0.1419 restore_sigcontext
2 0.1419 ret_from_intr
2 0.1419 rw_verify_area
2 0.1419 sys_brk
2 0.1419 task_rq_lock
2 0.1419 tick_nohz_update_jiffies
2 0.1419 update_wall_time
2 0.1419 vm_normal_page
2 0.1419 vma_link
2 0.1419 vma_prio_tree_add
2 0.1419 wake_up_new_task
1 0.0710 __anon_vma_link
1 0.0710 __copy_user_intel
1 0.0710 __d_path
1 0.0710 __dentry_open
1 0.0710 __free_pages_ok
1 0.0710 __freed_request
1 0.0710 __get_user_4
1 0.0710 __init_rwsem
1 0.0710 __kmalloc
1 0.0710 __page_set_anon_rmap
1 0.0710 __pagevec_lru_add
1 0.0710 __pagevec_lru_add_active
1 0.0710 __rb_rotate_left
1 0.0710 __sk_dst_check
1 0.0710 __tasklet_schedule
1 0.0710 __vm_enough_memory
1 0.0710 __vma_link_rb
1 0.0710 __wait_on_bit
1 0.0710 __wake_up
1 0.0710 __wake_up_common
1 0.0710 _d_rehash
1 0.0710 acpi_hw_low_level_read
1 0.0710 acpi_hw_register_read
1 0.0710 add_timer_randomness
1 0.0710 alloc_pid
1 0.0710 anon_vma_unlink
1 0.0710 atomic_notifier_call_chain
1 0.0710 attach_pid
1 0.0710 bio_alloc
1 0.0710 bit_waitqueue
1 0.0710 blk_backing_dev_unplug
1 0.0710 blk_queue_bounce
1 0.0710 block_llseek
1 0.0710 block_read_full_page
1 0.0710 cache_reap
1 0.0710 cached_lookup
1 0.0710 call_rcu
1 0.0710 cfq_may_queue
1 0.0710 cfq_merge
1 0.0710 cfq_queue_empty
1 0.0710 clockevents_program_event
1 0.0710 cp_new_stat64
1 0.0710 cpu_idle
1 0.0710 create_read_pipe
1 0.0710 current_io_context
1 0.0710 d_path
1 0.0710 debug_mutex_free_waiter
1 0.0710 del_timer
1 0.0710 dequeue_signal
1 0.0710 dequeue_task
1 0.0710 detach_pid
1 0.0710 dio_complete
1 0.0710 dio_new_bio
1 0.0710 dio_zero_block
1 0.0710 disk_round_stats
1 0.0710 dnotify_flush
1 0.0710 do_IRQ
1 0.0710 do_brk
1 0.0710 do_exit
1 0.0710 do_fork
1 0.0710 do_generic_mapping_read
1 0.0710 do_syslog
1 0.0710 do_wait
1 0.0710 dummy_inode_alloc_security
1 0.0710 dummy_vm_enough_memory
1 0.0710 dup_fd
1 0.0710 effective_prio
1 0.0710 elv_insert
1 0.0710 elv_rqhash_add
1 0.0710 enable_8259A_irq
1 0.0710 exit_aio
1 0.0710 exit_mmap
1 0.0710 exit_thread
1 0.0710 fget
1 0.0710 fget_light
1 0.0710 filp_close
1 0.0710 find_extend_vma
1 0.0710 find_next_bit
1 0.0710 find_next_zero_bit
1 0.0710 find_vma_prev
1 0.0710 flush_signal_handlers
1 0.0710 fput
1 0.0710 free_hot_page
1 0.0710 free_pages_bulk
1 0.0710 free_pgd_range
1 0.0710 free_pipe_info
1 0.0710 generic_drop_inode
1 0.0710 generic_file_buffered_write
1 0.0710 generic_file_open
1 0.0710 generic_unplug_device
1 0.0710 get_empty_filp
1 0.0710 get_index
1 0.0710 get_nr_files
1 0.0710 get_request
1 0.0710 get_signal_to_deliver
1 0.0710 get_unused_fd
1 0.0710 getname
1 0.0710 half_md4_transform
1 0.0710 hrtimer_init
1 0.0710 hrtimer_start
1 0.0710 init_new_context
1 0.0710 init_waitqueue_head
1 0.0710 inode_setattr
1 0.0710 inotify_d_instantiate
1 0.0710 install_special_mapping
1 0.0710 ktime_get
1 0.0710 kunmap_atomic
1 0.0710 lapic_next_event
1 0.0710 lock_hrtimer_base
1 0.0710 lock_timer_base
1 0.0710 locks_remove_posix
1 0.0710 may_open
1 0.0710 memory_open
1 0.0710 mempool_alloc
1 0.0710 mempool_alloc_slab
1 0.0710 mm_init
1 0.0710 mm_release
1 0.0710 mntput_no_expire
1 0.0710 native_set_pte_at
1 0.0710 notifier_call_chain
1 0.0710 ns_to_timespec
1 0.0710 open_exec
1 0.0710 open_namei
1 0.0710 page_waitqueue
1 0.0710 percpu_counter_mod
1 0.0710 pipe_release
1 0.0710 proc_flush_task
1 0.0710 proc_sys_lookup_table_one
1 0.0710 profile_munmap
1 0.0710 radix_tree_insert
1 0.0710 radix_tree_preload
1 0.0710 rb_first
1 0.0710 read_chan
1 0.0710 recalc_bh_state
1 0.0710 recalc_sigpending_tsk
1 0.0710 release_task
1 0.0710 resched_task
1 0.0710 resume_kernel
1 0.0710 resume_userspace
1 0.0710 rt_set_nexthop
1 0.0710 secure_ip_id
1 0.0710 select_nohz_load_balancer
1 0.0710 send_signal
1 0.0710 seq_path
1 0.0710 seq_printf
1 0.0710 set_page_dirty
1 0.0710 should_remove_suid
1 0.0710 sigprocmask
1 0.0710 sock_aio_write
1 0.0710 submit_bio
1 0.0710 sys_fstat64
1 0.0710 sys_llseek
1 0.0710 sys_poll
1 0.0710 sys_rt_sigprocmask
1 0.0710 tcp_poll
1 0.0710 tick_do_update_jiffies64
1 0.0710 tick_nohz_restart_sched_tick
1 0.0710 tick_nohz_stop_sched_tick
1 0.0710 tick_sched_timer
1 0.0710 tty_ldisc_try
1 0.0710 tty_wakeup
1 0.0710 vfs_getattr
1 0.0710 vfs_llseek
1 0.0710 vfs_permission
1 0.0710 vfs_write
1 0.0710 vmtruncate
1 0.0710 vsnprintf
1 0.0710 wake_up_bit
1 0.0710 wake_up_process
1 0.0710 write_cache_pages
> Hello and thanks for your reply.
Hello again,
> The cron job that is running every 10 min on my system is mpop (a
> fetchmail-like program) and another running every 5 min is mrtg. Both
> normally finish within 1-2 seconds.
>
> The fact that these simple cron jobs don't finish ever is certainly because of
> the high system CPU load. If you see the two_discs_bad.txt which I attached
> on my original message, you'll see that *vmlinux*, and specifically the
> *scheduler*, take up most time.
>
> And the fact that this happens only when running two i/o processes but when
> running only one everything is absolutely snappy (not at all slow, see
> one_disc.txt), makes me sure that this is a kernel bug. I'd be happy to help
> but I need some guidance to pinpoint the problem.
In Your oprofile output I find "acpi_pm_read" particularly interesting.
Unlike the other VIA chipsets I know, Yours doesn't use VLink to connect
the northbridge to the southbridge; instead the PCI bus connects the two.
As You probably know, the maximal PCI throughput is 133MiB/s. In theory.
In practice probably less.
The ACPI registers are located on the southbridge. This probably means
that the processor needs access to the PCI bus in order to read the ACPI
timer register.
Now some math. The 20GiB disk can probably send data at a 20MiB/s rate,
and each 200GiB disk at about 40MiB/s. So 20+2*40=100MiB/s. I think this
could explain why a simple inl() call takes so much time and why Your
system isn't very responsive.
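The arithmetic above, spelled out (the per-disk figures are estimates from the paragraph, not measurements):

```shell
# Back-of-envelope PCI budget in MiB/s: the small disk plus two big
# disks, compared against the ~133 MiB/s theoretical PCI maximum.
small=20
big=40
total=$(( small + 2 * big ))
echo "worst case ${total} MiB/s of a theoretical 133 MiB/s"
```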
> Thanks,
> Dimitris
Let me know if You find my theory amazing or amusing.
Rafał
On Mon, 06 Aug 2007 16:20:30 +0200 Dimitrios Apostolou <[email protected]> wrote:
> Andrew Morton wrote:
> > On Sun, 5 Aug 2007 19:03:12 +0300 Dimitrios Apostolou <[email protected]> wrote:
> >
> >> was my report so complicated?
> >
> > We're bad.
> >
> > Seems that your context switch rate when running two instances of
> > badblocks against two different disks went batshit insane. It doesn't
> > happen here.
> >
> > Please capture the `vmstat 1' output while running the problematic
> > workload.
> >
> > The oom-killing could have been unrelated to the CPU load problem. iirc
> > badblocks uses a lot of memory, so it might have been genuine. Keep an eye
> > on the /proc/meminfo output and send the kernel dmesg output from the
> > oom-killing event.
>
> Please see the attached files. Unfortunately I don't see any useful info
> in them:
> *_before: before running any badblocks process
> *_while: while running badblocks process, but without any cron job
> having kicked in
> *_bad: 5 minutes later that some cron jobs kicked in
>
> About the OOM killer, indeed I believe that it is unrelated. It started
> killing after about 2 days, that hundreds of processes were stuck as
> running and taking up memory, so I suppose the 256 MB RAM were truly
> filled. I just mentioned it because its behaviour is completely
> non-helpful. It doesn't touch the badblocks process, it rarely touches
> the stuck as running cron jobs, but it kills other irrelevant processes.
> If you still want the killing logs, tell me and I'll search for them.
ah. Your context-switch rate during the dual-badblocks run is not high at
all.
I suspect I was fooled by the oprofile output, which showed tremendous
amounts of load in schedule() and switch_to(). The percentages which
opreport shows are the percentage of non-halted CPU time. So if you have a
function in the kernel which is using 1% of the total CPU, and the CPU is
halted for 95% of the time, it appears that the function is taking 20% of
CPU!
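The distortion can be checked with the hypothetical figures above: 1% of total CPU time, with the CPU halted 95% of the time, is 1/(100-95) = 20% of the non-halted samples opreport counts.

```shell
# A function using 1% of total CPU, on a CPU halted 95% of the time,
# shows up as 100*1/(100-95) = 20% of non-halted samples; prints "20%".
awk 'BEGIN { use = 1; halted = 95; printf "%d%%\n", 100 * use / (100 - halted) }'
```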
The fix for that is to boot with the "idle=poll" boot parameter, to make
the CPU spin when it has nothing else to do.
I'm suspecting that your machine is just stuck in D state waiting for disk.
Did we have a sysrq-T trace?
Rafał Bilski wrote:
>> Hello and thanks for your reply.
> Hello again,
>> The cron job that is running every 10 min on my system is mpop (a
>> fetchmail-like program) and another running every 5 min is mrtg. Both
>> normally finish within 1-2 seconds.
>> The fact that these simple cron jobs don't finish ever is certainly
>> because of the high system CPU load. If you see the two_discs_bad.txt
>> which I attached on my original message, you'll see that *vmlinux*,
>> and specifically the *scheduler*, take up most time.
>> And the fact that this happens only when running two i/o processes but
>> when running only one everything is absolutely snappy (not at all
>> slow, see one_disc.txt), makes me sure that this is a kernel bug. I'd
>> be happy to help but I need some guidance to pinpoint the problem.
> In Your oprofile output I find "acpi_pm_read" particulary interesting.
> Unlike other VIA chipsets, which I know, Your doesn't use VLink to
> connect northbridge to southbridge. Instead PCI bus connects these two.
> As You probably know maximal PCI throughtput is 133MiB/s. In theory. In
> practice probably less.
> ACPI registers are located on southbridge. This probably means that
> processor needs access to PCI bus in order to read ACPI timer register.
> Now some math. 20GiB disk probably can send data at 20MiB/s rate. 200GiB
> disk probably about 40MiB/s. So 20+2*40=100MiB/s. I think that this
> could explain why simple inl() call takes so much time and why Your
> system isn't very responsive.
>> Thanks, Dimitris
> Let me know if You find my theory amazing or amusing.
Hello Rafal,
I find your theory very nice, but unfortunately I don't think it applies
here. As you can see from the vmstat outputs, the write throughput is
about 15MB/s for both disks; when reading I get about 30MB/s, again from
both disks. The other disk, the small one, is mostly idle, except for
writing little bits and bytes now and then. Since the problem occurs
when writing, I think 15MB/s is just too little for the PCI bus.
However, I find it quite possible that the throughput limit has been
reached because of software (driver) problems. I have done various tests
(mostly "hdparm -tT") with exactly the same PC and disks since about
kernel 2.6.8 (maybe even earlier). I remember with certainty that read
throughput in the early days was about 50MB/s for each of the big disks,
and combined with RAID 0 I got ~75MB/s. Those figures have been dropping
gradually with each new kernel release, and the situation today, with
2.6.22, is that hdparm gives a maximum throughput of 20MB/s for each
disk, and for RAID 0 too!
I have been ignoring these performance regressions because there were no
stability problems until now. So could it be that I'm hitting the 20MB/s
driver limit and some requests take too long to be served?
Dimitris
Hi,
Andrew Morton wrote:
> I suspect I was fooled by the oprofile output, which showed tremendous
> amounts of load in schedule() and switch_to(). The percentages which
> opreport shows are the percentage of non-halted CPU time. So if you have a
> function in the kernel which is using 1% of the total CPU, and the CPU is
> halted for 95% of the time, it appears that the function is taking 20% of
> CPU!
>
> The fix for that is to boot with the "idle=poll" boot parameter, to make
> the CPU spin when it has nothing else to do.
I'll test again the two_discs_bad situation after booting with that
parameter. Thanks.
>
> I'm suspecting that your machine is just stuck in D state waiting for disk.
> Did we have a sysrq-T trace?
The amazing thing is that this doesn't happen! Every single cron job
that keeps running (I intentionally mentioned that before, too) and never
ends is in R state. Strace'ing the processes shows they are just going
*extremely* slowly. I also changed the I/O elevator of hdb (the OS
disk) from cfq to deadline, unfortunately with no results. That is why
I've been considering it a CPU scheduler issue.
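For reference, one way to watch those states from the shell (assuming procps ps; R is runnable, D is uninterruptible disk wait):

```shell
# Count processes per state; a pile-up of stuck cron jobs would show as
# a large count on the R line, while disk-bound waiters appear under D.
ps -eo state= | sort | uniq -c
```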
Dimitris
procs -----------memory---------- ---swap-- -----io---- -system-- ----cpu----
r b swpd free buff cache si so bi bo in cs us sy id wa
4 2 0 161676 9716 55700 0 0 248 2020 50 131 5 4 84 7
4 2 0 161668 9716 55704 0 0 0 16000 253 504 80 20 0 0
3 2 0 161668 9716 55704 0 0 0 16000 249 522 66 34 0 0
3 2 0 161608 9716 55704 0 0 0 15936 251 508 80 20 0 0
3 2 0 161548 9716 55704 0 0 0 15936 249 523 73 28 0 0
4 2 0 161488 9716 55704 0 0 0 15936 250 517 84 16 0 0
3 2 0 161488 9724 55704 0 0 0 15876 252 469 57 43 0 0
4 2 0 161428 9724 55704 0 0 0 15872 253 504 60 40 0 0
3 2 0 161428 9724 55704 0 0 0 15936 248 500 24 76 0 0
4 2 0 161428 9740 55704 0 0 0 15840 258 475 52 48 0 0
> > In Your oprofile output I find "acpi_pm_read" particulary interesting.
> > Unlike other VIA chipsets, which I know, Your doesn't use VLink to
> > connect northbridge to southbridge. Instead PCI bus connects these two.
> > As You probably know maximal PCI throughtput is 133MiB/s. In theory. In
> > practice probably less.
acpi_pm_read is capable of disappearing into SMM traps which will make
it look very slow.
> about 15MB/s for both disks. When reading I get about 30MB/s again from
> both disks. The other disk, the small one, is mostly idle, except for
> writing little bits and bytes now and then. Since the problem occurs
> when writing, 15MB/s is just too little I think for the PCI bus.
It's about right for some of the older VIA chipsets, but if you are
seeing a speed loss then we need to know precisely which kernels the
speed dropped at. It could be that there is an I/O scheduling issue your
system shows up, or some kind of PCI bus contention when both disks are
active at once.
> I have been ignoring these performance regressions because of no
> stability problems until now. So could it be that I'm reaching the
> 20MB/s driver limit and some requests take too long to be served?
Nope.
Alan
> Hello Rafal,
Hello,
> However I find it quite possible to have reached the throughput limit
> because of software (driver) problems. I have done various testing
> (mostly "hdparm -tT" with exactly the same PC and disks since about
> kernel 2.6.8 (maybe even earlier). I remember with certainty that read
> throughput the early days was about 50MB/s for each of the big disks,
> and combined with RAID 0 I got ~75MB/s. Those figures have been dropping
> gradually with each new kernel release and the situation today, with
> 2.6.22, is that hdparm gives maximum throughput 20MB/s for each disk,
> and for RAID 0 too!
Just tested (plain curiosity).
via82cxxx average result @533MHz:
/dev/hda:
Timing cached reads: 232 MB in 2.00 seconds = 115.93 MB/sec
Timing buffered disk reads: 64 MB in 3.12 seconds = 20.54 MB/sec
pata_via average result @533MHz:
/dev/sda:
Timing cached reads: 234 MB in 2.01 seconds = 116.27 MB/sec
Timing buffered disk reads: 82 MB in 3.05 seconds = 26.92 MB/sec
Same 2.6.23-rc1-git11 kernel.
Yes - constant 6MB/s difference (31%). Cool.
> Dimitris
Regards
Rafał
Hi Alan,
Alan Cox wrote:
>>> In Your oprofile output I find "acpi_pm_read" particularly interesting.
>>> Unlike other VIA chipsets which I know, Yours doesn't use VLink to
>>> connect the northbridge to the southbridge. Instead the PCI bus connects
>>> these two. As You probably know, the maximal PCI throughput is 133MiB/s.
>>> In theory. In practice probably less.
>
> acpi_pm_read is capable of disappearing into SMM traps which will make
> it look very slow.
what is an SMM trap? I googled a bit but didn't get it...
>
>> about 15MB/s for both disks. When reading I get about 30MB/s again from
>> both disks. The other disk, the small one, is mostly idle, except for
>> writing little bits and bytes now and then. Since the problem occurs
>> when writing, 15MB/s is just too little I think for the PCI bus.
>
> It's about right for some of the older VIA chipsets, but if you are seeing
> speed loss then we need to know precisely which kernels the speed dropped
> at. Could be there is an I/O scheduling issue your system shows up, or
> some kind of PCI bus contention when both disks are active at once.
I am sure throughput kept diminishing little by little across many 2.6
releases, and that it wasn't a major regression in one specific version.
Unfortunately I cannot back up my words with measurements from older
kernels right now, since the system is hard to boot with such (new udev,
new glibc). However I promise I'll test in the future (probably using
old liveCDs) and come back then with proof.
>
>> I have been ignoring these performance regressions because of no
>> stability problems until now. So could it be that I'm reaching the
>> 20MB/s driver limit and some requests take too long to be served?
>
> Nope.
the reason I'm talking about a "software driver limit" is because I am
sure about some facts:
- The disks can reach very high speeds (60 MB/s on other systems with udma5)
- The chipset on this specific motherboard can reach much higher
numbers, as was measured with old kernels.
- No cable problems (have been changed), no strange dmesg output.
So what is left? Probably only the corresponding kernel module.
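For what it's worth, the PCI-contention theory can be bounded with numbers already in this thread (the 133MiB/s theoretical bus limit from Rafał's mail, ~15MB/s per disk observed while writing to both). This is only a rough sketch and ignores command and IDE protocol overhead:

```python
# Rough bound: aggregate disk traffic vs. the theoretical PCI limit.
# Numbers are taken from earlier in the thread; real usable PCI
# bandwidth is lower than theoretical, but nowhere near this low.
pci_limit_mib_s = 133.0      # 32-bit/33MHz PCI, theoretical
per_disk_write = 15.0        # MB/s per disk while both are writing

aggregate = 2 * per_disk_write
print(f"{aggregate:.0f} MB/s aggregate = "
      f"{aggregate / pci_limit_mib_s:.0%} of the theoretical PCI limit")
# → 30 MB/s aggregate = 23% of the theoretical PCI limit
```

Even allowing for practical PCI efficiency well below the theoretical figure, ~30MB/s should not saturate the bus, which fits the suspicion that the limit lies elsewhere.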
Thanks,
Dimitris
Rafał Bilski wrote:
>> Hello Rafal,
> Hello,
>> However I find it quite possible to have reached the throughput limit
>> because of software (driver) problems. I have done various testing
>> (mostly "hdparm -tT") with exactly the same PC and disks since about
>> kernel 2.6.8 (maybe even earlier). I remember with certainty that read
>> throughput in the early days was about 50MB/s for each of the big disks,
>> and combined with RAID 0 I got ~75MB/s. Those figures have been
>> dropping gradually with each new kernel release and the situation
>> today, with 2.6.22, is that hdparm gives a maximum throughput of 20MB/s
>> for each disk, and for RAID 0 too!
> Just tested (plain curiosity).
> via82cxxx average result @533MHz:
> /dev/hda:
> Timing cached reads: 232 MB in 2.00 seconds = 115.93 MB/sec
> Timing buffered disk reads: 64 MB in 3.12 seconds = 20.54 MB/sec
> pata_via average result @533MHz:
> /dev/sda:
> Timing cached reads: 234 MB in 2.01 seconds = 116.27 MB/sec
> Timing buffered disk reads: 82 MB in 3.05 seconds = 26.92 MB/sec
Interesting! I haven't tried libata myself on that system, I only have
remote access to it so I'm a bit afraid...
Rafal, I hope that system you run hdparm on isn't the archlinux one! Is
it easy to load an old kernel (even two years old) and do the same test?
If it is, please let me know of the results.
Thanks in advance,
Dimitris
> > acpi_pm_read is capable of disappearing into SMM traps which will make
> > it look very slow.
>
> what is an SMM trap? I googled a bit but didn't get it...
One of the less documented bits of the PC architecture. It is possible to
arrange that the CPU jumps into a special mode when triggered by some
specific external event. Originally this was used for stuff like APM and
power management but some laptops use it for stuff like faking the
keyboard interface and the Geode uses it for tons of stuff.
As SMM mode is basically invisible to the OS what oprofile and friends
see isn't what really occurs. So you see
pci write -> some address
you don't then see
SMM
CPU saves processor state
Lots of code runs (eg i2c polling the battery)
code executes RSM
Back to the OS
and the next visible profile point. This can make an I/O operation look
really slow even if it isn't the I/O which is slow.
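A toy illustration of that misattribution (pure user-space Python, with a random sleep standing in for the invisible SMI handler; the probability and the 5ms pause are made-up numbers purely for illustration, no real SMM is involved):

```python
import random
import time

def sample_op(n=200, smi_prob=0.02):
    """Time a trivial operation n times; occasionally inject a hidden
    pause playing the role of an SMM trap the OS cannot observe."""
    random.seed(1)
    samples = []
    for _ in range(n):
        t0 = time.perf_counter()
        _ = 1 + 1                     # the "fast" operation being profiled
        if random.random() < smi_prob:
            time.sleep(0.005)         # stand-in for invisible firmware work
        samples.append(time.perf_counter() - t0)
    return samples

s = sample_op()
s.sort()
# The typical sample is sub-microsecond, but any sample that caught a
# hidden pause charges all of that time to the innocent operation.
print(f"median {s[len(s) // 2] * 1e6:.1f} us, max {s[-1] * 1e6:.1f} us")
```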
> the reason I'm talking about a "software driver limit" is because I am
> sure about some facts:
> - The disks can reach very high speeds (60 MB/s on other systems with udma5)
Is UDMA5 being selected firstly?
> So what is left? Probably only the corresponding kernel module.
Unlikely to be the disk driver as that really hasn't changed tuning for a
very long time. I/O scheduler interactions are however very possible.
>> Just tested (plain curiosity).
>> via82cxxx average result @533MHz:
>> /dev/hda:
>> Timing cached reads: 232 MB in 2.00 seconds = 115.93 MB/sec
>> Timing buffered disk reads: 64 MB in 3.12 seconds = 20.54 MB/sec
>> pata_via average result @533MHz:
>> /dev/sda:
>> Timing cached reads: 234 MB in 2.01 seconds = 116.27 MB/sec
>> Timing buffered disk reads: 82 MB in 3.05 seconds = 26.92 MB/sec
>
> Interesting! I haven't tried libata myself on that system, I only have
> remote access to it so I'm a bit afraid...
Just change root=/dev/hda1 to append="root=/dev/sda1" in lilo.conf,
and change fstab. If You don't change "/" then the system will go into
single-user mode after reboot.
> Rafal, I hope that system you run hdparm on isn't the archlinux one! Is
> it easy to load an old kernel (even two years old) and do the same test?
> If it is, please let me know of the results.
I don't think it is possible. If I remember right, the kernel can't be older
than the glibc kernel headers.
Btw. my disk is a 20GB 2.5" ATA33. I wonder how 26MB/s is possible. I don't
expect more.
>
> Thanks in advance,
> Dimitris
Regards
Rafał
On Tuesday 07 August 2007 12:03:28 Rafał Bilski wrote:
> >> Just tested (plain curiosity).
> >> via82cxxx average result @533MHz:
> >> /dev/hda:
> >> Timing cached reads: 232 MB in 2.00 seconds = 115.93 MB/sec
> >> Timing buffered disk reads: 64 MB in 3.12 seconds = 20.54 MB/sec
> >> pata_via average result @533MHz:
> >> /dev/sda:
> >> Timing cached reads: 234 MB in 2.01 seconds = 116.27 MB/sec
> >> Timing buffered disk reads: 82 MB in 3.05 seconds = 26.92 MB/sec
> >
> > Interesting! I haven't tried libata myself on that system, I only have
> > remote access to it so I'm a bit afraid...
>
> Just change root=/dev/hda1 to append="root=/dev/sda1" in lilo.conf,
> and change fstab. If You don't change "/" then the system will go into
> single-user mode after reboot.
>
> > Rafal, I hope that system you run hdparm on isn't the archlinux one! Is
> > it easy to load an old kernel (even two years old) and do the same test?
> > If it is, please let me know of the results.
>
> I don't think it is possible. If I remember right, the kernel can't be older
> than the glibc kernel headers.
> Btw. my disk is a 20GB 2.5" ATA33. I wonder how 26MB/s is possible. I don't
> expect more.
Aha! So perhaps libata gives an even greater performance benefit with better
drives. I'll try to get local access to my PC and try it. Thanks.
Dimitris
On Tuesday 07 August 2007 03:37:08 Alan Cox wrote:
> > > acpi_pm_read is capable of disappearing into SMM traps which will make
> > > it look very slow.
> >
> > what is an SMM trap? I googled a bit but didn't get it...
>
> One of the less documented bits of the PC architecture. It is possible to
> arrange that the CPU jumps into a special mode when triggered by some
> specific external event. Originally this was used for stuff like APM and
> power management but some laptops use it for stuff like faking the
> keyboard interface and the Geode uses it for tons of stuff.
>
> As SMM mode is basically invisible to the OS what oprofile and friends
> see isn't what really occurs. So you see
>
> pci write -> some address
>
> you don't then see
>
> SMM
> CPU saves processor state
> Lots of code runs (eg i2c polling the battery)
> code executes RSM
>
> Back to the OS
>
> and the next visible profile point. This can make an I/O operation look
> really slow even if it isn't the I/O which is slow.
I always thought x86 is becoming a really dirty architecture. I now think it
is even uglier. :-p Thank you for the thorough explanation.
>
> > the reason I'm talking about a "software driver limit" is because I am
> > sure about some facts:
> > - The disks can reach very high speeds (60 MB/s on other systems with
> > udma5)
>
> Is UDMA5 being selected firstly?
What the kernel selects by default is udma4 (66MB/s). I tried forcing udma5
(100MB/s) with hdparm even though I think my chipset doesn't support it, and
indeed there was a difference! After repeated tests, udma4 gives 20MB/s and
udma5 gives 22MB/s. I'm mostly surprised, however, that I could even set this
option.
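For reference, those transfer modes have nominal interface ceilings that can be set against the hdparm figures just quoted (a trivial sketch; the nominal values are the standard UDMA/66 and UDMA/100 interface limits, not drive capabilities):

```python
# Nominal ATA interface ceilings (MB/s) vs. the hdparm figures above.
nominal = {"udma4": 66.7, "udma5": 100.0}
observed = {"udma4": 20.0, "udma5": 22.0}

for mode in nominal:
    pct = observed[mode] / nominal[mode]
    print(f"{mode}: {observed[mode]:.0f} of {nominal[mode]:.0f} MB/s ({pct:.0%})")
# → udma4: 20 of 67 MB/s (30%)
# → udma5: 22 of 100 MB/s (22%)
```

Both observed figures sit far below either mode's ceiling, which supports the view that the negotiated UDMA mode is not the bottleneck (and why forcing udma5 barely mattered).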
>
> > So what is left? Probably only the corresponding kernel module.
>
> Unlikely to be the disk driver as that really hasn't changed tuning for a
> very long time. I/O scheduler interactions are however very possible.
I'm now trying to use the new libata driver and see what happens...
Thanks,
Dimitris
Hello again,
I'm now using libata on the same system described before (see attached
dmesg.txt). When writing to both disks I think the problem is now worse
(pata_oprof_bad.txt, pata_vmstat_bad.txt); even the oprofile script needed
half an hour to complete! For completeness I also attach the same tests when I
write to only one disk (pata_vmstat_1disk.txt, pata_oprof_1disk.txt), where
everything is normal.
FWIW, libata did not give me any performance benefit, 20MB/s is again the peak
hdparm reports.
Thanks,
Dimitris
> Hello again,
Hi,
> I'm now using libata on the same system described before (see attached
> dmesg.txt). When writing to both disks I think the problem is now worse
> (pata_oprof_bad.txt, pata_vmstat_bad.txt); even the oprofile script needed
> half an hour to complete! For completeness I also attach the same tests when I
> write to only one disk (pata_vmstat_1disk.txt, pata_oprof_1disk.txt), where
> everything is normal.
>
> FWIW, libata did not give me any performance benefit, 20MB/s is again the peak
> hdparm reports.
This OProfile thing is extremely unhelpful in this case. It says that Your
system is using 25% of its CPU time for no-op loops, but it doesn't say why.
Your system really isn't very busy. Look here:
procs -----------memory---------- ---swap-- -----io---- -system-- ----cpu----
r b swpd free buff cache si so bi bo in cs us sy id wa
2 2 0 225352 5604 91700 0 0 18112 1664 28145 6315 29 71 0 0
2 2 0 225360 5604 91700 0 0 18496 1664 27992 6358 30 70 0 0
1 2 0 225360 5604 91700 0 0 18432 1472 28511 6315 28 72 0 0
1 2 0 225360 5604 91700 0 0 18240 1536 28031 6153 31 69 0 0
+ a 720x576 25fps YUV video stream over PCI. And the system is fully
responsive. Of course programs which need disk access must wait a bit longer,
but afterwards they run fine.
I don't have disks as fast as Yours and I can't do a destructive write test.
First disk:
1 1 0 241848 7312 100768 0 0 27712 0 927 1270 29 13 0 58
1 1 0 241052 7580 100896 0 0 4612 4676 519 702 34 12 0 54
Second disk:
0 1 0 237752 7268 100980 0 0 6464 0 468 583 37 10 0 53
0 1 0 241060 7532 100884 0 0 1728 1728 465 578 31 9 0 60
Both:
0 2 0 241592 7384 100776 0 0 33024 0 905 1415 33 16 0 51
1 2 0 240804 7528 100884 0 0 6848 6848 642 780 38 10 0 52
So sda + sdc = both.
Your single disk:
0 1 0 128804 19620 82484 0 0 0 21120 335 675 0 4 0 96
Both:
5 2 0 168480 10972 47152 0 0 0 16000 252 470 22 78 0 0
I would expect 2*21k, but we have only 2*8k, and it is lower than the single
disk case. Of course this math isn't moving us forward. The only thing which
would help would be a function call trace, as Andrew wrote. Which function is
calling delay_tsc()? Is it calling it often, or once but with a long delay?
So far it looks like some kind of hardware limit to me. Do You have any
options in BIOS which can degrade PCI or disk performance?
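Spelling out that arithmetic (a trivial sketch; bo is the vmstat write-out column in KiB/s, figures copied from the runs quoted above):

```python
# Dimitris's vmstat write-out (bo column, KiB/s).
single = 21120          # one badblocks writer running
both = 16000            # two writers, aggregate

print(f"expected ~{2 * single} KiB/s, observed {both} KiB/s "
      f"({both / (2 * single):.0%}); per disk ~{both // 2} KiB/s")
# → expected ~42240 KiB/s, observed 16000 KiB/s (38%); per disk ~8000 KiB/s
```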
>
> Thanks,
> Dimitris
Rafał
Hi Rafal, thank you for your help!
On Wednesday 08 August 2007 22:08:18 Rafał Bilski wrote:
> > Hello again,
>
> Hi,
>
> > I'm now using libata on the same system described before (see attached
> > dmesg.txt). When writing to both disks I think the problem is now worse
> > (pata_oprof_bad.txt, pata_vmstat_bad.txt); even the oprofile script
> > needed half an hour to complete! For completeness I also attach the same
> > tests when I write to only one disk (pata_vmstat_1disk.txt,
> > pata_oprof_1disk.txt), where everything is normal.
> >
> > FWIW, libata did not give me any performance benefit, 20MB/s is again the
> > peak hdparm reports.
>
> This OProfile thing is extremely unhelpful in this case. It says that Your
> system is using 25% of its CPU time for no-op loops, but it doesn't say why.
> Your system really isn't very busy. Look here:
> procs -----------memory---------- ---swap-- -----io---- -system-- ----cpu----
>  r b swpd free buff cache si so bi bo in cs us sy id wa
>  2 2 0 225352 5604 91700 0 0 18112 1664 28145 6315 29 71 0 0
>  2 2 0 225360 5604 91700 0 0 18496 1664 27992 6358 30 70 0 0
>  1 2 0 225360 5604 91700 0 0 18432 1472 28511 6315 28 72 0 0
>  1 2 0 225360 5604 91700 0 0 18240 1536 28031 6153 31 69 0 0
> + a 720x576 25fps YUV video stream over PCI. And the system is fully
> responsive. Of course programs which need disk access must wait a bit
> longer, but afterwards they run fine.
> I don't have disks as fast as Yours and I can't do a destructive write test.
> First disk:
>  1 1 0 241848 7312 100768 0 0 27712 0 927 1270 29 13 0 58
>  1 1 0 241052 7580 100896 0 0 4612 4676 519 702 34 12 0 54
> Second disk:
>  0 1 0 237752 7268 100980 0 0 6464 0 468 583 37 10 0 53
>  0 1 0 241060 7532 100884 0 0 1728 1728 465 578 31 9 0 60
> Both:
>  0 2 0 241592 7384 100776 0 0 33024 0 905 1415 33 16 0 51
>  1 2 0 240804 7528 100884 0 0 6848 6848 642 780 38 10 0 52
> So sda + sdc = both.
>
> Your single disk:
>  0 1 0 128804 19620 82484 0 0 0 21120 335 675 0 4 0 96
> Both:
>  5 2 0 168480 10972 47152 0 0 0 16000 252 470 22 78 0 0
> I would expect 2*21k, but we have only 2*8k, and it is lower than the
> single disk case. Of course this math isn't moving us forward. The only
> thing which would help would be a function call trace, as Andrew wrote.
> Which function is calling delay_tsc()? Is it calling it often, or once
> but with a long delay?
Please guide me on doing these call traces. Unfortunately I don't have any
experience at all regarding kernel-space debugging. What tools should I use?
Keep in mind also that I mostly have remote access to this PC.
However I am not really confident that the cause is delay_tsc(); it only
appeared in the libata tests. To summarize, I will write down the first lines
of all the problematic oprofile runs I've done until now:
first test:
3832 23.4818 __switch_to
3380 20.7121 schedule
653 4.0015 mask_and_ack_8259A
452 2.7698 __blockdev_direct_IO
327 2.0038 dequeue_task
second test, with idle=poll (which didn't really help; it seems the CPU is not
idling at all):
2477 31.2871 __switch_to
1625 20.5255 schedule
246 3.1072 mask_and_ack_8259A
222 2.8041 __blockdev_direct_IO
147 1.8568 follow_page
third test, libata and idle=poll (the problem felt like it was greater here,
oprofiling took a really long time):
9556 26.0523 delay_tsc
6424 17.5136 iowrite8
6178 16.8430 __switch_to
4015 10.9460 schedule
3335 9.0921 ioread8
1429 3.8959 iowrite32
So I would assume that delay_tsc() probably only makes the situation worse in
the libata tests, but the real problem is in __switch_to() and schedule(). Do
you agree with these assumptions? What are these functions used for? My guess
would be that __switch_to() is for context switching and schedule() for
process scheduling. However the context switch rate, as reported by vmstat,
is not high enough to account for the time lost in __switch_to(). As the
measurements on one disk show, the context switch rate there is higher, yet
the time spent is much less.
Is there a way for oprofile to report the time spent in function calls
depending on the call trace?
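In the absence of call-graph output, the flat listings can at least be aggregated mechanically. A small sketch that sums the percent column for groups of symbols in an opreport-style "samples percent symbol" listing (the sample lines are copied from the third test above; the grouping into "scheduler side" and "port I/O side" is my own, purely illustrative):

```python
def symbol_share(report, symbols):
    """Sum the percent column for the given symbols in an
    opreport-style 'samples  percent  symbol' listing."""
    total = 0.0
    for line in report.strip().splitlines():
        _samples, percent, name = line.split()
        if name in symbols:
            total += float(percent)
    return total

report = """
9556 26.0523 delay_tsc
6424 17.5136 iowrite8
6178 16.8430 __switch_to
4015 10.9460 schedule
3335 9.0921 ioread8
1429 3.8959 iowrite32
"""
print(f"scheduler side: {symbol_share(report, {'__switch_to', 'schedule'}):.1f}%")
print(f"port I/O side:  {symbol_share(report, {'iowrite8', 'ioread8', 'iowrite32'}):.1f}%")
# → scheduler side: 27.8%
# → port I/O side:  30.5%
```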
> So far it looks like some kind of hardware limit for me. Do You have any
> options in BIOS which can degrade PCI or disk performance?
None that I can think of. Do you have any specific options in mind?
Thanks again for the help. I guess if this doesn't lead anywhere I'll just
start compiling older vanilla kernels and see when the problem disappears.
But this needs a lot of time, and I'm not sure for how long I'll be able to
not offer any service with those disks (I was supposed to use RAID 0 to
provide storage space with them, but with the current problems that wouldn't
be so wise).
Dimitris
>
> > Thanks,
> > Dimitris
>
> Rafał
> So I would assume that delay_tsc() probably only makes the situation worse for
> the libata tests, but the real problem is at __switch_to() and schedule(). Do
> you agree with these assumptions?
Yes. I agree that the percentage of CPU time is unreasonably high for these
functions. But not only for them.
> Is there a way for oprofile to report the time spent in function calls
> depending on the call trace?
I don't know any.
> Thanks again for the help. I guess if that doesn't lead anywhere I'll just
> start compiling older vanilla kernels and see when the problem disappears.
> But this needs a lot of time and I'm not sure for how long I'll be able to
> not offer any service with those disks (I was supposed to use RAID 0 to
> provide storage space with them, but with the current problems that wouldn't
> be so wise).
Probably this is the best and fastest way to diagnose the problem.
> Dimitris
Regards
Rafał
procs -----------memory---------- ---swap-- -----io---- -system-- ----cpu----
r b swpd free buff cache si so bi bo in cs us sy id wa
8 1 0 8260 64260 117276 0 0 18 2481 239 279 44 7 44 5
8 2 0 8200 64260 117276 0 0 0 15920 361 510 89 11 0 0
9 2 0 8200 64268 117276 0 0 0 16100 362 511 89 11 0 0
9 2 0 8200 64268 117276 0 0 0 16000 353 495 85 15 0 0
8 2 0 8200 64268 117276 0 0 0 16000 352 502 90 10 0 0
9 2 0 8200 64268 117276 0 0 0 15872 350 494 88 12 0 0
8 2 0 8200 64268 117276 0 0 0 15872 350 497 91 9 0 0
9 2 0 8200 64268 117276 0 0 0 15872 353 471 75 25 0 0
8 2 0 8200 64276 117276 0 0 0 15956 356 521 88 12 0 0
9 2 0 8200 64276 117276 0 0 0 16000 353 516 92 8 0 0
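The capture above can be reduced mechanically. A minimal sketch that averages the bo (write-out) and cs (context switch) columns over the steady-state rows, skipping the headers and the first since-boot sample; the capture string below is the data just shown:

```python
def vmstat_means(text):
    """Average bo (KiB/s written) and cs (context switches/s) over a
    plain `vmstat <n>` capture, ignoring header lines and the first
    since-boot sample."""
    rows = [ln.split() for ln in text.splitlines()
            if ln.split() and ln.split()[0].isdigit()]
    rows = rows[1:]                         # drop the since-boot row
    bo = [int(r[9]) for r in rows]          # 10th column: bo
    cs = [int(r[11]) for r in rows]         # 12th column: cs
    return sum(bo) / len(bo), sum(cs) / len(cs)

capture = """\
procs -----------memory---------- ---swap-- -----io---- -system-- ----cpu----
 r b swpd free buff cache si so bi bo in cs us sy id wa
 8 1 0 8260 64260 117276 0 0 18 2481 239 279 44 7 44 5
 8 2 0 8200 64260 117276 0 0 0 15920 361 510 89 11 0 0
 9 2 0 8200 64268 117276 0 0 0 16100 362 511 89 11 0 0
 9 2 0 8200 64268 117276 0 0 0 16000 353 495 85 15 0 0
 8 2 0 8200 64268 117276 0 0 0 16000 352 502 90 10 0 0
 9 2 0 8200 64268 117276 0 0 0 15872 350 494 88 12 0 0
 8 2 0 8200 64268 117276 0 0 0 15872 350 497 91 9 0 0
 9 2 0 8200 64268 117276 0 0 0 15872 353 471 75 25 0 0
 8 2 0 8200 64276 117276 0 0 0 15956 356 521 88 12 0 0
 9 2 0 8200 64276 117276 0 0 0 16000 353 516 92 8 0 0
"""
bo, cs = vmstat_means(capture)
print(f"mean bo {bo:.0f} KiB/s, mean cs {cs:.0f}/s")
# → mean bo 15955 KiB/s, mean cs 502/s
```

So the steady state is ~16MB/s of write-out at only ~500 context switches per second, the combination the thread has been puzzling over.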