Hi,
I've got this video server serving video for VoD. The problem is that the
P4 1.8 seems to be maxed out by a few system calls. The output below is for
~50 clients streaming at ~4.5Mbps. If I try to increase this to ~70, the
CPU maxes out.
Does anyone have an idea?
bash-2.05# readprofile | sort -rn +2 | head -30
154203 default_idle 2409.4219
212723 csum_partial_copy_generic 916.9095
100164 handle_IRQ_event 695.5833
24979 system_call 390.2969
37300 e1000_intr 388.5417
119699 ide_intr 340.0540
30598 skb_release_data 273.1964
40740 do_softirq 195.8654
131818 do_wp_page 164.7725
9935 fget 155.2344
24747 kfree 154.6687
10911 del_timer 113.6562
11683 ip_conntrack_find_get 91.2734
4120 sock_poll 85.8333
9357 ip_ct_find_proto 83.5446
5194 sock_wfree 81.1562
4929 add_wait_queue 77.0156
8361 flush_tlb_page 74.6518
4571 remove_wait_queue 71.4219
2191 __brelse 68.4688
29477 skb_clone 68.2338
8562 do_gettimeofday 59.4583
5673 process_timeout 59.0938
11097 tcp_v4_send_check 57.7969
6124 kfree_skbmem 54.6786
17115 tcp_poll 53.4844
21130 nf_hook_slow 52.8250
8299 ip_ct_refresh 51.8687
15429 __kfree_skb 50.7533
1059 lru_cache_del 46.0435
roy
--
Roy Sigurd Karlsbakk, Datavaktmester
ProntoTV AS - http://www.pronto.tv/
Tel: +47 9801 3356
Computers are like air conditioners.
They stop working when you open Windows.
> I've got this video server serving video for VoD. problem is the P4 1.8
> seems to be maxed out by a few system calls. The below output is for ~50
> clients streaming at ~4.5Mbps. if trying to increase this to ~70, the CPU
> maxes out.
>
> Does anyone have an idea?
...adding the whole profile output - sorted by the first column this time...
905182 total 0.4741
121426 csum_partial_copy_generic 474.3203
93633 default_idle 1800.6346
74665 do_wp_page 111.1086
65857 ide_intr 184.9916
53636 handle_IRQ_event 432.5484
21973 do_softirq 107.7108
20498 e1000_intr 244.0238
19800 do_page_fault 16.8081
19395 skb_clone 45.7429
14564 system_call 260.0714
13592 kfree 89.4211
13557 skb_release_data 116.8707
13025 ide_do_request 17.6970
12988 do_rw_disk 8.4557
11841 tcp_sendmsg 2.6814
11720 nf_hook_slow 29.0099
11712 tcp_poll 34.0465
10688 schedule 7.8588
10386 __kfree_skb 34.1645
10052 ipt_do_table 10.1741
8286 fget 115.0833
7436 tcp_v4_send_check 44.2619
7191 e1000_clean_tx_irq 16.6458
7031 kmalloc 18.1211
6610 tcp_write_xmit 9.3892
6241 tcp_clean_rtx_queue 8.0425
6232 ip_conntrack_find_get 51.9333
6140 ide_dmaproc 8.4341
6125 tcp_packet 14.0482
5858 qdisc_restart 15.4158
5734 e1000_xmit_frame 5.6660
5709 tcp_v4_rcv 3.7363
5703 sys_rt_sigprocmask 11.4060
5445 tcp_transmit_skb 3.7500
5273 alloc_skb 11.8761
4961 ide_wait_stat 18.7917
4790 ip_ct_find_proto 44.3519
4782 add_timer 18.3923
4760 ip_ct_refresh 29.7500
4729 do_anonymous_page 17.6455
4616 e1000_clean_rx_irq 4.9106
4464 do_gettimeofday 37.2000
4359 flush_tlb_page 38.9196
4209 ip_finish_output2 16.4414
3731 get_hash_table 23.3188
3714 eth_type_trans 21.1023
3712 __make_request 2.3375
3680 __ip_conntrack_find 12.7778
3480 ip_route_input 9.1579
3363 kfree_skbmem 32.3365
3295 __switch_to 15.2546
3205 fput 13.1352
3143 rmqueue 5.3452
3137 ip_conntrack_in 5.0272
3008 sync_timers 250.6667
2861 sock_wfree 47.6833
2580 ip_queue_xmit 2.0347
2578 process_timeout 26.8542
2577 netif_rx 6.1357
2555 get_user_pages 6.2623
2504 sock_poll 62.6000
2346 ide_build_sglist 5.6394
2316 brw_kiovec 2.5619
2256 csum_partial 7.8333
2251 ip_queue_xmit2 4.2958
2198 start_request 4.0404
2186 dev_queue_xmit 2.7462
2167 timer_bh 2.2203
2162 __free_pages_ok 3.1608
2157 zap_page_range 2.4400
1942 mark_dirty_kiobuf 21.1087
1733 process_backlog 5.9349
1719 tcp_rcv_established 0.8493
1689 add_wait_queue 32.4808
1650 mod_timer 6.1567
1603 wait_kio 17.4239
1575 net_rx_action 4.8611
1554 get_pid 4.0052
1434 lru_cache_add 12.8036
1429 handle_mm_fault 7.7663
1397 ip_local_deliver_finish 4.5357
1357 nf_iterate 10.2803
1350 e1000_alloc_rx_buffers 5.2734
1298 do_select 2.5155
1268 unlock_page 12.1923
1209 submit_bh 10.7946
1184 add_entropy_words 5.9200
1175 __brelse 36.7188
1125 __pollwait 7.8125
1108 shrink_list 5.7708
1099 generic_make_request 3.6151
1080 __free_pages 33.7500
1052 tcp_ack 1.2524
1020 ip_rcv 1.0851
986 raid0_make_request 2.9345
898 ext3_direct_io_get_block 4.7766
883 pfifo_fast_dequeue 11.6184
863 sys_gettimeofday 5.5321
828 tcp_ack_update_window 3.9808
813 ipt_local_out_hook 7.8173
761 __lru_cache_del 6.5603
756 sys_write 2.9531
742 __rdtsc_delay 26.5000
730 uhci_interrupt 3.3182
718 net_tx_action 2.5643
710 batch_entropy_store 3.9444
701 add_timer_randomness 3.3066
666 tasklet_hi_action 4.1625
662 sys_nanosleep 1.7796
635 set_page_dirty 5.4741
627 __tcp_data_snd_check 3.1350
611 netif_receive_skb 1.9094
601 pfifo_fast_enqueue 5.3661
590 del_timer_sync 4.3382
587 lru_cache_del 26.6818
574 get_unmapped_area 2.1103
561 wait_for_tcp_memory 0.7835
557 ip_refrag 5.8021
546 ip_conntrack_local 6.2045
515 sys_select 0.4486
507 __tcp_select_window 2.2634
481 ext3_get_branch 2.2689
433 ip_output 1.2301
389 ip_confirm 9.7250
384 find_vma 4.5714
379 set_bh_page 9.4750
376 tcp_v4_do_rcv 1.0108
370 tcp_ack_no_tstamp 1.4453
368 batch_entropy_process 2.0000
365 ide_build_dmatable 1.1551
364 ip_rcv_finish 0.7696
363 kmem_cache_free 2.8359
362 __wake_up 1.8854
338 ext3_get_block_handle 0.5152
336 inet_sendmsg 5.2500
318 bh_action 2.3382
298 tcp_data_queue 0.1064
272 md_make_request 2.6154
272 ext3_block_to_path 0.9714
248 sock_sendmsg 1.8235
244 __alloc_pages 0.6932
238 kmem_cache_alloc 0.7532
226 __free_pte 3.1389
225 tcp_ack_probe 1.3720
224 __run_task_queue 2.0741
213 ide_get_queue 5.3250
209 __ip_ct_find_proto 4.3542
202 get_sample_stats 1.7414
195 tcp_write_space 1.6810
185 schedule_timeout 1.1859
184 do_signal 0.2992
182 ipt_hook 4.5500
171 generic_direct_IO 0.5552
168 can_share_swap_page 1.8261
157 ip_local_deliver 0.3965
155 get_conntrack_index 2.7679
145 tcp_push_one 0.5664
145 ide_set_handler 1.2500
144 pipe_poll 1.4400
143 max_select_fd 0.8938
139 pte_alloc 0.5792
138 del_timer 1.6429
136 sock_write 0.7234
131 poll_freewait 1.9265
131 getblk 1.7237
130 send_sig_info 0.8553
128 __release_sock 1.4545
125 ret_from_sys_call 7.3529
120 ext3_direct_IO 0.1807
118 tcp_pkt_to_tuple 3.6875
117 find_vma_prev 0.6648
114 do_no_page 0.2298
112 tqueue_bh 4.0000
112 follow_page 1.0769
110 bread 1.1000
108 e1000_rx_checksum 1.2273
107 generic_file_direct_IO 0.1938
102 add_interrupt_randomness 2.5500
97 remove_wait_queue 1.7321
96 mark_page_accessed 2.0000
91 kill_something_info 0.2645
85 invert_tuple 1.9318
81 exit_notify 0.1151
81 cpu_idle 0.9643
80 tcp_new_space 0.6061
79 nf_register_queue_handler 0.5197
75 uhci_remove_pending_qhs 0.3906
69 pdc202xx_dmaproc 0.1250
68 sys_read 0.2656
68 nf_reinject 0.1491
66 map_user_kiobuf 0.2619
65 find_vma_prepare 0.6500
64 generic_file_read 0.2353
61 check_pgt_cache 2.5417
60 free_pages 1.8750
58 error_code 0.9667
57 vm_enough_memory 0.5481
56 __delay 1.4000
55 __const_udelay 1.0577
53 tcp_ioctl 0.0908
53 journal_commit_transaction 0.0132
53 do_munmap 0.0901
52 _alloc_pages 2.1667
51 uhci_finish_completion 0.4554
51 credit_entropy_store 1.1591
50 rh_report_status 0.1953
50 free_page_and_swap_cache 0.8929
49 sys_rt_sigsuspend 0.1750
49 nr_free_pages 0.6125
49 do_mmap_pgoff 0.0394
48 e1000_update_stats 0.0307
48 do_get_write_access 0.0366
48 __journal_file_buffer 0.0916
48 __get_free_pages 2.0000
48 .text.lock.e1000_main 1.7143
47 expand_kiobuf 0.3092
46 uhci_free_pending_qhs 0.4600
46 tcp_parse_options 0.0833
46 kmem_cache_size 5.7500
45 rb_erase 0.2083
44 unmap_kiobuf 0.6111
41 tcp_cwnd_application_limited 0.3106
41 rh_int_timer_do 0.1165
41 init_or_cleanup 0.1424
40 sync_unlocked_inodes 0.0901
40 init_buffer 1.4286
39 .text.lock.ip_input 1.0000
38 vma_merge 0.1301
38 pfifo_fast_requeue 0.6786
38 ip_conntrack_get 0.9500
38 dev_watchdog 0.2209
37 .text.lock.ip_output 0.2803
36 do_check_pgt_cache 0.1731
35 tcp_retrans_try_collapse 0.0576
35 journal_add_journal_head 0.1306
34 ext3_get_inode_loc 0.0914
33 journal_write_revoke_records 0.1964
32 fsync_buffers_list 0.0860
31 filemap_fdatasync 0.1615
31 __pmd_alloc 1.5500
30 sys_wait4 0.0305
30 restore_sigcontext 0.0949
29 sys_sigreturn 0.1169
28 tcp_fastretrans_alert 0.0224
28 do_settimeofday 0.1628
28 do_ide_request 1.4000
27 unmap_fixup 0.0785
27 find_extend_vma 0.1350
27 eth_header_parse 0.8438
27 current_capacity 0.6750
26 save_i387 0.0478
26 __journal_clean_checkpoint_list 0.2407
25 update_atime 0.3125
25 tcp_v4_destroy_sock 0.0718
25 link_path_walk 0.0102
25 buffer_insert_inode_queue 0.2841
25 __journal_unfile_buffer 0.0665
24 sys_mmap2 0.1622
24 rh_send_irq 0.0896
24 rb_insert_color 0.1224
24 ext3_do_update_inode 0.0261
24 balance_dirty_state 0.3158
24 add_wait_queue_exclusive 0.4615
24 __try_to_free_cp_buf 0.4000
23 free_kiobuf_bhs 0.2396
22 tcp_rcv_synsent_state_process 0.0169
22 sys_munmap 0.2619
22 start_this_handle 0.0598
22 sock_rfree 1.3750
22 setup_sigcontext 0.0743
22 flush_tlb_mm 0.1964
22 do_exit 0.0301
22 alloc_kiobuf_bhs 0.1170
22 __rb_erase_color 0.0567
21 tcp_mem_schedule 0.0477
21 setup_frame 0.0482
21 __generic_copy_to_user 0.3500
20 unlock_buffer 0.3125
20 journal_write_metadata_buffer 0.0240
20 d_lookup 0.0704
20 copy_skb_header 0.0980
19 sync_old_buffers 0.1218
19 sock_mmap 0.4750
19 skb_split 0.0344
19 select_bits_alloc 0.7917
19 get_info_ptr 0.2065
17 tcp_write_wakeup 0.0363
17 ret_from_exception 0.6800
17 kiobuf_wait_for_io 0.1062
17 journal_unlock_journal_head 0.1518
17 bad_signal 0.1250
16 tcp_probe_timer 0.0952
16 tcp_close 0.0083
16 ip_route_output_slow 0.0099
16 __mark_inode_dirty 0.0952
16 .text.lock.timer 0.1250
16 .text.lock.tcp 0.0152
15 journal_cancel_revoke 0.0765
15 ext3_bmap 0.1500
15 do_fork 0.0074
15 blk_grow_request_list 0.0833
14 tcp_v4_conn_request 0.0145
14 sync_supers 0.0507
14 log_start_commit 0.0946
14 lock_vma_mappings 0.3500
14 journal_dirty_metadata 0.0354
14 file_read_actor 0.0625
14 __insert_vm_struct 0.1400
13 tcp_time_to_recover 0.0290
13 sys_ioctl 0.0259
13 lookup_swap_cache 0.1625
13 ip_build_xmit_slow 0.0099
13 invalidate_inode_pages 0.0739
13 ext3_dirty_inode 0.0478
13 bmap 0.2955
12 tcp_collapse 0.0143
12 sys_socketcall 0.0234
12 put_filp 0.1364
12 make_pages_present 0.0968
12 journal_get_write_access 0.1304
12 generic_file_write 0.0061
12 e1000_ioctl 0.3333
11 uhci_transfer_result 0.0316
11 tcp_try_to_open 0.0348
11 tcp_recvmsg 0.0045
11 tcp_create_openreq_child 0.0092
11 sys_kill 0.1250
11 schedule_tail 0.0786
11 osync_buffers_list 0.0859
11 journal_stop 0.0255
11 do_sigpending 0.0887
10 tcp_unhash 0.0397
10 tcp_send_probe0 0.0424
10 tcp_rcv_state_process 0.0040
10 sys_poll 0.0138
10 inet_shutdown 0.0208
10 execute_drive_cmd 0.0221
10 __put_unused_buffer_head 0.1136
9 tcp_write_timer 0.0395
9 tcp_send_skb 0.0191
9 tcp_make_synack 0.0082
9 set_buffer_flushtime 0.4500
9 raid0_status 0.2045
9 copy_page_range 0.0205
8 kupdate 0.0274
8 journal_get_descriptor_buffer 0.0741
8 get_empty_filp 0.0253
8 ext3_write_super 0.0741
8 count_active_tasks 0.1111
8 atomic_dec_and_lock 0.1111
8 __lock_page 0.0400
8 __journal_remove_journal_head 0.0250
8 __ip_conntrack_confirm 0.0115
8 __block_prepare_write 0.0105
7 tcp_invert_tuple 0.2188
7 ports_active 0.1346
7 pipe_write 0.0112
7 kjournald 0.0130
7 handle_signal 0.0273
7 grow_buffers 0.0254
7 ext3_get_block 0.0700
7 balance_classzone 0.0151
7 __jbd_kmalloc 0.0625
7 .text.lock.swap 0.1296
6 vsnprintf 0.0057
6 tcp_v4_send_reset 0.0176
6 tcp_accept 0.0105
6 sleep_on 0.0500
6 select_bits_free 0.3750
6 pipe_read 0.0118
6 number 0.0055
6 ip_route_output_key 0.0165
6 inet_accept 0.0136
6 get_unused_buffer_head 0.0375
6 dput 0.0176
6 cleanup_rbuf 0.0273
6 __journal_remove_checkpoint 0.0556
6 __journal_drop_transaction 0.0087
6 __find_get_page 0.0938
6 .text.lock.netfilter 0.0260
5 vmtruncate_list 0.0625
5 vfs_permission 0.0208
5 tcp_v4_hnd_req 0.0147
5 tcp_init_cwnd 0.0500
5 tcp_check_urg 0.0158
5 tcp_check_sack_reneging 0.0240
5 sys_fork 0.1786
5 sock_setsockopt 0.0034
5 sock_init_data 0.0161
5 sock_def_readable 0.0521
5 release_x86_irqs 0.0595
5 release_task 0.0109
5 refile_buffer 0.1389
5 pipe_release 0.0368
5 path_init 0.0129
5 nr_free_buffer_pages 0.0625
5 mprotect_fixup 0.0043
5 log_space_left 0.1562
5 ll_rw_block 0.0119
5 journal_start 0.0272
5 init_bh 0.2083
5 get_zeroed_page 0.1389
5 ext3_commit_write 0.0078
5 e1000_tx_timeout 0.2500
5 do_poll 0.0227
5 bdfind 0.1389
5 add_keyboard_randomness 0.1250
5 __wait_on_buffer 0.0338
5 __vma_link 0.0284
5 __tcp_mem_reclaim 0.0595
5 __rb_rotate_left 0.0781
4 write_profile 0.0244
4 tcp_v4_syn_recv_sock 0.0064
4 tcp_v4_search_req 0.0278
4 tcp_v4_route_req 0.0192
4 tcp_v4_init_sock 0.0169
4 tcp_cwnd_restart 0.0263
4 tcp_check_req 0.0043
4 tcp_check_reno_reordering 0.0500
4 sys_mprotect 0.0078
4 strncpy_from_user 0.0500
4 sock_def_wakeup 0.0625
4 sock_alloc 0.0208
4 skb_copy_datagram_iovec 0.0071
4 lookup_mnt 0.0476
4 locks_remove_posix 0.0096
4 invalidate_inode_buffers 0.0370
4 init_conntrack 0.0043
4 halfMD4Transform 0.0068
4 find_or_create_page 0.0164
4 filp_close 0.0238
4 ext3_reserve_inode_write 0.0233
4 ext3_find_goal 0.0213
4 do_fcntl 0.0059
4 dnotify_flush 0.0345
4 d_alloc 0.0105
4 add_blkdev_randomness 0.0526
4 _stext 0.0500
4 __journal_insert_checkpoint 0.0167
4 __find_lock_page_helper 0.0323
4 .text.lock.inode 0.0086
3 wait_for_tcp_connect 0.0054
3 tcp_v4_get_port 0.0045
3 tcp_put_port 0.0150
3 tcp_init_xmit_timers 0.0221
3 tcp_clear_xmit_timers 0.0234
3 tcp_add_reno_sack 0.0357
3 sys_sched_getscheduler 0.0288
3 sys_fcntl64 0.0221
3 sys_accept 0.0119
3 sock_ioctl 0.0268
3 sock_fasync 0.0038
3 sock_def_error_report 0.0312
3 rt_check_expire__thr 0.0077
3 rh_init_int_timer 0.0278
3 reset_hc 0.0167
3 register_gifconf 0.0938
3 read_chan 0.0016
3 put_unused_buffer_head 0.0833
3 pipe_ioctl 0.0375
3 permission 0.0227
3 open_namei 0.0024
3 mm_release 0.0833
3 locks_remove_flock 0.0163
3 ksoftirqd 0.0153
3 journal_file_buffer 0.0682
3 iput 0.0060
3 ip_build_and_send_pkt 0.0067
3 interruptible_sleep_on 0.0250
3 inet_sock_destruct 0.0080
3 inet_ioctl 0.0079
3 inet_create 0.0048
3 immediate_bh 0.1071
3 get_unused_fd 0.0077
3 get_empty_inode 0.0179
3 flush_tlb_all_ipi 0.0395
3 filemap_fdatawait 0.0214
3 fd_install 0.0441
3 ext3_prepare_write 0.0056
3 ext3_mark_iloc_dirty 0.0357
3 e1000_watchdog 0.0064
3 e1000_read_phy_reg 0.0179
3 d_invalidate 0.0214
3 create_buffers 0.0125
3 cp_new_stat64 0.0095
3 copy_mm 0.0040
3 copy_files 0.0043
3 bdget 0.0078
3 __insert_into_lru_list 0.0300
3 __global_restore_flags 0.0417
3 __get_user_4 0.1250
2 write_ldt 0.0037
2 walk_page_buffers 0.0161
2 tcp_try_undo_partial 0.0093
2 tcp_try_undo_dsack 0.0294
2 tcp_send_ack 0.0100
2 tcp_retransmit_skb 0.0034
2 tcp_new 0.0333
2 tcp_init_metrics 0.0063
2 tcp_fragment 0.0029
2 tcp_fixup_sndbuf 0.0455
2 tcp_enter_loss 0.0051
2 tcp_destroy_sock 0.0043
2 tcp_close_state 0.0104
2 tcp_child_process 0.0134
2 tcp_bucket_create 0.0263
2 tasklet_init 0.0500
2 sys_close 0.0179
2 sock_recvmsg 0.0116
2 sock_map_fd 0.0052
2 sk_free 0.0172
2 sk_alloc 0.0208
2 sem_exit 0.0038
2 reschedule 0.1667
2 put_files_struct 0.0109
2 path_release 0.0417
2 path_lookup 0.0556
2 mmput 0.0172
2 kiobuf_init 0.0238
2 journal_unfile_buffer 0.0556
2 journal_get_undo_access 0.0070
2 journal_dirty_data 0.0047
2 ip_mc_drop_socket 0.0156
2 idedisk_open 0.0156
2 grow_dev_page 0.0122
2 getname 0.0128
2 generic_unplug_device 0.0333
2 generic_file_llseek 0.0135
2 free_kiovec 0.0200
2 flush_signal_handlers 0.0333
2 filemap_nopage 0.0040
2 ext3_writepage_trans_blocks 0.0152
2 ext3_getblk 0.0030
2 do_generic_file_read 0.0017
2 destroy_inode 0.0455
2 deliver_to_old_ones 0.0114
2 copy_namespace 0.0023
2 clear_inode 0.0122
2 clean_inode 0.0109
2 block_prepare_write 0.0179
2 alloc_kiovec 0.0161
2 add_page_to_hash_queue 0.0455
2 activate_page 0.0139
2 __tcp_v4_lookup_listener 0.0208
2 __journal_refile_buffer 0.0088
2 __generic_copy_from_user 0.0227
2 __find_lock_page 0.0500
2 __down_trylock 0.0263
2 __down_failed_trylock 0.1667
2 __block_commit_write 0.0098
2 .text.lock.sched 0.0042
1 vt_console_device 0.0250
1 vgacon_save_screen 0.0114
1 udp_sendmsg 0.0010
1 tty_write 0.0015
1 tty_ioctl 0.0011
1 tcp_xmit_retransmit_queue 0.0010
1 tcp_xmit_probe_skb 0.0086
1 tcp_v4_synq_add 0.0063
1 tcp_v4_rebuild_header 0.0028
1 tcp_timewait_kill 0.0045
1 tcp_sync_mss 0.0081
1 tcp_reset_keepalive_timer 0.0250
1 tcp_reset 0.0039
1 tcp_recv_urg 0.0044
1 tcp_incr_quickack 0.0167
1 tcp_error 0.0139
1 sys_time 0.0119
1 sys_stat64 0.0086
1 sys_modify_ldt 0.0106
1 sys_lstat64 0.0089
1 sys_llseek 0.0034
1 sys_getppid 0.0250
1 sys_getpeername 0.0081
1 sys_fstat64 0.0104
1 sys_clone 0.0250
1 sys_brk 0.0042
1 sys_access 0.0034
1 svc_udp_recvfrom 0.0014
1 sock_wmalloc 0.0125
1 sock_release 0.0104
1 sock_read 0.0064
1 sock_create 0.0036
1 skb_recv_datagram 0.0042
1 show_mem 0.0033
1 setup_rt_frame 0.0015
1 setscheduler 0.0024
1 secure_tcp_sequence_number 0.0051
1 restart_request 0.0132
1 remove_inode_page 0.0192
1 remove_expectations 0.0208
1 proc_pid_lookup 0.0020
1 proc_lookup 0.0068
1 pdc202xx_reset 0.0074
1 path_walk 0.0357
1 opost 0.0023
1 old_mmap 0.0033
1 normal_poll 0.0035
1 nfs3svc_encode_attrstat 0.0020
1 n_tty_receive_buf 0.0002
1 move_addr_to_user 0.0119
1 mm_init 0.0051
1 memory_open 0.0050
1 kmem_cache_grow 0.0018
1 kill_fasync 0.0172
1 journal_free_journal_head 0.0500
1 journal_bmap 0.0089
1 journal_blocks_per_page 0.0312
1 journal_alloc_journal_head 0.0096
1 is_read_only 0.0147
1 ip_ct_gather_frags 0.0031
1 init_private_file 0.0093
1 init_once 0.0038
1 init_buffer_head 0.0182
1 inet_release 0.0125
1 inet_getname 0.0083
1 inet_autobind 0.0023
1 get_pipe_inode 0.0057
1 free_pgtables 0.0071
1 fn_hash_lookup 0.0045
1 find_inlist_lock 0.0035
1 file_move 0.0139
1 fcntl_dirnotify 0.0032
1 ext3_write_inode 0.0192
1 ext3_test_allocatable 0.0156
1 ext3_release_file 0.0357
1 ext3_read_inode 0.0014
1 ext3_open_file 0.0250
1 ext3_group_sparse 0.0104
1 ext3_file_write 0.0053
1 exit_sighand 0.0100
1 e1000_tbi_adjust_stats 0.0021
1 e1000_check_for_link 0.0020
1 do_timer 0.0125
1 do_tcp_sendpages 0.0004
1 do_sys_settimeofday 0.0064
1 do_readv_writev 0.0016
1 do_pollfd 0.0074
1 death_by_timeout 0.0068
1 d_instantiate 0.0139
1 cpu_raise_softirq 0.0154
1 copy_thread 0.0071
1 clear_page_tables 0.0046
1 clean_from_lists 0.0139
1 check_unthrottle 0.0208
1 change_protection 0.0027
1 cached_lookup 0.0119
1 add_to_page_cache_locked 0.0081
1 __user_walk 0.0156
1 __remove_inode_page 0.0104
1 __remove_from_lru_list 0.0119
1 __refile_buffer 0.0109
1 __rb_rotate_right 0.0156
1 __loop_delay 0.0250
1 .text.lock.super 0.0071
On Wed, Oct 23, 2002 at 01:06:18PM +0200, Roy Sigurd Karlsbakk wrote:
> > I've got this video server serving video for VoD. problem is the P4 1.8
> > seems to be maxed out by a few system calls. The below output is for ~50
> > clients streaming at ~4.5Mbps. if trying to increase this to ~70, the CPU
> > maxes out.
'50 clients *each* streaming at ~4.4MBps', better make that clear, otherwise
something is *very* broken. Also mention that you have an e1000 card which
does not do outgoing checksumming.
You'd think that a kernel would be able to do 250megabits of TCP checksums
though.
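To put rough numbers on that (a quick sketch; the ~4.5 Mbit/s per-client figure is taken from the original post, and the helper name is mine):

```c
#include <assert.h>
#include <stdio.h>

/* Aggregate bitrate for N clients at a given per-stream rate,
 * assuming every client streams at the same rate. */
static double aggregate_mbps(int clients, double per_client_mbps)
{
    return clients * per_client_mbps;
}

/* 50 clients -> 225 Mbit/s (~28 MB/s); 70 clients -> 315 Mbit/s (~39 MB/s) */
```

So the jump from ~50 to ~70 clients is roughly 225 -> 315 Mbit/s of payload, which is where the "250 megabits of TCP checksums" figure comes from.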
> ...adding the whole profile output - sorted by the first column this time...
>
> 905182 total 0.4741
> 121426 csum_partial_copy_generic 474.3203
> 93633 default_idle 1800.6346
> 74665 do_wp_page 111.1086
Perhaps the 'copy' also entails grabbing the page from disk, leading to
inflated csum_partial_copy_generic stats?
Where are you serving from?
Regards,
bert
--
http://www.PowerDNS.com Versatile DNS Software & Services
http://lartc.org Linux Advanced Routing & Traffic Control HOWTO
On Wed, 2002-10-23 at 06:01, bert hubert wrote:
> Also mention that you have an e1000 card which
> does not do outgoing checksumming.
The e1000 can very well do hardware checksumming on transmit.
The missing piece of the puzzle is that his application is not
using sendfile(), without which no transmit checksum offload
can take place.
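For illustration, a minimal sendfile() transmit loop might look like the sketch below (not the poster's actual server; `stream_file` and the fd handling are hypothetical, and the destination can be any writable fd such as a connected TCP socket):

```c
#include <assert.h>
#include <fcntl.h>
#include <string.h>
#include <sys/sendfile.h>
#include <sys/stat.h>
#include <unistd.h>

/* Stream an entire file out of out_fd via sendfile(), so the data can be
 * DMA'd disk -> page cache -> NIC without the CPU copying (or, with TX
 * checksum offload, even checksumming) it. Returns 0 on success. */
static int stream_file(int out_fd, const char *path)
{
    int in_fd = open(path, O_RDONLY);
    if (in_fd < 0)
        return -1;

    struct stat st;
    if (fstat(in_fd, &st) < 0) {
        close(in_fd);
        return -1;
    }

    off_t offset = 0;
    while (offset < st.st_size) {
        /* sendfile() advances offset itself; loop until done. */
        ssize_t n = sendfile(out_fd, in_fd, &offset, st.st_size - offset);
        if (n <= 0) {
            close(in_fd);
            return -1;
        }
    }
    close(in_fd);
    return 0;
}
```

Since each call advances `offset`, the loop simply retries until the whole file - even a multi-gigabyte one - has been pushed out.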
> >
> > 905182 total 0.4741
> > 121426 csum_partial_copy_generic 474.3203
>
> Well, maybe take a look at this func and try to optimize it?
I don't know assembly that well - sorry.
> > 93633 default_idle 1800.6346
> > 74665 do_wp_page 111.1086
>
> What's this?
do_wp_page is defined as a function in mm/memory.c.
The comment from that file:
/*
* This routine handles present pages, when users try to write
* to a shared page. It is done by copying the page to a new address
* and decrementing the shared-page counter for the old page.
*
* Goto-purists beware: the only reason for goto's here is that it results
* in better assembly code.. The "default" path will see no jumps at all.
*
* Note that this routine assumes that the protection checks have been
* done by the caller (the low-level page fault routine in most cases).
* Thus we can safely just mark it writable once we've done any necessary
* COW.
*
* We also mark the page dirty at this point even though the page will
* change only once the write actually happens. This avoids a few races,
* and potentially makes it more efficient.
*
* We hold the mm semaphore and the page_table_lock on entry and exit
* with the page_table_lock released.
*/
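That code path is easy to trigger from user space: write through a MAP_PRIVATE mapping and the kernel copies the page on the first write fault, leaving the underlying file untouched. A small hypothetical demo (the function name and path are mine, not from the thread):

```c
#include <assert.h>
#include <fcntl.h>
#include <sys/mman.h>
#include <unistd.h>

/* Write through a private file mapping. The read fault maps the page-cache
 * page read-only; the subsequent write fault hits the COW path (do_wp_page),
 * which copies the page - so the file on disk keeps its original byte.
 * Returns the byte still visible in the file afterwards. */
static char cow_demo(const char *path)
{
    int fd = open(path, O_RDWR | O_CREAT | O_TRUNC, 0600);
    char orig = 'A';
    assert(fd >= 0 && write(fd, &orig, 1) == 1);

    char *map = mmap(NULL, 1, PROT_READ | PROT_WRITE, MAP_PRIVATE, fd, 0);
    assert(map != MAP_FAILED);

    volatile char seen = map[0];   /* read fault: page mapped read-only */
    assert(seen == 'A');
    map[0] = 'B';                  /* write fault: COW copies the page */

    char on_disk;
    assert(pread(fd, &on_disk, 1, 0) == 1);   /* file still holds 'A' */
    munmap(map, 1);
    close(fd);
    return on_disk;
}
```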
>
> > 65857 ide_intr 184.9916
>
> You have 1 ide_intr per 2 csum_partial_copy_generic... hmmm...
> how large is your readahead? I assume you'd like to fetch
> more sectors from ide per interrupt. (I hope you do DMA ;)
I'm doing DMA - it's RAID-0 with a 1MB chunk size across 4 disks.
> > 53636 handle_IRQ_event 432.5484
> > 21973 do_softirq 107.7108
> > 20498 e1000_intr 244.0238
>
> I know zero about networking, but why 120 000 csum_partial_copy_generic
> and only 20 000 nic interrupts? That may be abnormal.
Sorry - I don't know.
> The e1000 can very well do hardware checksumming on transmit.
>
> The missing piece of the puzzle is that his application is not
> using sendfile(), without which no transmit checksum offload
> can take place.
As far as I understand it, sendfile() won't do much good with large files.
Is this right?
We're talking about 3-6GB files here ...
roy
> '50 clients *each* streaming at ~4.4MBps', better make that clear,
> otherwise something is *very* broken. Also mention that you have an e1000
> card which does not do outgoing checksumming.
Just to clarify:
s/MBps/Mbps/
s/bps/bits per second/
> You'd think that a kernel would be able to do 250megabits of TCP checksums
> though.
>
> > ...adding the whole profile output - sorted by the first column this
> > time...
> >
> > 905182 total 0.4741
> > 121426 csum_partial_copy_generic 474.3203
> > 93633 default_idle 1800.6346
> > 74665 do_wp_page 111.1086
>
> Perhaps the 'copy' also entails grabbing the page from disk, leading to
> inflated csum_partial_copy_generic stats?
I really don't know. Just to clarify a little more - the server app uses
O_DIRECT to read the data before tossing it to the socket.
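For context, an O_DIRECT read needs a block-aligned buffer and transfer size; a minimal sketch (hypothetical helper, with a buffered fallback since some filesystems, e.g. tmpfs, refuse O_DIRECT):

```c
#define _GNU_SOURCE             /* for O_DIRECT on glibc */
#include <errno.h>
#include <fcntl.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

/* Read up to len bytes from path into a 4096-byte-aligned buffer,
 * trying O_DIRECT first (bypassing the page cache, as the server app
 * does) and falling back to a buffered open if the filesystem doesn't
 * support it. Returns bytes read or -1; caller frees *bufp. */
static ssize_t direct_read(const char *path, void **bufp, size_t len)
{
    int fd = open(path, O_RDONLY | O_DIRECT);
    if (fd < 0 && (errno == EINVAL || errno == EOPNOTSUPP))
        fd = open(path, O_RDONLY);      /* buffered fallback */
    if (fd < 0)
        return -1;

    /* O_DIRECT requires aligned buffer, offset, and length. */
    if (posix_memalign(bufp, 4096, len)) {
        close(fd);
        return -1;
    }
    ssize_t n = read(fd, *bufp, len);
    close(fd);
    return n;
}
```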
> Where are you serving from?
What do you mean?
roy
bert hubert wrote:
> > ...adding the whole profile output - sorted by the first column this time...
> >
> > 905182 total 0.4741
> > 121426 csum_partial_copy_generic 474.3203
> > 93633 default_idle 1800.6346
> > 74665 do_wp_page 111.1086
>
> Perhaps the 'copy' also entails grabbing the page from disk, leading to
> inflated csum_partial_copy_generic stats?
I think this is strictly a copy from user space to kernel space and vice
versa. This shouldn't include the disk access etc.
thanks,
Nivedita
On Wednesday 23 October 2002 16:59, Nivedita Singhvi wrote:
> bert hubert wrote:
> > > ...adding the whole profile output - sorted by the first column this
> > > time...
> > >
> > > 905182 total 0.4741
> > > 121426 csum_partial_copy_generic 474.3203
> > > 93633 default_idle 1800.6346
> > > 74665 do_wp_page 111.1086
> >
> > Perhaps the 'copy' also entails grabbing the page from disk, leading to
> > inflated csum_partial_copy_generic stats?
>
> I think this is strictly a copy from user space->kernel and vice versa.
> This shouldnt include the disk access etc.
Hm. I'm doing O_DIRECT reads (from disk), so it needs to be user -> kernel,
then. Any chance of using O_DIRECT to the socket?
On Wed, Oct 23, 2002 at 03:42:48PM +0200, Roy Sigurd Karlsbakk wrote:
> > The e1000 can very well do hardware checksumming on transmit.
> >
> > The missing piece of the puzzle is that his application is not
> > using sendfile(), without which no transmit checksum offload
> > can take place.
>
> As far as I've understood, sendfile() won't do much good with large files. Is
> this right?
I still refuse to believe that a 1.8GHz Pentium4 can only checksum
250 megabits/second. MD Raid5 does better, and they probably don't use a
checksum as braindead as that used by TCP.
If the checksumming is not the problem, the copying is, which would be a
weakness of your hardware. The function profiled does both the copying and
the checksumming.
But 250 megabits/second also seems low.
Dave?
Regards,
bert
On Wed, 23 Oct 2002, bert hubert wrote:
> On Wed, Oct 23, 2002 at 03:42:48PM +0200, Roy Sigurd Karlsbakk wrote:
> > > The e1000 can very well do hardware checksumming on transmit.
> > >
> > > The missing piece of the puzzle is that his application is not
> > > using sendfile(), without which no transmit checksum offload
> > > can take place.
> >
> > As far as I've understood, sendfile() won't do much good with large files. Is
> > this right?
>
> I still refuse to believe that a 1.8GHz Pentium4 can only checksum
> 250megabits/second. MD Raid5 does better and they probably don't use a
> checksum as braindead as that used by TCP.
>
> If the checksumming is not the problem, the copying is, which would be a
> weakness of your hardware. The function profiled does both the copying and
> the checksumming.
>
> But 250megabits/second also seems low.
>
> Dave?
>
An ordinary dual Pentium 400 MHz machine does this...
Calculating CPU speed...done
Testing checksum speed...done
Testing RAM copy...done
Testing I/O port speed...done
CPU Clock = 400 MHz
checksum speed = 685 Mb/s
RAM copy = 1549 Mb/s
I/O port speed = 654 kb/s
That "checksum speed = 685 Mb/s" line means 685 megaBYTES per second.
Cheers,
Dick Johnson
Penguin : Linux version 2.4.18 on an i686 machine (797.90 BogoMips).
Bush : The Fourth Reich of America
bert hubert wrote:
> I still refuse to believe that a 1.8GHz Pentium4 can only checksum
> 250megabits/second. MD Raid5 does better and they probably don't use a
> checksum as braindead as that used by TCP.
For what it's worth, I have been able to send and receive 400+ Mbps
of traffic, bi-directional, on the same machine (i.e., about 1600 Mbps
of payload across the PCI bus).
So, it's probably not the e1000 or networking code that is slowing you down.
(This was on a 64/66 PCI, dual-AMD 2GHz machine though.
Are you running only 32/33 PCI? If not, where did you find this motherboard!)
Have you tried just reading the information from disk and doing everything
except the final send/write/sendto? That would help determine whether it is
your file reads that are killing you.
Ben
--
Ben Greear <[email protected]> <Ben_Greear AT excite.com>
President of Candela Technologies Inc http://www.candelatech.com
ScryMUD: http://scry.wanfear.com http://scry.wanfear.com/~greear
bert hubert wrote:
> I still refuse to believe that a 1.8GHz Pentium4 can only checksum
> 250megabits/second. MD Raid5 does better and they probably don't use a
> checksum as braindead as that used by TCP.
>
> If the checksumming is not the problem, the copying is, which would be a
> weakness of your hardware. The function profiled does both the copying and
> the checksumming.
Yep, it's not so much the checksumming as the fact that this is
done over each byte of data as it is copied.
thanks,
Nivedita
On Wed, 23 Oct 2002, Nivedita Singhvi wrote:
> bert hubert wrote:
>
> > I still refuse to believe that a 1.8GHz Pentium4 can only checksum
> > 250megabits/second. MD Raid5 does better and they probably don't use a
> > checksum as braindead as that used by TCP.
> >
> > If the checksumming is not the problem, the copying is, which would be a
> > weakness of your hardware. The function profiled does both the copying and
> > the checksumming.
>
> Yep, its not so much the checksumming as the fact that this is
> done over each byte of data and copied.
>
> thanks,
> Nivedita
No. It's done over each word (short int) and the actual summation
takes place during the address calculation of the next word. This
gets you a checksum that is practically free.
A 400 MHz ix86 CPU will checksum/copy at 685 megabytes per second.
It will copy at 1,549 megabytes per second. Those are megaBYTES!
If you have slow network performance it has nothing to do with
either copy or checksum. Data transmission acts like a low-pass
filter. The dominant pole of that transfer function determines
the speed; that's why it's called dominant. If you measure
a data-rate of 10 megabytes/second, nothing you do with copy
or checksum will affect it to any significant extent.
If you have a data-rate of 100 megabytes per second, then any
tinkering with copy will have an effective improvement ratio
of 100/1,549 ~= 0.064. If you have a data rate of 100 megabytes
per second and you tinker with checksum, you get an improvement
ratio of 100/685 ~= 0.14. These are just not the things that are
affecting your performance.
If you were to double the checksumming speed, you increase the
throughput by 2 * 0.14 = 0.28 with the parameters shown.
The TCP/IP checksum is quite nice. It may have been discovered
by accident, but it's still nice. It works regardless of whether
you have a little endian or big endian machine. It also doesn't
wrap so you don't (usually) show a good checksum when the data
is bad. It does have the characteristic that if all the bits are
inverted, it will checksum good. However, there are not too many
real-world scenarios that would result in this inversion. So it's
not "brain-dead" as you state. A hardware checksum is really
quick because it's really easy.
Cheers,
Dick Johnson
"Richard B. Johnson" wrote:
> No. It's done over each word (short int) and the actual summation
> takes place during the address calculation of the next word. This
> gets you a checksum that is practically free.
Yep, sorry - word, not byte. My bad. The cost is in the fact
that this whole process involves loading each word of the data
stream into a register, which is why I, too, used to consider
the checksum cost negligible.
> A 400 MHz ix86 CPU will checksum/copy at 685 megabytes per second.
> It will copy at 1,549 megabytes per second. Those are megaBYTES!
But then why the difference between the checksum/copy and plain copy
speeds? Are you saying the checksum is not costing you 864 megabytes
a second?
thanks,
Nivedita
On Wed, 23 Oct 2002, Nivedita Singhvi wrote:
> "Richard B. Johnson" wrote:
>
> > No. It's done over each word (short int) and the actual summation
> > takes place during the address calculation of the next word. This
> > gets you a checksum that is practically free.
>
> Yep, sorry, word, not byte. My bad. The cost is in the fact
> that this whole process involves loading each word of the data
> stream into a register. Which is why I also used to consider
> the checksum cost as negligible.
>
> > A 400 MHz ix86 CPU will checksum/copy at 685 megabytes per second.
> > It will copy at 1,549 megabytes per second. Those are megaBYTES!
>
> But then why the difference in the checksum/copy and copy?
> Are you saying the checksum is not costing you 864 megabytes
> a second??
Costing you 864 megabytes per second?
Let's say the checksum was free. You would then be able to do INF bytes per second.
So is it costing you INF bytes per second? No, it's costing you nothing.
If we were not dealing with INF, then 'Cost' is approximately 1/N, not
N. Cost is work_done_without_checksum - work_done_with_checksum. Because
of the low-pass filter pole, these numbers are practically the same.
But, you can get a measurable difference between any two large numbers.
This makes the 'cost' seem high. You need to make it relative to make
any sense, so a 'goodness' can be expressed as a ratio of the cost and
the work having been done.
Cheers,
Dick Johnson
Penguin : Linux version 2.4.18 on an i686 machine (797.90 BogoMips).
Bush : The Fourth Reich of America
On Wed, 2002-10-23 at 06:42, Roy Sigurd Karlsbakk wrote:
> As far as I've understood, sendfile() won't do much good with large files. Is
> this right?
There is always a benefit to using sendfile(): when you use
sendfile(), the CPU doesn't touch one byte of the data if
the network card supports TX checksumming. The disk DMAs
to RAM, then the net card DMAs from RAM. Simple as that.
On Thursday 24 October 2002 06:11, David S. Miller wrote:
> On Wed, 2002-10-23 at 06:42, Roy Sigurd Karlsbakk wrote:
> > As far as I've understood, sendfile() won't do much good with large
> > files. Is this right?
>
> There is always a benefit to using sendfile(): when you use
> sendfile(), the CPU doesn't touch one byte of the data if
> the network card supports TX checksumming. The disk DMAs
> to RAM, then the net card DMAs from RAM. Simple as that.
Are there any plans of implementing sendfile64() or sendfile() support for
-D_FILE_OFFSET_BITS=64?
(from man 2 sendfile)
ssize_t sendfile(int out_fd, int in_fd, off_t *offset, size_t count);
#include <sys/sendfile.h>
#include <stdio.h>

int main() {
	ssize_t s1;
	size_t count;
	off_t offset;
	printf("sizeof ssize_t: %d\n", (int) sizeof s1);
	printf("sizeof size_t: %d\n", (int) sizeof count);
	printf("sizeof off_t: %d\n", (int) sizeof offset);
	return 0;
}
$ make
...
$ ./sendfile_test
sizeof ssize_t: 4
sizeof size_t: 4
sizeof off_t: 4
$
and - when attempting to build this with -D_FILE_OFFSET_BITS=64
[roy@roy-sin micro_httpd-O_DIRECT]$ make sendfile_test
gcc -D_DEBUG -Wall -W -D_GNU_SOURCE -D_NO_DIR_ACCESS -D_FILE_OFFSET_BITS=64
-D_LARGEFILE_SOURCE -DUSE_O_DIRECT -DINETD -Wno-unused -O0 -ggdb -c
sendfile_test.c
In file included from sendfile_test.c:1:
/usr/include/sys/sendfile.h:26: #error "<sys/sendfile.h> cannot be used with
_FILE_OFFSET_BITS=64"
make: *** [sendfile_test.o] Error 1
--
Roy Sigurd Karlsbakk, Datavaktmester
ProntoTV AS - http://www.pronto.tv/
Tel: +47 9801 3356
Computers are like air conditioners.
They stop working when you open Windows.
On Thu, 2002-10-24 at 03:30, Roy Sigurd Karlsbakk wrote:
> Are there any plans of implementing sendfile64() or sendfile() support for
> -D_FILE_OFFSET_BITS=64?
This is old hat, and appears in every current vendor kernel I am
aware of and is in 2.5.x as well.
On Thursday 24 October 2002 12:47, David S. Miller wrote:
> On Thu, 2002-10-24 at 03:30, Roy Sigurd Karlsbakk wrote:
> > Are there any plans of implementing sendfile64() or sendfile() support
> > for -D_FILE_OFFSET_BITS=64?
>
> This is old hat, and appears in every current vendor kernel I am
> aware of and is in 2.5.x as well.
then where can I find these patches? I cannot use 2.5, and I usually try to
stick with an official kernel.
and - if this patch has been around all this time...
why isn't it in the official kernel yet?
--
Roy Sigurd Karlsbakk, Datavaktmester
ProntoTV AS - http://www.pronto.tv/
Tel: +47 9801 3356
Computers are like air conditioners.
They stop working when you open Windows.
On 23 October 2002 11:36, Roy Sigurd Karlsbakk wrote:
> > > 905182 total 0.4741
> > > 121426 csum_partial_copy_generic 474.3203
> >
> > Well, maybe take a look at this func and try to optimize it?
>
> I don't know assembly that good - sorry.
Well, I like it. Maybe I can look into it. Feel free
to bug me :-)
> > > 93633 default_idle 1800.6346
> > > 74665 do_wp_page 111.1086
> >
> > What's this?
>
> do_wp_page is defined as a function in mm/memory.c
>
> comments from the file:
> [snip]
Please delete memory.o, rerun make bzImage, capture the gcc
command used for compiling memory.c, and modify it:
gcc ... -o memory.o -> gcc ... -S -o memory.s ...
then examine the assembler code. Maybe something will stick out.
(Or use objdump to disassemble memory.o; objdump -d -S produces
assembler output with the C code intermixed as comments.)
(Send the disassembled listing to me off-list.)
> > > 65857 ide_intr 184.9916
> >
> > You have 1 ide_intr per 2 csum_partial_copy_generic... hmmm...
> > how large is your readahead? I assume you'd like to fetch
> > more sectors from ide per interrupt. (I hope you do DMA ;)
>
> doing DMA - RAID-0 with 1MB chunk size on 4 disks.
You should aim at maxing out IDE performance.
Please find out how many sectors you read in one go.
Maybe:
# cat /proc/interrupts
# dd bs=1M count=1 if=/dev/hda of=/dev/null
# cat /proc/interrupts
and calculate how many IDE interrupts happened. (1 MB = 2048 sectors)
--
vda
On Thu, Oct 24, 2002 at 02:22:25PM -0200, Denis Vlasenko wrote:
> Please delete memory.o, rerun make bzImage, capture gcc
> command used for compiling memory.c, modify it:
>
> gcc ... -o memory.o -> gcc ... -S -o memory.s ...
Have you tried make mm/memory.s ?
--
Russell King ([email protected]) The developer of ARM Linux
http://www.arm.linux.org.uk/personal/aboutme.html
On 24 October 2002 09:50, Russell King wrote:
> On Thu, Oct 24, 2002 at 02:22:25PM -0200, Denis Vlasenko wrote:
> > Please delete memory.o, rerun make bzImage, capture gcc
> > command used for compiling memory.c, modify it:
> >
> > gcc ... -o memory.o -> gcc ... -S -o memory.s ...
>
> Have you tried make mm/memory.s ?
No ;) but I have a feeling it will produce that file ;)))
I'm experimenting with different csum_ routines in userspace now.
--
vda
/me said:
> I'm experimenting with different csum_ routines in userspace now.
Short conclusion:
1. It is possible to speed up csum routines for AMD processors by 30%.
2. It is possible to speed up csum_copy routines for both AMD and Intel
three times or more. Roy, do you like that? ;)
Tests: they checksum a 4 MB block and csum_copy 2 MB into the second 2 MB.
POISON=0/1 controls whether to perform correctness tests or not.
That slows the test down very noticeably. What does glibc use for
memset/memcmp? A for() loop?!!
With POISON=1, ntqpf2_copy bugs out; see its source. I left it in
to save others from repeating my work. BTW, I do NOT understand why
it does not work. ;) Anyone with a cluebat?
IMHO the only way to make it optimal for all CPUs is to make these
functions race at kernel init and pick the best one.
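That init-time race could look roughly like this (userspace sketch, names are mine; the kernel already does something similar to pick a RAID xor routine):

```c
#include <stdint.h>
#include <stddef.h>
#include <time.h>

typedef uint32_t (*csum_fn)(const uint8_t *buf, size_t len);

/* Two trivial stand-in candidates; the real ones would be the
   csum_ variants under test. */
static uint32_t csum_bytes(const uint8_t *buf, size_t len)
{
    uint32_t s = 0;
    for (size_t i = 0; i < len; i++)
        s += buf[i];
    return s;
}

static uint32_t csum_unrolled(const uint8_t *buf, size_t len)
{
    uint32_t s = 0;
    size_t i = 0;
    for (; i + 4 <= len; i += 4)
        s += buf[i] + buf[i + 1] + buf[i + 2] + buf[i + 3];
    for (; i < len; i++)
        s += buf[i];
    return s;
}

/* Time each candidate on a scratch buffer and keep the fastest. */
static csum_fn pick_fastest(csum_fn cand[], int n,
                            const uint8_t *buf, size_t len)
{
    csum_fn best = cand[0];
    clock_t best_t = 0;
    for (int i = 0; i < n; i++) {
        volatile uint32_t sink = 0;     /* keep the calls alive */
        clock_t t0 = clock();
        for (int r = 0; r < 64; r++)
            sink += cand[i](buf, len);
        clock_t dt = clock() - t0;
        if (i == 0 || dt < best_t) {
            best_t = dt;
            best = cand[i];
        }
    }
    return best;
}
```

Whichever pointer pick_fastest() returns is then installed as the checksum routine for the rest of the run.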
Tests on Celeron 1200 (100 MHz FSB, x12 multiplier)
===================================================
Csum benchmark program
buffer size: 4 Mb
Each test tried 16 times, max and min CPU cycles are reported.
Please disregard max values. They are due to system interference only.
csum tests:
kernel_csum - took 717 max, 704 min cycles per kb. sum=0x44000077
kernel_csum - took 4760 max, 704 min cycles per kb. sum=0x44000077
kernel_csum - took 722 max, 704 min cycles per kb. sum=0x44000077
kernelpii_csum - took 539 max, 528 min cycles per kb. sum=0x44000077
kernelpiipf_csum - took 573 max, 529 min cycles per kb. sum=0x44000077
pfm_csum - took 1411 max, 1306 min cycles per kb. sum=0x44000077
pfm2_csum - took 875 max, 762 min cycles per kb. sum=0x44000077
copy tests:
kernel_copy - took 5738 max, 3423 min cycles per kb. sum=0x99aaaacc
kernel_copy - took 3517 max, 3431 min cycles per kb. sum=0x99aaaacc
kernel_copy - took 4385 max, 3432 min cycles per kb. sum=0x99aaaacc
kernelpii_copy - took 2912 max, 2752 min cycles per kb. sum=0x99aaaacc
ntqpf_copy - took 2010 max, 1700 min cycles per kb. sum=0x99aaaacc
ntqpfm_copy - took 1749 max, 1701 min cycles per kb. sum=0x99aaaacc
ntq_copy - took 2218 max, 2141 min cycles per kb. sum=0x99aaaacc
BAD copy! <-- ntqpf2_copy is buggy :) see its source
'copy tests' above are with POISON=1
These are with POISON=0:
kernel_copy - took 2009 max, 1935 min cycles per kb. sum=0x44000077
kernel_copy - took 2240 max, 1959 min cycles per kb. sum=0x44000077
kernel_copy - took 2197 max, 1936 min cycles per kb. sum=0x44000077
kernelpii_copy - took 2121 max, 1939 min cycles per kb. sum=0x44000077
ntqpf_copy - took 667 max, 548 min cycles per kb. sum=0x44000077
ntqpfm_copy - took 651 max, 546 min cycles per kb. sum=0x44000077
ntq_copy - took 660 max, 545 min cycles per kb. sum=0x44000077
ntqpf2_copy - took 644 max, 548 min cycles per kb. sum=0x44000077
Done
Tests on Duron 650 (100 MHz FSB, x6.5 multiplier)
=================================================
Csum benchmark program
buffer size: 4 Mb
Each test tried 16 times, max and min CPU cycles are reported.
Please disregard max values. They are due to system interference only.
csum tests:
kernel_csum - took 1090 max, 1051 min cycles per kb. sum=0x44000077
kernel_csum - took 1080 max, 1052 min cycles per kb. sum=0x44000077
kernel_csum - took 1178 max, 1058 min cycles per kb. sum=0x44000077
kernelpii_csum - took 1614 max, 1052 min cycles per kb. sum=0x44000077
kernelpiipf_csum - took 976 max, 962 min cycles per kb. sum=0x44000077
pfm_csum - took 755 max, 746 min cycles per kb. sum=0x44000077
pfm2_csum - took 749 max, 745 min cycles per kb. sum=0x44000077
copy tests:
kernel_copy - took 1251 max, 1072 min cycles per kb. sum=0x99aaaacc
kernel_copy - took 1363 max, 1072 min cycles per kb. sum=0x99aaaacc
kernel_copy - took 1352 max, 1072 min cycles per kb. sum=0x99aaaacc
kernelpii_copy - took 1132 max, 1014 min cycles per kb. sum=0x99aaaacc
ntqpf_copy - took 514 max, 480 min cycles per kb. sum=0x99aaaacc
ntqpfm_copy - took 495 max, 482 min cycles per kb. sum=0x99aaaacc
ntq_copy - took 1153 max, 948 min cycles per kb. sum=0x99aaaacc
BAD copy! <-- ntqpf2_copy is buggy :) see its source
'copy tests' above are with POISON=1
These are with POISON=0:
kernel_copy - took 1145 max, 871 min cycles per kb. sum=0x44000077
kernel_copy - took 879 max, 871 min cycles per kb. sum=0x44000077
kernel_copy - took 876 max, 871 min cycles per kb. sum=0x44000077
kernelpii_copy - took 1019 max, 845 min cycles per kb. sum=0x44000077
ntqpf_copy - took 2972 max, 229 min cycles per kb. sum=0x44000077
ntqpfm_copy - took 248 max, 245 min cycles per kb. sum=0x44000077
ntq_copy - took 460 max, 452 min cycles per kb. sum=0x44000077
ntqpf2_copy - took 390 max, 340 min cycles per kb. sum=0x44000077
Done
--
vda
>>>>> "Denis" == Denis Vlasenko <[email protected]> writes:
Denis> /me said:
>> I'm experimenting with different csum_ routines in userspace now.
Denis> Short conclusion:
Denis> 1. It is possible to speed up csum routines for AMD processors by 30%.
Denis> 2. It is possible to speed up csum_copy routines for both AMD and Intel
Denis> three times or more. Roy, do you like that? ;)
Additional data point:
Short summary:
1. Checksum - kernelpii_csum is ~19% faster
2. Copy - kernelpii_copy is ~6% faster
Dual Pentium III, 1266 MHz, 512K cache, 2 GB SDRAM (133 MHz, ECC)
The only changes I made were to decrease the buffer size to 1K (as I
think this is more representative of a network packet size, correct me
if I'm wrong) and increase the runs to 1024. Max values are worthless
indeed.
Csum benchmark program
buffer size: 1 K
Each test tried 1024 times, max and min CPU cycles are reported.
Please disregard max values. They are due to system interference only.
csum tests:
kernel_csum - took 941 max, 740 min cycles per kb. sum=0x44000077
kernel_csum - took 748 max, 742 min cycles per kb. sum=0x44000077
kernel_csum - took 60559 max, 742 min cycles per kb. sum=0x44000077
kernelpii_csum - took 52804 max, 601 min cycles per kb. sum=0x44000077
kernelpiipf_csum - took 12930 max, 601 min cycles per kb. sum=0x44000077
pfm_csum - took 10161 max, 1402 min cycles per kb. sum=0x44000077
pfm2_csum - took 864 max, 838 min cycles per kb. sum=0x44000077
copy tests:
kernel_copy - took 339 max, 239 min cycles per kb. sum=0x44000077
kernel_copy - took 239 max, 239 min cycles per kb. sum=0x44000077
kernel_copy - took 239 max, 239 min cycles per kb. sum=0x44000077
kernelpii_copy - took 244 max, 225 min cycles per kb. sum=0x44000077
ntqpf_copy - took 10867 max, 512 min cycles per kb. sum=0x44000077
ntqpfm_copy - took 710 max, 403 min cycles per kb. sum=0x44000077
ntq_copy - took 4535 max, 443 min cycles per kb. sum=0x44000077
ntqpf2_copy - took 563 max, 555 min cycles per kb. sum=0x44000077
Done
HOWEVER ...
sometimes (say 1/30) I get the following output:
Csum benchmark program
buffer size: 1 K
Each test tried 1024 times, max and min CPU cycles are reported.
Please disregard max values. They are due to system interference only.
csum tests:
kernel_csum - took 958 max, 740 min cycles per kb. sum=0x44000077
kernel_csum - took 748 max, 740 min cycles per kb. sum=0x44000077
kernel_csum - took 752 max, 740 min cycles per kb. sum=0x44000077
kernelpii_csum - took 624 max, 600 min cycles per kb. sum=0x44000077
kernelpiipf_csum - took 877211 max, 601 min cycles per kb. sum=0x44000077
Bad sum
Aborted
which is to say that pfm_csum and pfm2_csum results are not to be
trusted (at least on PIII (or my kernel CONFIG_MPENTIUMIII=y
config?)).
~velco
[please drop libc from CC:]
On 25 October 2002 05:48, Momchil Velikov wrote:
>> Short conclusion:
>> 1. It is possible to speed up csum routines for AMD processors
>> by 30%.
>> 2. It is possible to speed up csum_copy routines for both AMD
>> and Intel three times or more.
> Additional data point:
>
> Short summary:
> 1. Checksum - kernelpii_csum is ~19% faster
> 2. Copy - lernelpii_csum is ~6% faster
>
> Dual Pentium III, 1266Mhz, 512K cache, 2G SDRAM (133Mhz, ECC)
>
> The only changes I made were to decrease the buffer size to 1K (as I
> think this is more representative to a network packet size, correct
> me if I'm wrong) and increase the runs to 1024. Max values are
> worthless indeed.
Well, that makes it run entirely in L0 cache. This is unrealistic
for actual use. movntq is x3 faster when you hit RAM instead of L0.
You need to be more clever than that - generate pseudo-random
offsets in large buffer and run on ~1K pieces of that buffer.
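Something like this (a sketch; the constants are the common Numerical Recipes LCG):

```c
#include <stdint.h>
#include <stddef.h>

#define BUF_SZ  (16u * 1024 * 1024)  /* well past any L2 cache here */
#define PIECE   1024u

/* Repeatable pseudo-random stream (Numerical Recipes LCG). */
static uint32_t lcg(uint32_t *state)
{
    *state = *state * 1664525u + 1013904223u;
    return *state;
}

/* Sum ~1K pieces at pseudo-random offsets so most reads miss cache. */
static uint32_t bench_pass(const uint8_t *buf, uint32_t *seed, int rounds)
{
    uint32_t sum = 0;
    for (int i = 0; i < rounds; i++) {
        size_t off = (lcg(seed) % (BUF_SZ - PIECE)) & ~(size_t)7;
        for (size_t j = 0; j < PIECE; j++)
            sum += buf[off + j];
    }
    return sum;
}
```

Drop the csum_/copy candidates in place of the inner loop and wall-clock each pass.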
> HOWEVER ...
>
> sometimes (say 1/30) I get the following output:
Csum benchmark program
buffer size: 1 K
Each test tried 1024 times, max and min CPU cycles are reported.
Please disregard max values. They are due to system interference only.
csum tests:
kernel_csum - took 958 max, 740 min cycles per kb. sum=0x44000077
kernel_csum - took 748 max, 740 min cycles per kb. sum=0x44000077
kernel_csum - took 752 max, 740 min cycles per kb. sum=0x44000077
kernelpii_csum - took 624 max, 600 min cycles per kb. sum=0x44000077
kernelpiipf_csum - took 877211 max, 601 min cycles per kb. sum=0x44000077
Bad sum
Aborted
> which is to say that pfm_csum and pfm2_csum results are not to be
> trusted (at least on PIII (or my kernel CONFIG_MPENTIUMIII=y
> config?)).
No, it's my fault. Those routines are fast hacks; they can
actually checksum too little. I didn't get around to handling arbitrary
buffer lengths, assuming instead a large power of two. See the source.
--
vda
On Fri, 2002-10-25 at 14:59, Denis Vlasenko wrote:
> Well, that makes it run entirely in L0 cache. This is unrealistic
> for actual use. movntq is x3 faster when you hit RAM instead of L0.
>
> You need to be more clever than that - generate pseudo-random
> offsets in large buffer and run on ~1K pieces of that buffer.
In a lot of cases it's extremely realistic to assume the network buffers
are in cache. The copy/csum path is often touching just generated data,
or data we just accessed via read(). The csum RX path from a card with
DMA is probably somewhat different.
On 25 October 2002 08:19, Alan Cox wrote:
> On Fri, 2002-10-25 at 14:59, Denis Vlasenko wrote:
> > Well, that makes it run entirely in L0 cache. This is unrealistic
> > for actual use. movntq is x3 faster when you hit RAM instead of L0.
> >
> > You need to be more clever than that - generate pseudo-random
> > offsets in large buffer and run on ~1K pieces of that buffer.
>
> In a lot of cases it's extremely realistic to assume the network
> buffers are in cache. The copy/csum path is often touching just
> generated data, or data we just accessed via read(). The csum RX path
> from a card with DMA is probably somewhat different.
'Touching' is not interesting, since it will pump data
into the cache no matter how you 'touch' it.
Running benchmarks against a static 1K buffer makes the cache red hot
and causes _all writes_ to hit it. That can lead to wrong conclusions.
Is _dst_ buffer of csum_copy going to be used by processor soon?
If yes, we shouldn't use movntq, we want to cache dst.
If no, we should by all means use movntq.
If sometimes, then optimal strategy does not exist. :(
--
vda
On Fri, 2002-10-25 at 13:36, Denis Vlasenko wrote:
On a VIA Ezra 667 I get this:
Csum benchmark program
buffer size: 4 Mb
Each test tried 16 times, max and min CPU cycles are reported.
Please disregard max values. They are due to system interference only.
csum tests:
kernel_csum - took 2739 max, 2727 min cycles per kb. sum=0x44000077
kernel_csum - took 2733 max, 2727 min cycles per kb. sum=0x44000077
kernel_csum - took 2733 max, 2727 min cycles per kb. sum=0x44000077
kernelpii_csum - took 2691 max, 2686 min cycles per kb. sum=0x44000077
copy tests:
kernel_copy - took 2044 max, 2014 min cycles per kb. sum=0x44000077
kernel_copy - took 2026 max, 2016 min cycles per kb. sum=0x44000077
kernel_copy - took 2061 max, 2016 min cycles per kb. sum=0x44000077
kernelpii_copy - took 1526 max, 1523 min cycles per kb. sum=0x44000077
Done
The nt* functions do not work on this CPU.
--
Servus,
Daniel