Date: Sat, 22 Mar 2003 17:58:16 -0800
From: Andrew Morton <akpm@digeo.com>
To: linux-kernel@vger.kernel.org
Subject: smp overhead, and rwlocks considered harmful
Message-Id: <20030322175816.225a1f23.akpm@digeo.com>

I've been looking at the CPU cost of the write() system call.

Time how long it takes to write a million bytes to an ext2 file, via a
million one-byte writes:

	time dd if=/dev/zero of=foo bs=1 count=1M

This only uses one CPU.  It takes twice as long on SMP.

On a 2.7GHz P4-HT:

2.5.65-mm4, UP:    0.34s user 1.00s system 99% cpu 1.348 total
2.5.65-mm4, SMP:   0.41s user 2.04s system 100% cpu 2.445 total
2.4.21-pre5, UP:   0.34s user 0.96s system 106% cpu 1.224 total
2.4.21-pre5, SMP:  0.42s user 1.95s system 99% cpu 2.372 total

(The small additional overhead in 2.5 is expected - there are more
function calls due to the addition of AIO, and there is more setup due to
the (large) writev speedups).

On a 500MHz PIII Xeon:

500MHz PIII, UP:   1.08s user 2.90s system 100% cpu 3.971 total
500MHz PIII, SMP:  1.13s user 4.86s system 99% cpu 5.999 total

This is pretty gross.

About six months back I worked out that across the lifecycle of a
pagecache page (creation via write() through to reclaim via the page LRU)
we take 27 spinlocks and rwlocks.  And that does not even include
semaphores and atomic bitops (I used lockmeter).  I'm not sure it is
still that high - quite a few things have been fixed up - but it is
still high.

Profiles for 2.5 on the P4 show that it's all in fget(), fput() and
find_get_page().  Those locked operations are really hurting.

One thing is noteworthy: ia32's read_unlock() is bus-locked, whereas
spin_unlock() is not.  So let's see what happens if we convert file_lock
from an rwlock to a spinlock:

2.5.65-mm4, SMP:   0.34s user 2.00s system 100% cpu 2.329 total

That's a 5% speedup.  And if we were to convert file->f_count to be a
nonatomic "int", protected by files_lock, it would probably speed things
up further.

I've always been a bit skeptical about rwlocks - if you're holding the
lock long enough to get a significant amount of reader concurrency,
you're holding it for too long.  eg: tasklist_lock.
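
To make the read_unlock()/spin_unlock() point concrete, here is roughly
what the 2.5-era include/asm-i386 lock primitives boil down to
(paraphrased from memory, not a verbatim copy of the headers; details
may differ slightly):

	/* ia32 spin_unlock(): a plain store is enough.  The lock holder is
	 * the only writer of the lock word, so releasing it does not need a
	 * bus-locked cycle. */
	asm volatile("movb $1,%0" : "=m" (lock->lock) : : "memory");

	/* ia32 read_unlock(): the reader count is shared with other readers
	 * and with writers, so dropping it must be an atomic
	 * read-modify-write - a bus-locked instruction on every unlock. */
	asm volatile("lock ; incl %0" : "=m" (rw->lock) : : "memory");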
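
For reference, the file_lock conversion being timed above amounts to the
obvious transformation in fget() and friends.  A sketch against the
2.4/2.5-style code (paraphrased, not the actual patch):

	struct file *fget(unsigned int fd)
	{
		struct files_struct *files = current->files;
		struct file *file;

		spin_lock(&files->file_lock);	/* was: read_lock() */
		file = fcheck(fd);
		if (file)
			get_file(file);
		spin_unlock(&files->file_lock);	/* was: read_unlock(), the
						   bus-locked "lock; incl" */
		return file;
	}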
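
And the f_count idea, purely as an illustration (a nonatomic f_count
protected by files_lock is a suggestion here, not existing code; the
helper name below is made up):

	/* Today: get_file() is a bus-locked atomic increment. */
	#define get_file(x)	atomic_inc(&(x)->f_count)

	/* Hypothetical variant: with f_count as a plain int protected by a
	 * lock already taken on this path, the reference bump becomes an
	 * ordinary increment - one fewer locked operation per write(). */
	static inline void get_file_locked(struct file *file)
	{
		/* caller holds the lock protecting f_count */
		file->f_count++;
	}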
SMP:

c0254a14 read_zero                          189     0.3841
c0250f5c clear_user                         217     3.3906
c0148a64 sys_write                          251     3.9219
c0148a24 sys_read                           268     4.1875
c014b00c __block_prepare_write              485     0.5226
c0130dd4 unlock_page                        493     7.7031
c0120220 current_kernel_time                560     8.7500
c0163190 __mark_inode_dirty                 775     3.5227
c01488ec vfs_write                          983     3.1506
c0132d9c generic_file_write                 996    10.3750
c01486fc vfs_read                          1010     3.2372
c014b3ac __block_commit_write              1100     7.6389
c0149500 fput                              1110    39.6429
c01322e4 generic_file_aio_write_nolock     1558     0.6292
c0130fc0 find_lock_page                    2301    13.0739
c01495f0 fget                              2780    33.0952
c0108ddc system_call                       4047    91.9773
c0106f64 default_idle                     24624   473.5385
00000000 total                            45321     0.0200

UP:

c023c9ac radix_tree_lookup                   13     0.1711
c015362c inode_times_differ                  14     0.1944
c015372c inode_update_time                   14     0.1000
c012cb9c generic_file_write                  15     0.1630
c012ca90 generic_file_write_nolock           17     0.1214
c0140e1c fget                                17     0.3542
c023deb8 __copy_from_user_ll                 17     0.1545
c0241514 read_zero                           17     0.0346
c014001c vfs_read                            18     0.0682
c01401dc vfs_write                           20     0.0758
c01430d8 generic_commit_write                20     0.1786
c01402e4 sys_read                            21     0.3281
c0142968 __block_commit_write                29     0.2014
c023dc4c clear_user                          34     0.5312
c0140324 sys_write                           38     0.5938
c01425cc __block_prepare_write               39     0.0422
c012c0f4 generic_file_aio_write_nolock       89     0.0362
c0108b54 system_call                        406     9.2273
00000000 total                              944     0.0004