2005-04-28 18:31:16

by Benjamin LaHaise

Subject: [RFC] unify semaphore implementations

Hello all,

Please review the following series of patches for unifying the
semaphore implementation across all architectures (not posted as
they're about 350K), as they have only been tested on x86-64. The
code generated is functionally identical to the earlier i386
variant, but since gcc has no way of taking condition codes as
results, there are two additional instructions inserted from the
use of generic atomic operations. All told the >6000 lines of code
deleted makes for a much easier job for subsequent patches changing
semaphore functionality. Cheers,

-ben

http://www.kvack.org/~bcrl/patches/sem-cleanup-A2/10-rename_semaphore_h.diff
Introduce linux/semaphore.h. Convert all users of asm/semaphore.h over
to linux/semaphore.h.

http://www.kvack.org/~bcrl/patches/sem-cleanup-A2/20-move_rwlock.diff
Move i386 rwlock helper functions out of semaphore.c and into their own
file rwlock.c.

http://www.kvack.org/~bcrl/patches/sem-cleanup-A2/30-one_semaphore.diff
Replace all semaphore implementations with a single implementation
derived from the i386 code using atomic operations. Tested on x86-64,
compiled on i386 and ia64.
--
"Time is what keeps everything from happening all at once." -- John Wheeler

sem-10/arch/alpha/kernel/alpha_ksyms.c | 2
sem-10/arch/arm/common/rtctime.c | 2
sem-10/arch/arm/kernel/semaphore.c | 5
sem-10/arch/arm/mach-integrator/clock.c | 2
sem-10/arch/arm/mach-omap/clock.c | 2
sem-10/arch/arm/mach-versatile/clock.c | 2
sem-10/arch/arm/oprofile/common.c | 2
sem-10/arch/arm26/kernel/armksyms.c | 2
sem-10/arch/arm26/kernel/semaphore.c | 5
sem-10/arch/arm26/kernel/traps.c | 2
sem-10/arch/cris/kernel/crisksyms.c | 2
sem-10/arch/cris/kernel/semaphore.c | 4
sem-10/arch/frv/kernel/frv_ksyms.c | 2
sem-10/arch/frv/kernel/semaphore.c | 2
sem-10/arch/h8300/kernel/h8300_ksyms.c | 2
sem-10/arch/h8300/kernel/semaphore.c | 4
sem-10/arch/i386/kernel/cpu/common.c | 2
sem-10/arch/i386/kernel/cpu/proc.c | 2
sem-10/arch/i386/kernel/i386_ksyms.c | 2
sem-10/arch/i386/kernel/semaphore.c | 4
sem-10/arch/ia64/ia32/sys_ia32.c | 2
sem-10/arch/ia64/kernel/ia64_ksyms.c | 2
sem-10/arch/ia64/kernel/salinfo.c | 2
sem-10/arch/ia64/kernel/semaphore.c | 4
sem-10/arch/ia64/sn/kernel/sn2/sn_hwperf.c | 2
sem-10/arch/m32r/kernel/m32r_ksyms.c | 2
sem-10/arch/m32r/kernel/semaphore.c | 4
sem-10/arch/m68k/atari/stram.c | 2
sem-10/arch/m68k/kernel/m68k_ksyms.c | 2
sem-10/arch/m68k/kernel/semaphore.c | 4
sem-10/arch/m68k/sun3/intersil.c | 2
sem-10/arch/m68knommu/kernel/m68k_ksyms.c | 2
sem-10/arch/m68knommu/kernel/semaphore.c | 4
sem-10/arch/mips/kernel/semaphore.c | 2
sem-10/arch/mips/lasat/picvue.h | 2
sem-10/arch/mips/sgi-ip27/ip27-console.c | 2
sem-10/arch/parisc/kernel/parisc_ksyms.c | 2
sem-10/arch/parisc/kernel/semaphore.c | 4
sem-10/arch/parisc/kernel/sys_parisc32.c | 2
sem-10/arch/ppc/kernel/ppc_ksyms.c | 2
sem-10/arch/ppc/kernel/semaphore.c | 2
sem-10/arch/ppc/kernel/syscalls.c | 2
sem-10/arch/ppc/syslib/ocp.c | 2
sem-10/arch/ppc64/kernel/rtas.c | 2
sem-10/arch/ppc64/kernel/semaphore.c | 2
sem-10/arch/ppc64/kernel/sys_ppc32.c | 2
sem-10/arch/ppc64/kernel/syscalls.c | 2
sem-10/arch/ppc64/mm/imalloc.c | 2
sem-10/arch/s390/kernel/compat_linux.c | 2
sem-10/arch/s390/kernel/debug.c | 2
sem-10/arch/s390/kernel/semaphore.c | 2
sem-10/arch/sh/kernel/semaphore.c | 6
sem-10/arch/sh/kernel/sh_ksyms.c | 2
sem-10/arch/sh/mm/pg-dma.c | 2
sem-10/arch/sh64/kernel/semaphore.c | 6
sem-10/arch/sh64/kernel/sh_ksyms.c | 2
sem-10/arch/sparc/kernel/semaphore.c | 4
sem-10/arch/sparc64/kernel/sys_sparc32.c | 2
sem-10/arch/um/drivers/port_kern.c | 2
sem-10/arch/um/drivers/xterm_kern.c | 2
sem-10/arch/um/include/line.h | 2
sem-10/arch/um/kernel/trap_kern.c | 2
sem-10/arch/um/sys-i386/ksyms.c | 2
sem-10/arch/v850/kernel/semaphore.c | 4
sem-10/arch/v850/kernel/syscalls.c | 2
sem-10/arch/v850/kernel/v850_ksyms.c | 2
sem-10/arch/x86_64/ia32/sys_ia32.c | 2
sem-10/arch/x86_64/kernel/semaphore.c | 4
sem-10/arch/x86_64/kernel/x8664_ksyms.c | 2
sem-10/drivers/atm/idt77252.c | 2
sem-10/drivers/base/core.c | 2
sem-10/drivers/base/firmware_class.c | 2
sem-10/drivers/base/power/shutdown.c | 2
sem-10/drivers/block/cryptoloop.c | 2
sem-10/drivers/block/sx8.c | 2
sem-10/drivers/char/generic_serial.c | 2
sem-10/drivers/char/ipmi/ipmi_devintf.c | 2
sem-10/drivers/char/ipmi/ipmi_poweroff.c | 2
sem-10/drivers/char/rio/rioboot.c | 2
sem-10/drivers/char/rio/riocmd.c | 2
sem-10/drivers/char/rio/rioctrl.c | 2
sem-10/drivers/char/rio/rioinit.c | 2
sem-10/drivers/char/rio/riointr.c | 2
sem-10/drivers/char/rio/rioparam.c | 2
sem-10/drivers/char/rio/rioroute.c | 2
sem-10/drivers/char/rio/riotable.c | 2
sem-10/drivers/char/rio/riotty.c | 2
sem-10/drivers/char/rocket.c | 2
sem-10/drivers/char/ser_a2232.c | 2
sem-10/drivers/char/snsc.h | 2
sem-10/drivers/char/watchdog/sc1200wdt.c | 2
sem-10/drivers/fc4/fc.c | 2
sem-10/drivers/i2c/busses/i2c-ali1535.c | 2
sem-10/drivers/ieee1394/eth1394.c | 2
sem-10/drivers/ieee1394/hosts.h | 2
sem-10/drivers/ieee1394/ieee1394_core.c | 2
sem-10/drivers/ieee1394/ieee1394_core.h | 2
sem-10/drivers/ieee1394/ieee1394_types.h | 2
sem-10/drivers/infiniband/core/device.c | 2
sem-10/drivers/infiniband/core/user_mad.c | 2
sem-10/drivers/infiniband/hw/mthca/mthca_dev.h | 2
sem-10/drivers/infiniband/hw/mthca/mthca_memfree.h | 2
sem-10/drivers/infiniband/ulp/ipoib/ipoib.h | 2
sem-10/drivers/input/joystick/iforce/iforce.h | 2
sem-10/drivers/macintosh/adb.c | 2
sem-10/drivers/media/dvb/dvb-core/dmxdev.h | 2
sem-10/drivers/media/dvb/dvb-core/dvb_demux.h | 2
sem-10/drivers/media/dvb/dvb-core/dvb_frontend.c | 2
sem-10/drivers/media/dvb/ttpci/av7110.c | 2
sem-10/drivers/media/dvb/ttusb-budget/dvb-ttusb-budget.c | 2
sem-10/drivers/media/dvb/ttusb-dec/ttusb_dec.c | 2
sem-10/drivers/media/radio/miropcm20-rds-core.c | 2
sem-10/drivers/media/radio/radio-aimslab.c | 2
sem-10/drivers/media/radio/radio-maestro.c | 2
sem-10/drivers/media/radio/radio-maxiradio.c | 2
sem-10/drivers/media/radio/radio-sf16fmi.c | 2
sem-10/drivers/media/radio/radio-sf16fmr2.c | 2
sem-10/drivers/media/video/arv.c | 2
sem-10/drivers/media/video/bw-qcam.c | 2
sem-10/drivers/media/video/c-qcam.c | 2
sem-10/drivers/media/video/cpia.c | 2
sem-10/drivers/media/video/ir-kbd-i2c.c | 2
sem-10/drivers/media/video/msp3400.c | 2
sem-10/drivers/media/video/planb.c | 2
sem-10/drivers/media/video/tvmixer.c | 2
sem-10/drivers/media/video/videodev.c | 2
sem-10/drivers/mtd/mtd_blkdevs.c | 2
sem-10/drivers/net/3c527.c | 2
sem-10/drivers/net/hamradio/6pack.c | 2
sem-10/drivers/net/ibmveth.c | 2
sem-10/drivers/net/plip.c | 2
sem-10/drivers/net/ppp_synctty.c | 2
sem-10/drivers/oprofile/event_buffer.h | 2
sem-10/drivers/oprofile/oprof.c | 2
sem-10/drivers/pci/hotplug/acpiphp_glue.c | 2
sem-10/drivers/pci/hotplug/pciehp.h | 2
sem-10/drivers/pci/hotplug/rpadlpar_core.c | 2
sem-10/drivers/pci/hotplug/shpchp.h | 2
sem-10/drivers/s390/char/sclp_cpi.c | 2
sem-10/drivers/s390/cio/ccwgroup.c | 2
sem-10/drivers/s390/cio/qdio.c | 2
sem-10/drivers/s390/net/qeth_main.c | 2
sem-10/drivers/scsi/aacraid/aachba.c | 2
sem-10/drivers/scsi/aacraid/commctrl.c | 2
sem-10/drivers/scsi/aacraid/comminit.c | 2
sem-10/drivers/scsi/aacraid/commsup.c | 2
sem-10/drivers/scsi/aacraid/dpcsup.c | 2
sem-10/drivers/scsi/aacraid/linit.c | 2
sem-10/drivers/scsi/aacraid/rkt.c | 2
sem-10/drivers/scsi/aacraid/rx.c | 2
sem-10/drivers/scsi/aacraid/sa.c | 2
sem-10/drivers/scsi/aha152x.c | 2
sem-10/drivers/scsi/dpt/dpti_i2o.h | 2
sem-10/drivers/scsi/libata-core.c | 2
sem-10/drivers/scsi/megaraid/mega_common.h | 2
sem-10/drivers/scsi/megaraid/megaraid_ioctl.h | 2
sem-10/drivers/scsi/qla2xxx/qla_def.h | 2
sem-10/drivers/scsi/scsi_scan.c | 2
sem-10/drivers/scsi/scsi_transport_spi.c | 2
sem-10/drivers/serial/mcfserial.c | 2
sem-10/drivers/serial/pmac_zilog.c | 2
sem-10/drivers/usb/atm/usb_atm.h | 2
sem-10/drivers/usb/class/usb-midi.c | 2
sem-10/drivers/usb/core/hub.c | 2
sem-10/drivers/usb/media/ov511.c | 2
sem-10/drivers/usb/media/pwc/pwc.h | 2
sem-10/drivers/usb/media/sn9c102.h | 2
sem-10/drivers/usb/media/w9968cf.h | 2
sem-10/drivers/usb/net/kaweth.c | 2
sem-10/drivers/usb/serial/io_ti.c | 2
sem-10/drivers/usb/serial/ti_usb_3410_5052.c | 2
sem-10/drivers/w1/w1.h | 2
sem-10/fs/cramfs/inode.c | 2
sem-10/fs/eventpoll.c | 2
sem-10/fs/isofs/compress.c | 2
sem-10/fs/jffs/inode-v23.c | 2
sem-10/fs/jffs/intrep.c | 2
sem-10/fs/jffs2/compr_zlib.c | 2
sem-10/fs/locks.c | 2
sem-10/fs/ntfs/inode.h | 2
sem-10/fs/partitions/devfs.c | 2
sem-10/fs/reiserfs/journal.c | 2
sem-10/fs/reiserfs/xattr.c | 2
sem-10/fs/sysfs/file.c | 2
sem-10/fs/xfs/linux-2.6/mutex.h | 2
sem-10/fs/xfs/linux-2.6/sema.h | 2
sem-10/include/asm-i386/mmu.h | 2
sem-10/include/asm-ia64/sn/nodepda.h | 2
sem-10/include/asm-ppc/ocp.h | 2
sem-10/include/asm-sh/dma.h | 2
sem-10/include/asm-x86_64/mmu.h | 2
sem-10/include/linux/cpu.h | 2
sem-10/include/linux/devfs_fs_kernel.h | 2
sem-10/include/linux/device.h | 2
sem-10/include/linux/fs.h | 2
sem-10/include/linux/i2c.h | 2
sem-10/include/linux/i2o.h | 2
sem-10/include/linux/ide.h | 2
sem-10/include/linux/if_pppox.h | 2
sem-10/include/linux/jbd.h | 2
sem-10/include/linux/jffs2_fs_i.h | 2
sem-10/include/linux/jffs2_fs_sb.h | 2
sem-10/include/linux/lp.h | 2
sem-10/include/linux/mtd/blktrans.h | 2
sem-10/include/linux/mtd/doc2000.h | 2
sem-10/include/linux/parport.h | 2
sem-10/include/linux/raid/md.h | 2
sem-10/include/linux/sched.h | 2
sem-10/include/linux/seq_file.h | 2
sem-10/include/linux/syscalls.h | 2
sem-10/include/linux/udf_fs_sb.h | 2
sem-10/include/sound/core.h | 2
sem-10/include/sound/rawmidi.h | 2
sem-10/ipc/compat.c | 2
sem-10/kernel/cpu.c | 2
sem-10/kernel/cpuset.c | 2
sem-10/kernel/kthread.c | 2
sem-10/kernel/module.c | 2
sem-10/kernel/posix-timers.c | 2
sem-10/kernel/profile.c | 2
sem-10/kernel/stop_machine.c | 2
sem-10/lib/reed_solomon/reed_solomon.c | 2
sem-10/net/core/flow.c | 2
sem-10/net/ipv4/ipcomp.c | 2
sem-10/net/ipv4/netfilter/arp_tables.c | 2
sem-10/net/ipv4/netfilter/ip_tables.c | 2
sem-10/net/ipv6/ipcomp6.c | 2
sem-10/net/ipv6/netfilter/ip6_tables.c | 2
sem-10/security/selinux/hooks.c | 2
sem-10/security/selinux/selinuxfs.c | 2
sem-10/security/selinux/ss/conditional.c | 2
sem-10/security/selinux/ss/services.c | 2
sem-10/sound/arm/sa11xx-uda1341.c | 2
sem-10/sound/core/memalloc.c | 2
sem-10/sound/core/seq/seq_midi.c | 2
sem-10/sound/oss/aci.c | 2
sem-10/sound/oss/dmasound/dmasound_awacs.c | 2
sem-10/sound/oss/via82cxxx_audio.c | 2
sem-10/sound/oss/vwsnd.c | 2
sem-20/arch/i386/kernel/Makefile | 1
sem-20/arch/i386/kernel/rwlock.c | 50 ++
sem-20/arch/i386/kernel/semaphore.c | 33 -
sem-30/Makefile | 2
sem-30/arch/alpha/kernel/Makefile | 2
sem-30/arch/alpha/kernel/alpha_ksyms.c | 9
sem-30/arch/alpha/kernel/semaphore.c | 224 ------------
sem-30/arch/arm/kernel/Makefile | 2
sem-30/arch/arm/kernel/semaphore.c | 219 ------------
sem-30/arch/arm26/kernel/Makefile | 2
sem-30/arch/arm26/kernel/semaphore.c | 222 ------------
sem-30/arch/cris/kernel/Makefile | 2
sem-30/arch/cris/kernel/crisksyms.c | 6
sem-30/arch/cris/kernel/semaphore.c | 130 -------
sem-30/arch/frv/kernel/Makefile | 2
sem-30/arch/frv/kernel/semaphore.c | 156 --------
sem-30/arch/h8300/kernel/Makefile | 2
sem-30/arch/h8300/kernel/semaphore.c | 133 -------
sem-30/arch/i386/kernel/Makefile | 2
sem-30/arch/i386/kernel/i386_ksyms.c | 4
sem-30/arch/i386/kernel/semaphore.c | 264 ---------------
sem-30/arch/ia64/kernel/Makefile | 2
sem-30/arch/ia64/kernel/ia64_ksyms.c | 6
sem-30/arch/ia64/kernel/semaphore.c | 165 ---------
sem-30/arch/m32r/kernel/Makefile | 2
sem-30/arch/m32r/kernel/m32r_ksyms.c | 4
sem-30/arch/m32r/kernel/semaphore.c | 186 ----------
sem-30/arch/m68k/kernel/Makefile | 2
sem-30/arch/m68k/kernel/m68k_ksyms.c | 5
sem-30/arch/m68k/kernel/semaphore.c | 133 -------
sem-30/arch/m68k/lib/Makefile | 2
sem-30/arch/m68k/lib/semaphore.S | 53 ---
sem-30/arch/m68knommu/kernel/Makefile | 2
sem-30/arch/m68knommu/kernel/m68k_ksyms.c | 5
sem-30/arch/m68knommu/kernel/semaphore.c | 134 -------
sem-30/arch/m68knommu/lib/Makefile | 2
sem-30/arch/m68knommu/lib/semaphore.S | 67 ---
sem-30/arch/mips/kernel/Makefile | 2
sem-30/arch/mips/kernel/semaphore.c | 164 ---------
sem-30/arch/parisc/kernel/Makefile | 2
sem-30/arch/parisc/kernel/parisc_ksyms.c | 5
sem-30/arch/parisc/kernel/semaphore.c | 102 -----
sem-30/arch/ppc/kernel/Makefile | 2
sem-30/arch/ppc/kernel/ppc_ksyms.c | 3
sem-30/arch/ppc/kernel/semaphore.c | 131 -------
sem-30/arch/ppc64/kernel/Makefile | 2
sem-30/arch/ppc64/kernel/semaphore.c | 136 -------
sem-30/arch/s390/kernel/Makefile | 2
sem-30/arch/s390/kernel/s390_ksyms.c | 7
sem-30/arch/s390/kernel/semaphore.c | 108 ------
sem-30/arch/sh/kernel/Makefile | 2
sem-30/arch/sh/kernel/semaphore.c | 139 -------
sem-30/arch/sh/kernel/sh_ksyms.c | 6
sem-30/arch/sh64/kernel/Makefile | 2
sem-30/arch/sh64/kernel/semaphore.c | 140 -------
sem-30/arch/sh64/kernel/sh_ksyms.c | 3
sem-30/arch/sparc/kernel/Makefile | 2
sem-30/arch/sparc/kernel/semaphore.c | 155 --------
sem-30/arch/sparc/kernel/sparc_ksyms.c | 5
sem-30/arch/sparc64/kernel/Makefile | 2
sem-30/arch/sparc64/kernel/semaphore.c | 251 --------------
sem-30/arch/sparc64/kernel/sparc64_ksyms.c | 6
sem-30/arch/um/sys-i386/ksyms.c | 6
sem-30/arch/v850/kernel/Makefile | 2
sem-30/arch/v850/kernel/semaphore.c | 166 ---------
sem-30/arch/v850/kernel/v850_ksyms.c | 6
sem-30/arch/x86_64/kernel/Makefile | 2
sem-30/arch/x86_64/kernel/semaphore.c | 180 ----------
sem-30/arch/x86_64/kernel/x8664_ksyms.c | 4
sem-30/arch/x86_64/lib/thunk.S | 16
sem-30/include/asm-alpha/semaphore.h | 153 --------
sem-30/include/asm-arm/semaphore.h | 106 ------
sem-30/include/asm-arm26/semaphore.h | 103 -----
sem-30/include/asm-cris/semaphore.h | 142 --------
sem-30/include/asm-frv/semaphore.h | 161 ---------
sem-30/include/asm-h8300/semaphore-helper.h | 86 ----
sem-30/include/asm-h8300/semaphore.h | 194 -----------
sem-30/include/asm-i386/semaphore.h | 194 -----------
sem-30/include/asm-ia64/semaphore.h | 102 -----
sem-30/include/asm-m32r/semaphore.h | 205 -----------
sem-30/include/asm-m68k/semaphore.h | 167 ---------
sem-30/include/asm-m68knommu/semaphore.h | 157 --------
sem-30/include/asm-mips/semaphore.h | 112 ------
sem-30/include/asm-parisc/semaphore.h | 147 --------
sem-30/include/asm-ppc/semaphore.h | 111 ------
sem-30/include/asm-ppc64/semaphore.h | 98 -----
sem-30/include/asm-s390/semaphore.h | 110 ------
sem-30/include/asm-sh/semaphore.h | 119 ------
sem-30/include/asm-sh64/semaphore.h | 123 ------
sem-30/include/asm-sparc/semaphore.h | 196 -----------
sem-30/include/asm-sparc64/semaphore.h | 57 ---
sem-30/include/asm-um/semaphore.h | 6
sem-30/include/asm-v850/semaphore.h | 88 -----
sem-30/include/asm-x86_64/semaphore.h | 196 -----------
sem-30/include/linux/semaphore.h | 147 ++++++++
sem-30/kernel/Makefile | 3
sem-30/kernel/semaphore.c | 179 ++++++++++
336 files changed, 659 insertions(+), 7315 deletions(-)


2005-04-28 18:48:37

by James Bottomley

Subject: Re: [RFC] unify semaphore implementations

On Thu, 2005-04-28 at 14:29 -0400, Benjamin LaHaise wrote:
> Please review the following series of patches for unifying the
> semaphore implementation across all architectures (not posted as
> they're about 350K), as they have only been tested on x86-64. The
> code generated is functionally identical to the earlier i386
> variant, but since gcc has no way of taking condition codes as
> results, there are two additional instructions inserted from the
> use of generic atomic operations. All told the >6000 lines of code
> deleted makes for a much easier job for subsequent patches changing
> semaphore functionality. Cheers,

It's all very well for platforms that have efficient atomic operations.
However, on parisc we have no such luxury (the processor has no atomic
operations, so we have to fiddle them in the kernel using locks), so it
looks like you're making our semaphore operations less efficient.

Could you come up with a less monolithic way to share this so that we
can still do a spinlock semaphore implementation instead of an atomic op
based one?

Thanks,

James


2005-04-28 19:00:24

by Benjamin LaHaise

Subject: Re: [RFC] unify semaphore implementations

On Thu, Apr 28, 2005 at 11:48:09AM -0700, James Bottomley wrote:
> Could you come up with a less monolithic way to share this so that we
> can still do a spinlock semaphore implementation instead of an atomic op
> based one?

As I read the code, it doesn't make a difference: parisc will take a
spin lock within the atomic operation and then release it, which makes
the old fast path for the semaphores and the new fast path pretty much
equivalent (they both take and release one spinlock). The only extra
cost is the address computation for the spinlock. If there is contention
for the atomic spinlocks, then parisc can increase the number of buckets
in their hashed spinlocks.
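For readers unfamiliar with the scheme being discussed, the hashed-spinlock
emulation of atomics can be pictured with a userspace sketch (pthread mutexes
stand in for the kernel's hashed spinlocks; all names here are illustrative,
not the actual parisc code):

```c
/* Sketch of atomics emulated with a small hash of locks, as described
 * above: each atomic op takes and releases exactly one lock, chosen by
 * hashing the variable's address.  NR_LOCKS is the "number of buckets"
 * that could be increased under contention. */
#include <pthread.h>

#define NR_LOCKS 16
static pthread_mutex_t hashed_lock[NR_LOCKS] = {
	[0 ... NR_LOCKS - 1] = PTHREAD_MUTEX_INITIALIZER	/* GCC range init */
};

static pthread_mutex_t *lock_for(volatile int *v)
{
	/* shift to spread nearby addresses across buckets */
	return &hashed_lock[((unsigned long)v >> 4) % NR_LOCKS];
}

/* atomic_add_return() built from one lock/unlock pair -- the same cost
 * the old spinlock-based semaphore fast path already paid. */
int atomic_add_return_emulated(int i, volatile int *v)
{
	pthread_mutex_t *lock = lock_for(v);
	int ret;

	pthread_mutex_lock(lock);
	ret = (*v += i);
	pthread_mutex_unlock(lock);
	return ret;
}
```

Under this model, both the old spinlock-based fast path and an atomic-op fast
path built on such an operation take and release exactly one lock, which is
the equivalence being claimed.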

-ben
--
"Time is what keeps everything from happening all at once." -- John Wheeler

2005-04-28 19:03:06

by David Miller

Subject: Re: [RFC] unify semaphore implementations

On Thu, 28 Apr 2005 14:59:56 -0400
Benjamin LaHaise <[email protected]> wrote:

> On Thu, Apr 28, 2005 at 11:48:09AM -0700, James Bottomley wrote:
> > Could you come up with a less monolithic way to share this so that we
> > can still do a spinlock semaphore implementation instead of an atomic op
> > based one?
>
> As I read the code, it doesn't make a difference: parisc will take a
> spin lock within the atomic operation and then release it, which makes
> the old fast path for the semaphores and the new fast path pretty much
> equivalent (they both take and release one spinlock).

I think parisc should be allowed to choose their implementation of
semaphores. Look, if you change semaphores in some way it will
be their problem to keep their parisc version in sync.

Or you could provide both a spinlocked and an atomic op based implementation
of generic semaphores, as we do for rwsem already.
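The atomic-op flavour of such a generic fast path might look roughly like the
following (a hedged userspace sketch using C11 atomics, inferred from the
i386-derived design described in the RFC, not code from the actual patch; the
counters stand in for the sleep/wakeup slow paths):

```c
/* Sketch of the classic i386-style semaphore fast path: down()
 * decrements the count and enters the slow path only when the result
 * goes negative; up() increments and would wake a sleeper when the
 * result is still <= 0. */
#include <stdatomic.h>

struct sem {
	atomic_int count;	/* waiter list omitted in this sketch */
};

static int slow_path_entered;	/* stands in for __down_failed() */

void down_sketch(struct sem *s)
{
	/* fetch_sub returns the old value; old - 1 is the new count */
	if (atomic_fetch_sub(&s->count, 1) - 1 < 0)
		slow_path_entered++;	/* would sleep on the wait queue */
}

void up_sketch(struct sem *s)
{
	if (atomic_fetch_add(&s->count, 1) + 1 <= 0)
		;			/* would call __up_wakeup() here */
}
```

In the real code the slow path sleeps on a wait queue and up() wakes a
sleeper; the counter here only marks where those calls would go.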

2005-04-28 22:42:00

by Russell King

Subject: Re: [RFC] unify semaphore implementations

On Thu, Apr 28, 2005 at 02:29:26PM -0400, Benjamin LaHaise wrote:
> Please review the following series of patches for unifying the
> semaphore implementation across all architectures (not posted as
> they're about 350K), as they have only been tested on x86-64. The
> code generated is functionally identical to the earlier i386
> variant, but since gcc has no way of taking condition codes as
> results, there are two additional instructions inserted from the
> use of generic atomic operations. All told the >6000 lines of code
> deleted makes for a much easier job for subsequent patches changing
> semaphore functionality. Cheers,

I'm not sure why we're doing this, apart from a desire to unify stuff.
What happened to efficiency and performance?

It is my understanding that the inline part of the semaphore
implementation was one of the critical areas - critical enough to
warrant coding it in assembly for some people.

So, I set about testing this implementation against the ARM
implementation. For this, I used this code:

void it(struct semaphore *sem, int *val)
{
	down(sem);
	*val = *val + 1;
	up(sem);
}

which is a relatively simple test case.

The ARM assembly implementation for this gave:

	str	lr, [sp, #-4]!
@ down_op
	mrs	ip, cpsr		@ local_irq_save
	orr	lr, ip, #128
	msr	cpsr_c, lr
	ldr	lr, [r0]
	subs	lr, lr, #1
	str	lr, [r0]
	msr	cpsr_c, ip		@ local_irq_restore
	movmi	ip, r0
	blmi	__down_failed
	ldr	r3, [r1, #0]
	add	r3, r3, #1
	str	r3, [r1, #0]
@ up_op
	mrs	ip, cpsr		@ local_irq_save
	orr	lr, ip, #128
	msr	cpsr_c, lr
	ldr	lr, [r0]
	adds	lr, lr, #1
	str	lr, [r0]
	msr	cpsr_c, ip		@ local_irq_restore
	movle	ip, r0
	blle	__up_wakeup
	ldr	pc, [sp], #4

Stack: 1 location
Registers: 3 on top of the two arguments
Instructions: 23, 23 in the common execution path.

The atomic-op based implementation gave:

	stmfd	sp!, {r4, r5, lr}
	mov	r5, r1
	mov	r4, r0
	mrs	r2, cpsr		@ local_irq_save
	orr	r1, r2, #128
	msr	cpsr_c, r1
	ldr	r3, [r0, #0]
	sub	r3, r3, #1
	str	r3, [r0, #0]
	msr	cpsr_c, r2		@ local_irq_restore
	cmp	r3, #0
	blt	.L8
.L4:	ldr	r3, [r5, #0]
	add	r3, r3, #1
	str	r3, [r5, #0]
	mrs	r2, cpsr		@ local_irq_save
	orr	r1, r2, #128
	msr	cpsr_c, r1
	ldr	r3, [r4, #0]
	add	r3, r3, #1
	str	r3, [r4, #0]
	msr	cpsr_c, r2		@ local_irq_restore
	cmp	r3, #0
	mov	r0, r4
	ldmgtfd	sp!, {r4, r5, pc}
	ldmfd	sp!, {r4, r5, lr}
	b	up_wakeup
.L8:	bl	down_failed
	b	.L4

Stack: 3 locations
Registers: 5 on top of the two arguments
Instructions: 29, 27 in the common execution path.

So, Ben's implementation is more expensive in terms of stack usage,
register usage, and number of instructions.

Why is this? The answer is that gcc is unable to properly optimise
for ARM - I've no idea why. One such simple thing is that the
sequence:

	blt	.L8
.L4:
	...
.L8:	bl	down_failed
	b	.L4

can be easily rewritten as:

	bllt	down_failed

Another reason for the extra code bloat is that GCC can't propagate
PSR flags between the sub/add instructions across a memory barrier
(which local_irq_restore is.)

This makes the ARM assembly code implementation superior to any unified
version, no matter how you look at it from efficiency or performance
points of view. The _only_ win is that of unification.

However, given that the kernel has bloated itself by 300K per stable
kernel series on ARM for the same functionality, I'd prefer to keep
my more efficient higher performance smaller code size semaphores
please.

--
Russell King
Linux kernel 2.6 ARM Linux - http://www.arm.linux.org.uk/
maintainer of: 2.6 Serial core

2005-04-28 22:56:19

by David Howells

Subject: Re: [RFC] unify semaphore implementations


Benjamin LaHaise <[email protected]> wrote:

> Please review the following series of patches for unifying the
> semaphore implementation across all architectures (not posted as
> they're about 350K), as they have only been tested on x86-64.

It's not as efficient as it could be. You get the semaphore spinlock again in
down_*failed() after you've slept. You can probably avoid doing that. If you
want just to wake up the first thing in the queue each time (as you seem to do
by using add_wait_queue_exclusive_locked()) then why not do as rwsems do and
make up() transfer ownership directly? up() has to take the spinlock anyway
when it examines the queue of sleepers. That way up() can also transfer the
ownership to things other than processes; something that doesn't appear
easy with waitqueues.
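The handoff being described could be sketched like this (an assumed shape for
illustration, with a pthread mutex standing in for the semaphore spinlock and
a plain list for the sleeper queue; not the actual rwsem code):

```c
/* Sketch of a "direct handoff" up(): up() already holds the spinlock to
 * inspect the sleeper queue, so instead of raising the count and letting
 * the woken task re-take the lock in down_failed(), it grants the
 * semaphore to the first waiter before waking it. */
#include <pthread.h>
#include <stddef.h>

struct waiter {
	struct waiter *next;
	int granted;		/* set when ownership is handed over */
};

struct handoff_sem {
	pthread_mutex_t lock;	/* plays the role of the semaphore spinlock */
	int count;
	struct waiter *waiters;	/* head of the sleeper queue */
};

void up_handoff(struct handoff_sem *s)
{
	pthread_mutex_lock(&s->lock);
	if (s->waiters) {
		struct waiter *w = s->waiters;

		s->waiters = w->next;
		w->granted = 1;	/* ownership passes here; the woken task
				 * never re-takes s->lock */
	} else {
		s->count++;	/* nobody waiting: just release */
	}
	pthread_mutex_unlock(&s->lock);
}
```

Note that the "waiter" here need not be a process at all, which is the point
about transferring ownership to other kinds of consumers.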

And if you want to thoroughly test and benchmark your semaphores, see the
attached module. It tests both semaphores (as mutexes) and rw-semaphores.

David

/* rwsem-any.c: run some threads to do R/W semaphore tests
 *
 * Copyright (C) 2005 Red Hat, Inc. All Rights Reserved.
 * Written by David Howells ([email protected])
 *
 * This program is free software; you can redistribute it and/or
 * modify it under the terms of the GNU General Public License
 * as published by the Free Software Foundation; either version
 * 2 of the License, or (at your option) any later version.
 *
 * run as: insmod rwsem-any.ko rd=1 wr=1 dg=0 mx=0
 */

#include <linux/config.h>
#include <linux/module.h>
#include <linux/poll.h>
#include <linux/moduleparam.h>
#include <linux/stat.h>
#include <linux/init.h>
#include <asm/atomic.h>
#include <linux/personality.h>
#include <linux/smp_lock.h>
#include <linux/delay.h>
#include <linux/timer.h>
#include <linux/completion.h>

#define SCHED
#define LOAD_TEST

static int nummx = 0, numrd = 1, numwr = 1, numdg = 1, elapse = 5;
static int verbose = 0;

MODULE_AUTHOR("David Howells");
MODULE_DESCRIPTION("R/W semaphore test demo");
MODULE_LICENSE("GPL");

module_param_named(v, verbose, int, 0);
MODULE_PARM_DESC(verbose, "Verbosity");

module_param_named(mx, nummx, int, 0);
MODULE_PARM_DESC(nummx, "Number of mutex threads");

module_param_named(rd, numrd, int, 0);
MODULE_PARM_DESC(numrd, "Number of reader threads");

module_param_named(wr, numwr, int, 0);
MODULE_PARM_DESC(numwr, "Number of writer threads");

module_param_named(dg, numdg, int, 0);
MODULE_PARM_DESC(numdg, "Number of downgrader threads");

module_param(elapse, int, 0);
MODULE_PARM_DESC(elapse, "Number of seconds to run for");

#ifdef LOAD_TEST
static int load = 0;
module_param(load, int, 0);
MODULE_PARM_DESC(load, "Length of load in uS");
#endif

#ifdef SCHED
static int do_sched = 0;
module_param(do_sched, int, 0);
MODULE_PARM_DESC(do_sched, "True if each thread should schedule regularly");
#endif

/* the semaphores under test */
static struct semaphore mutex_sem;
static struct rw_semaphore rwsem_sem;


#ifdef DEBUG_RWSEM
extern struct rw_semaphore *rwsem_to_monitor;
#endif

static atomic_t mutexes = ATOMIC_INIT(0);
static atomic_t readers = ATOMIC_INIT(0);
static atomic_t writers = ATOMIC_INIT(0);
static atomic_t do_stuff = ATOMIC_INIT(0);
static atomic_t mutexes_taken = ATOMIC_INIT(0);
static atomic_t reads_taken = ATOMIC_INIT(0);
static atomic_t writes_taken = ATOMIC_INIT(0);
static atomic_t downgrades_taken = ATOMIC_INIT(0);

static struct completion mx_comp[20], rd_comp[20], wr_comp[20], dg_comp[20];

static struct timer_list timer;

#define CHECK(expr) \
do { \
	if (!(expr)) \
		printk("check " #expr " failed in %s\n", __func__); \
} while (0)

#define CHECKA(var, val) \
do { \
	int x = atomic_read(&(var)); \
	if (x != (val)) \
		printk("check [%s != %d, == %d] failed in %s\n", #var, (val), x, __func__); \
} while (0)

static inline void d(void)
{
	down(&mutex_sem);
	atomic_inc(&mutexes);
	atomic_inc(&mutexes_taken);
	CHECKA(mutexes, 1);
}

static inline void u(void)
{
	CHECKA(mutexes, 1);
	atomic_dec(&mutexes);
	up(&mutex_sem);
}

static inline void dr(void)
{
	down_read(&rwsem_sem);
	atomic_inc(&readers);
	atomic_inc(&reads_taken);
	CHECKA(writers, 0);
}

static inline void ur(void)
{
	CHECKA(writers, 0);
	atomic_dec(&readers);
	up_read(&rwsem_sem);
}

static inline void dw(void)
{
	down_write(&rwsem_sem);
	atomic_inc(&writers);
	atomic_inc(&writes_taken);
	CHECKA(writers, 1);
	CHECKA(readers, 0);
}

static inline void uw(void)
{
	CHECKA(writers, 1);
	CHECKA(readers, 0);
	atomic_dec(&writers);
	up_write(&rwsem_sem);
}

static inline void dgw(void)
{
	CHECKA(writers, 1);
	CHECKA(readers, 0);
	atomic_dec(&writers);
	atomic_inc(&readers);
	downgrade_write(&rwsem_sem);
	atomic_inc(&downgrades_taken);
}

static inline void sched(void)
{
#ifdef SCHED
	if (do_sched)
		schedule();
#endif
}

int mutexer(void *arg)
{
	unsigned int N = (unsigned long) arg;

	daemonize("Mutex%u", N);

	while (atomic_read(&do_stuff)) {
		d();
#ifdef LOAD_TEST
		if (load)
			udelay(load);
#endif
		u();
		sched();
	}

	if (verbose)
		printk("%s: done\n", current->comm);
	complete_and_exit(&mx_comp[N], 0);
}

int reader(void *arg)
{
	unsigned int N = (unsigned long) arg;

	daemonize("Read%u", N);

	while (atomic_read(&do_stuff)) {
		dr();
#ifdef LOAD_TEST
		if (load)
			udelay(load);
#endif
		ur();
		sched();
	}

	if (verbose)
		printk("%s: done\n", current->comm);
	complete_and_exit(&rd_comp[N], 0);
}

int writer(void *arg)
{
	unsigned int N = (unsigned long) arg;

	daemonize("Write%u", N);

	while (atomic_read(&do_stuff)) {
		dw();
#ifdef LOAD_TEST
		if (load)
			udelay(load);
#endif
		uw();
		sched();
	}

	if (verbose)
		printk("%s: done\n", current->comm);
	complete_and_exit(&wr_comp[N], 0);
}

int downgrader(void *arg)
{
	unsigned int N = (unsigned long) arg;

	daemonize("Down%u", N);

	while (atomic_read(&do_stuff)) {
		dw();
#ifdef LOAD_TEST
		if (load)
			udelay(load);
#endif
		dgw();
#ifdef LOAD_TEST
		if (load)
			udelay(load);
#endif
		ur();
		sched();
	}

	if (verbose)
		printk("%s: done\n", current->comm);
	complete_and_exit(&dg_comp[N], 0);
}

static void stop_test(unsigned long dummy)
{
	atomic_set(&do_stuff, 0);
}

/*****************************************************************************/
/*
 *
 */
static int __init rwsem_init_module(void)
{
	unsigned long loop;

	if (verbose)
		printk("\nrwsem_any starting tests...\n");

	init_MUTEX(&mutex_sem);
	init_rwsem(&rwsem_sem);
	atomic_set(&do_stuff, 1);

#ifdef DEBUG_RWSEM
	rwsem_to_monitor = &rwsem_sem;
#endif

	/* kick off all the children */
	for (loop = 0; loop < 20; loop++) {
		if (loop < nummx) {
			init_completion(&mx_comp[loop]);
			kernel_thread(mutexer, (void *) loop, 0);
		}

		if (loop < numrd) {
			init_completion(&rd_comp[loop]);
			kernel_thread(reader, (void *) loop, 0);
		}

		if (loop < numwr) {
			init_completion(&wr_comp[loop]);
			kernel_thread(writer, (void *) loop, 0);
		}

		if (loop < numdg) {
			init_completion(&dg_comp[loop]);
			kernel_thread(downgrader, (void *) loop, 0);
		}
	}

	/* set a stop timer */
	init_timer(&timer);
	timer.function = stop_test;
	timer.expires = jiffies + elapse * HZ;
	add_timer(&timer);

	/* now wait until it's all done */
	for (loop = 0; loop < nummx; loop++)
		wait_for_completion(&mx_comp[loop]);

	for (loop = 0; loop < numrd; loop++)
		wait_for_completion(&rd_comp[loop]);

	for (loop = 0; loop < numwr; loop++)
		wait_for_completion(&wr_comp[loop]);

	for (loop = 0; loop < numdg; loop++)
		wait_for_completion(&dg_comp[loop]);

	atomic_set(&do_stuff, 0);
	del_timer(&timer);

	/* print the results */
	if (verbose) {
#ifdef CONFIG_RWSEM_XCHGADD_ALGORITHM
		printk("rwsem counter = %ld\n", rwsem_sem.count);
#else
		printk("rwsem counter = %d\n", rwsem_sem.activity);
#endif

		printk("mutexes taken: %d\n", atomic_read(&mutexes_taken));
		printk("reads taken: %d\n", atomic_read(&reads_taken));
		printk("writes taken: %d\n", atomic_read(&writes_taken));
		printk("downgrades taken: %d\n", atomic_read(&downgrades_taken));
	}
	else {
		printk("%3d %3d %3d %3d %c %3d %9d %9d %9d %9d\n",
		       nummx, numrd, numwr, numdg,
		       do_sched ? 's' : '-',
#ifdef LOAD_TEST
		       load,
#else
		       0,
#endif
		       atomic_read(&mutexes_taken),
		       atomic_read(&reads_taken),
		       atomic_read(&writes_taken),
		       atomic_read(&downgrades_taken));
	}

#ifdef DEBUG_RWSEM
	rwsem_to_monitor = NULL;
#endif

	/* tell insmod to discard the module */
	return -ENOANO;
} /* end rwsem_init_module() */

2005-04-29 00:40:39

by Paul Mackerras

Subject: Re: [RFC] unify semaphore implementations

Benjamin LaHaise writes:

> Please review the following series of patches for unifying the
> semaphore implementation across all architectures (not posted as
> they're about 350K), as they have only been tested on x86-64. The
> code generated is functionally identical to the earlier i386
> variant, but since gcc has no way of taking condition codes as
> results, there are two additional instructions inserted from the
> use of generic atomic operations. All told the >6000 lines of code
> deleted makes for a much easier job for subsequent patches changing
> semaphore functionality. Cheers,

Vetoed.

You have made semaphores bigger and slower on the architectures that
have load-linked/store-conditional instructions, which is at least
ppc, ppc64, sparc64 and alpha. Did you take the trouble to understand
the ppc semaphore implementation?

What changes do you want to make to the semaphore functionality?

Paul.

2005-04-29 00:42:56

by Trond Myklebust

Subject: Re: [RFC] unify semaphore implementations

On Thu, 2005-04-28 at 23:40 +0100, Russell King wrote:
> On Thu, Apr 28, 2005 at 02:29:26PM -0400, Benjamin LaHaise wrote:
> > Please review the following series of patches for unifying the
> > semaphore implementation across all architectures (not posted as
> > they're about 350K), as they have only been tested on x86-64. The
> > code generated is functionally identical to the earlier i386
> > variant, but since gcc has no way of taking condition codes as
> > results, there are two additional instructions inserted from the
> > use of generic atomic operations. All told the >6000 lines of code
> > deleted makes for a much easier job for subsequent patches changing
> > semaphore functionality. Cheers,
>
> I'm not sure why we're doing this, apart from a desire to unify stuff.

It started from a desire to extend the existing implementations to
support new features such as asynchronous notification. Currently that
sort of thing is impossible unless your developer-super-powers include
the ability to herd 24 different subsystem maintainers into working
together on a solution.

In other words, the main drive is the desire to make it maintainable.

Cheers,
Trond
--
Trond Myklebust <[email protected]>

2005-04-29 01:23:08

by Paul Mackerras

Subject: Re: [RFC] unify semaphore implementations

Trond Myklebust writes:

> It started from a desire to extend the existing implementations to
> support new features such as asynchronous notification. Currently that
> sort of thing is impossible unless your developer-super-powers include
> the ability to herd 24 different subsystem maintainers into working
> together on a solution.

Well, maybe the slow paths could be unified somewhat, and then these
extra features could be added in the slow paths. I would support
that. I certainly don't support replacing the current optimized
fast-path implementations with a lowest-common-denominator thing like
Ben was proposing.

Paul.

2005-04-29 05:33:34

by Richard Henderson

Subject: Re: [RFC] unify semaphore implementations

On Fri, Apr 29, 2005 at 10:44:17AM +1000, Paul Mackerras wrote:
> You have made semaphores bigger and slower on the architectures that
> have load-linked/store-conditional instructions, which is at least
> ppc, ppc64, sparc64 and alpha.

And mips.

While sparc64 doesn't have ll/sc, it does have compare-and-swap and
it's trivial to use that exactly like we use ll/sc. S390 also has
compare-and-swap as its atomic primitive.

Seems to me that the ppc semaphore implementation is superior to the
i386 implementation that seems to have been propagated here. Indeed,
I might think it would help i486, ia64, and amd64 to use the ppc style
compare-and-swap instead of the existing implementation. Care would
have to be taken such that i386 still works, but I suspect the vast
majority of folk don't configure for that.
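The ppc-style single-counter fast path Richard refers to can be sketched with a compare-and-swap loop. The following is a userspace C11-atomics illustration under assumed names (`down_sketch`, `__down_slow` are hypothetical; the kernel's real code is arch-specific assembly with an out-of-line slow path):

```c
#include <stdatomic.h>

static int slow_calls;  /* stand-in for the kernel's out-of-line slow path */
static void __down_slow(atomic_int *count) { (void)count; slow_calls++; }

/* ppc-style down() fast path on a single counter: try to move the
 * count from a positive value to one less with compare-and-swap;
 * a non-positive count means contention, so fall into the slow path. */
static void down_sketch(atomic_int *count)
{
    int old = atomic_load(count);
    while (old > 0) {
        /* on failure, 'old' is reloaded with the current value and we retry */
        if (atomic_compare_exchange_weak(count, &old, old - 1))
            return;  /* acquired uncontended */
    }
    __down_slow(count);
}
```

On ppc the same loop is expressed with ll/sc (lwarx/stwcx.); the cmpxchg form is what an i486+/amd64 port would use.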

I would support two or three common implementations, but definitely
not the one implementation presented.


r~

2005-04-29 14:15:23

by Benjamin LaHaise

Subject: Re: [RFC] unify semaphore implementations

On Fri, Apr 29, 2005 at 10:44:17AM +1000, Paul Mackerras wrote:
> You have made semaphores bigger and slower on the architectures that
> have load-linked/store-conditional instructions, which is at least
> ppc, ppc64, sparc64 and alpha. Did you take the trouble to understand
> the ppc semaphore implementation?

The ppc implementation does have some good ideas that are worth using.
It's hard to know which of the 23 versions were worth using, but I'm
getting a picture where at least 2 variants are needed. The atomic ops
variant should use the single counter as ppc does (why did nobody port
that to x86?). A spinlock version is needed at least by parisc.

> What changes do you want to make to the semaphore functionality?

There are at least two users who need asynchronous semaphore/mutex
operations: aio_write (which needs to acquire i_sem), and nfs.
Changing 23 different architectures and verifying that they are
correct is next to impossible, so it makes sense to have at least
some unification take place.

-ben
--
"Time is what keeps everything from happening all at once." -- John Wheeler

2005-04-29 15:42:21

by Trond Myklebust

Subject: Re: [RFC] unify semaphore implementations

On Fri 29.04.2005 at 10:14 (-0400), Benjamin LaHaise wrote:
> On Fri, Apr 29, 2005 at 10:44:17AM +1000, Paul Mackerras wrote:
> > You have made semaphores bigger and slower on the architectures that
> > have load-linked/store-conditional instructions, which is at least
> > ppc, ppc64, sparc64 and alpha. Did you take the trouble to understand
> > the ppc semaphore implementation?
>
> The ppc implementation does have some good ideas that are worth using.
> It's hard to know which of the 23 versions were worth using, but I'm
> getting a picture where at least 2 variants are needed. The atomic ops
> variant should use the single counter as ppc does (why did nobody port
> that to x86?). A spinlock version is needed at least by parisc.

The PPC implementation would be hard to port to x86, since it relies on
the load-linked/store-conditional stuff to provide a fast primitive for
atomic_dec_if_positive().
The only way I found to implement that on x86 was to use cmpxchg. On my
machine, therefore, a spinlock-based semaphore implementation turns out
to be at least as fast for the "fast" path (and is naturally much more
efficient for the slow path).
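The cmpxchg approach Trond describes can be sketched in userspace C11 atomics as follows (a hedged illustration, not the kernel's actual implementation; on x86 the compare-exchange compiles down to a lock-prefixed cmpxchg):

```c
#include <stdatomic.h>

/* Sketch: decrement the count only if it is currently positive,
 * returning the decremented value either way.  On ll/sc machines
 * (ppc etc.) the same operation is a short lwarx/stwcx. loop; on
 * x86 the only option is a cmpxchg retry loop like this one. */
static int atomic_dec_if_positive(atomic_int *v)
{
    int old = atomic_load(v);
    while (old > 0) {
        /* compare_exchange updates 'old' on failure, so just retry */
        if (atomic_compare_exchange_weak(v, &old, old - 1))
            return old - 1;
    }
    return old - 1;  /* count was <= 0: report would-be result, store nothing */
}
```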

Cheers,
Trond

--
Trond Myklebust <[email protected]>

2005-04-30 01:45:14

by Paul Mackerras

Subject: Re: [RFC] unify semaphore implementations

Trond Myklebust writes:

> The PPC implementation would be hard to port to x86, since it relies on
> the load-linked/store-conditional stuff to provide a fast primitive for
> atomic_dec_if_positive().

The only fast path that needs atomic_dec_if_positive is down_trylock.
You can use atomic_dec for down_trylock instead; the only downside to
that is that if somebody was holding the semaphore but nobody was
waiting, the holder will take the slow path when it does the up. Or
you can use cmpxchg for down_trylock. I believe that down_trylock is
used much less than down and down_interruptible, so it shouldn't
matter if down_trylock is a few nanoseconds slower than down.

> The only way I found to implement that on x86 was to use cmpxchg. On my
> machine, therefore, a spinlock-based semaphore implementation turns out
> to be at least as fast for the "fast" path (and is naturally much more
> efficient for the slow path).

What is "your machine"? Is a single cmpxchg really slower than
locking and unlocking a spinlock? If so, by how much?

Paul.

2005-04-30 01:49:53

by Paul Mackerras

Subject: Re: [RFC] unify semaphore implementations

Benjamin LaHaise writes:

> There are at least two users who need asynchronous semaphore/mutex
> operations: aio_write (which needs to acquire i_sem), and nfs.

What do you mean by asynchronous semaphore/mutex operations? Are you
talking about a down operation that will acquire the semaphore if it
is free, but if not, arrange for a callback and return immediately, or
something?

Paul.

2005-04-30 05:13:25

by Paul Mackerras

Subject: Re: [RFC] unify semaphore implementations

I wrote:

> The only fast path that needs atomic_dec_if_positive is down_trylock.
> You can use atomic_dec for down_trylock instead; the only downside to
> that is that if somebody was holding the semaphore but nobody was
> waiting, the holder will take the slow path when it does the up.

Better still, on machines without ll/sc we can do down_trylock as:

return atomic_read(&sem->count) <= 0
|| atomic_add_negative(-1, &sem->count);

Then we will only take the slow path unnecessarily on up() if another
cpu decrements the count between the test and the atomic_add_negative,
which should be very rare. Doing the test will also avoid doing the
atomic op if the semaphore is already held.
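Paul's two-step test can be sketched in userspace C11 atomics (function names are hypothetical; the real down_trylock additionally fixes up the count through the sleeper-accounting slow path on failure):

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Equivalent of the kernel's atomic_add_negative: add 'i' to *v and
 * report whether the result went negative. */
static bool atomic_add_negative(int i, atomic_int *v)
{
    return atomic_fetch_add(v, i) + i < 0;
}

/* down_trylock fast path per Paul's suggestion: nonzero on failure.
 * The cheap read up front skips the bus-locked op entirely when the
 * semaphore is already held; only a decrement racing in between the
 * two steps sends a later up() down the slow path unnecessarily. */
static int down_trylock_sketch(atomic_int *count)
{
    return atomic_load(count) <= 0
        || atomic_add_negative(-1, count);
}
```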

Paul.

2005-04-30 16:40:46

by Trond Myklebust

Subject: Re: [RFC] unify semaphore implementations

On Sat, 2005-04-30 at 11:45 +1000, Paul Mackerras wrote:

> What is "your machine"? Is a single cmpxchg really slower than
> locking and unlocking a spinlock? If so, by how much?

Sorry. Thinking back, I realize that I was testing the non-irqsafe
spinlocks, since that was the only case that was of interest for the
iosem stuff. I should go back and test for the case of irqsafe ones.

FYI, though, the machine on which I tested it is a mobile P4. When
averaging over 1000 iterations, a single bus-locked cmpxchg took more
than twice as long to complete as a single bus-locked incb or decb
(as used in the 386 spinlock implementations).
The spinlocked version was therefore not much faster for the fast path,
but for the slow path you do only a single spin_lock/spin_unlock
combination instead of a cmpxchg plus a spin_lock/spin_unlock, making
it roughly twice as fast.

Cheers,
Trond

2005-04-30 16:51:04

by Trond Myklebust

Subject: Re: [RFC] unify semaphore implementations

On Sat, 2005-04-30 at 11:49 +1000, Paul Mackerras wrote:
> Benjamin LaHaise writes:
>
> > There are at least two users who need asynchronous semaphore/mutex
> > operations: aio_write (which needs to acquire i_sem), and nfs.
>
> What do you mean by asynchronous semaphore/mutex operations? Are you
> talking about a down operation that will acquire the semaphore if it
> is free, but if not, arrange for a callback and return immediately, or
> something?

Yes. See the iosem patches I posted a couple of weeks ago.

Cheers,
Trond

--
Trond Myklebust <[email protected]>