2003-02-24 20:02:44

by Patrick Mansfield

Subject: 2.5.62-mm2 slow file system writes across multiple disks

Hi -

Running 2.5.62-mm2, I was trying to get multiple commands queued to
different scsi disks via writes to multiple file systems (each fs
on its own disk), but got rather low performance.

Are there any config options or settings I should change to improve the
performance?

Is this expected behaviour for now?

I'm mounting 10 disks using ext2 with noatime, starting 10 dd's in
parallel, with if=/dev/zero bs=128k count=1000, then umount-ing after each
dd completes.

All disks are attached to a single qlogic 2300, running with the feral
driver. Each disk can do about 20 MB/sec; the host adapter is limited to
about 100 MB/sec (1 Gbit/sec fibre).

Kernel is 2.5.62-mm2 with elevator=deadline on an 8-CPU NUMAQ box; all dd's
were run on CPUs 0-3 to try and avoid any cross-node (NUMA) effects.

Following are numbers with a hacked dd to open with O_DIRECT, then dd
without O_DIRECT, and then my config settings.
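
(For reference, the O_DIRECT hack boils down to something like the sketch
below - this is not the actual modified dd, and the mount point is made up;
bs and count match the test described above.)

#define _GNU_SOURCE                     /* for O_DIRECT */
#include <fcntl.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
        const size_t bs = 128 * 1024;           /* bs=128k */
        void *buf;
        int fd, i;

        /* O_DIRECT needs an aligned buffer; 4096 is a safe alignment */
        if (posix_memalign(&buf, 4096, bs))
                return 1;
        memset(buf, 0, bs);                     /* same effect as if=/dev/zero */

        fd = open("/mnt/disk0/testfile", O_WRONLY | O_CREAT | O_DIRECT, 0644);
        if (fd < 0)
                return 1;

        for (i = 0; i < 1000; i++)              /* count=1000 */
                if (write(fd, buf, bs) != (ssize_t)bs)
                        break;

        close(fd);
        free(buf);
        return 0;
}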


With O_DIRECT set, 10 dd's (writes) in parallel to ten different file
systems on ten different drives:

[patman@elm3b79 iostuff]$ vmstat 1 1000
procs memory swap io system cpu
r b w swpd free buff cache si so bi bo in cs us sy id
0 0 0 0 7035300 229752 500000 0 0 2 15 26 1 0 0 0
0 0 0 0 7035300 229752 500000 0 0 0 0 1168 6 0 0 100
3 8 1 0 7029028 231328 500044 0 0 1594 3124 1426 1471 4 22 74
3 7 0 0 7028644 231440 500044 0 0 44 88324 1911 1632 0 7 93
4 7 0 0 7028644 231480 500044 0 0 0 89088 1873 1362 0 5 95
3 7 0 0 7028580 231520 500044 0 0 0 89216 1851 1364 0 5 94
3 7 0 0 7028580 231564 500044 0 0 0 88784 1889 1374 0 5 94
0 10 0 0 7028516 231612 500044 0 0 0 88448 1806 1363 0 5 95
2 8 0 0 7028452 231656 500044 0 0 0 87936 1796 1353 0 5 95
2 8 0 0 7028452 231696 500044 0 0 0 88960 1814 1343 0 5 94
2 8 0 0 7028388 231740 500044 0 0 0 89788 1861 1378 0 5 95
2 8 0 0 7028324 231780 500044 0 0 0 89216 1828 1379 0 5 95
2 8 0 0 7028324 231824 500044 0 0 0 88192 1834 1370 0 5 95
0 10 0 0 7028324 231872 500044 0 0 0 88832 1843 1393 0 5 95
2 8 0 0 7028260 231916 500044 0 0 4 88964 1847 1400 0 6 94
2 8 0 0 7028132 231968 500044 0 0 8 89088 1865 1398 0 5 95
3 8 0 0 7028132 232008 500044 0 0 0 89472 1871 1384 0 5 95
5 2 2 0 7028764 231584 500020 0 0 8 29372 1763 658 1 47 53
1 3 2 0 7031476 230084 500008 0 0 0 5952 1440 217 0 54 46
1 0 0 0 7033612 229784 500004 0 0 0 200 1277 101 0 3 96
0 0 0 0 7034444 229784 500004 0 0 0 0 1059 53 0 0 99

Total time (for all dd's and umounts to complete; I can post times for
individual dd's if wanted):

0.39user 11.82system 0:22.34elapsed 54%CPU (0avgtext+0avgdata 0maxresident)k
0inputs+0outputs (12790major+5402minor)pagefaults 0swaps



With O_DIRECT NOT set running 10 dd's (writes) in parallel to ten
different file systems on ten different drives on one host gives me:

[patman@elm3b79 iostuff]$ vmstat 1 10000
procs memory swap io system cpu
r b w swpd free buff cache si so bi bo in cs us sy id
0 0 0 0 7070668 222200 471956 0 0 2 16 55 1 0 0 28
0 0 0 0 7070668 222200 471956 0 0 0 0 1202 6 0 0 100
1 0 0 0 7069948 222204 471948 0 0 0 28 1034 32 1 1 98
10 1 2 0 6893724 224036 640644 0 0 1634 664 1470 1637 4 55 41
10 0 0 0 6542300 224388 986092 0 0 12 164 1312 151 0 52 48
10 0 3 0 6206700 224640 1314784 0 0 0 180412 1314 111 0 56 43
5 5 1 0 5916124 224884 1599212 0 0 4 306644 1366 208 1 53 46
0 10 1 0 5805612 224972 1707016 0 0 0 147996 1364 99 1 20 79
0 10 1 0 5805580 224972 1707016 0 0 0 29216 1348 20 0 1 98
1 9 1 0 5768996 225008 1743032 0 0 0 30232 1359 28 0 6 94
0 10 1 0 5759676 225032 1751992 0 0 0 28260 1361 57 0 3 97
0 10 1 0 5759676 225032 1751992 0 0 0 24 1344 8 0 1 99
0 10 1 0 5759676 225032 1751992 0 0 0 14352 1357 20 0 1 99
0 10 1 0 5759676 225032 1751992 0 0 0 11272 1346 14 0 1 99

[ most dd's have completed here, just umount now ]

0 10 1 0 5889108 225032 1623988 0 0 0 18964 1709 28 0 11 89
0 9 1 0 6018916 224744 1495984 0 0 0 70668 1408 72 0 14 86
0 9 1 0 6018788 224744 1495984 0 0 0 83080 1422 52 0 3 97
0 9 1 0 6018636 224744 1495984 0 0 0 77380 1431 52 0 3 97
0 9 1 0 6018348 224744 1495984 0 0 0 77904 1431 52 0 3 97
1 8 1 0 6018260 224756 1495984 0 0 0 52824 1428 60 0 2 98
0 9 1 0 6018276 224756 1495984 0 0 0 18420 1402 21 0 2 98
0 9 1 0 6018292 224756 1495984 0 0 0 12276 1424 22 0 1 99
0 9 1 0 6018308 224756 1495984 0 0 0 7176 1129 14 0 1 99
0 9 1 0 6018380 224756 1495984 0 0 0 3588 1124 12 0 1 99
0 8 1 0 6148668 224468 1367980 0 0 0 14356 1568 30 0 16 84
0 7 1 0 6539564 224176 983968 0 0 0 16908 1372 88 0 32 68
0 4 1 0 6542172 223440 983968 0 0 0 7176 1256 75 0 10 90
0 4 1 0 6672124 223440 855964 0 0 0 7688 1169 16 0 7 92
0 4 1 0 6672540 223440 855964 0 0 0 17940 1255 16 0 1 99
0 3 1 0 6673452 223164 855964 0 0 0 6720 1086 52 0 3 97
0 3 1 0 6673580 223164 855964 0 0 0 0 1070 6 0 0 99
0 3 1 0 6803732 223164 727960 0 0 0 11280 1079 14 0 5 95
1 2 1 0 6804404 223000 727960 0 0 0 7684 1194 18 0 4 96
0 2 1 0 6804860 223000 727960 0 0 0 14352 1217 25 0 1 99
0 2 1 0 6804940 223016 727960 0 0 0 7192 1073 32 0 0 100
0 2 1 0 6935020 223016 599956 0 0 0 40 1056 16 0 6 94
0 1 1 0 6936092 222720 599956 0 0 0 0 1053 23 0 2 97
0 1 1 0 6936092 222720 599956 0 0 0 0 1045 6 0 0 100
0 1 1 0 6936276 222720 599956 0 0 0 0 1050 6 0 0 99
0 1 1 0 6936284 222720 599956 0 0 0 0 1048 6 0 0 100
0 0 3 0 7066996 222428 471952 0 0 0 184 1057 35 0 6 94
0 0 0 0 7067748 222436 471952 0 0 0 8 1064 41 0 0 100
0 0 0 0 7067892 222436 471952 0 0 0 0 1026 8 0 0 100


Total time for all 10 dd's and umounts:

0.49user 25.96system 0:45.25elapsed 58%CPU (0avgtext+0avgdata 0maxresident)k
0inputs+0outputs (12790major+5399minor)pagefaults 0swaps


config options set:

CONFIG_X86=y
CONFIG_MMU=y
CONFIG_SWAP=y
CONFIG_UID16=y
CONFIG_GENERIC_ISA_DMA=y
CONFIG_EXPERIMENTAL=y
CONFIG_SYSVIPC=y
CONFIG_SYSCTL=y
CONFIG_LOG_BUF_SHIFT=16
CONFIG_MODULES=y
CONFIG_MODULE_UNLOAD=y
CONFIG_MODULE_FORCE_UNLOAD=y
CONFIG_OBSOLETE_MODPARM=y
CONFIG_KMOD=y
CONFIG_X86_NUMAQ=y
CONFIG_MPENTIUMIII=y
CONFIG_X86_CMPXCHG=y
CONFIG_X86_XADD=y
CONFIG_X86_L1_CACHE_SHIFT=5
CONFIG_RWSEM_XCHGADD_ALGORITHM=y
CONFIG_X86_WP_WORKS_OK=y
CONFIG_X86_INVLPG=y
CONFIG_X86_BSWAP=y
CONFIG_X86_POPAD_OK=y
CONFIG_X86_GOOD_APIC=y
CONFIG_X86_INTEL_USERCOPY=y
CONFIG_X86_USE_PPRO_CHECKSUM=y
CONFIG_X86_PREFETCH=y
CONFIG_SMP=y
CONFIG_X86_LOCAL_APIC=y
CONFIG_X86_IO_APIC=y
CONFIG_NR_CPUS=32
CONFIG_NUMA=y
CONFIG_DISCONTIGMEM=y
CONFIG_HAVE_ARCH_BOOTMEM_NODE=y
CONFIG_X86_CPUID=y
CONFIG_HIGHMEM64G=y
CONFIG_HIGHMEM=y
CONFIG_X86_PAE=y
CONFIG_HIGHPTE=y
CONFIG_HAVE_DEC_LOCK=y
CONFIG_PCI=y
CONFIG_PCI_GOANY=y
CONFIG_PCI_BIOS=y
CONFIG_PCI_DIRECT=y
CONFIG_PCI_LEGACY_PROC=y
CONFIG_PCI_NAMES=y
CONFIG_ISA=y
CONFIG_HOTPLUG=y
CONFIG_KCORE_ELF=y
CONFIG_BINFMT_AOUT=y
CONFIG_BINFMT_ELF=y
CONFIG_BINFMT_MISC=y
CONFIG_PNP=y
CONFIG_BLK_DEV_FD=y
CONFIG_BLK_DEV_LOOP=y
CONFIG_LBD=y
CONFIG_SCSI=y
CONFIG_BLK_DEV_SD=y
CONFIG_CHR_DEV_ST=y
CONFIG_BLK_DEV_SR=y
CONFIG_BLK_DEV_SR_VENDOR=y
CONFIG_CHR_DEV_SG=m
CONFIG_SCSI_MULTI_LUN=y
CONFIG_SCSI_REPORT_LUNS=y
CONFIG_SCSI_CONSTANTS=y
CONFIG_SCSI_LOGGING=y
CONFIG_SCSI_AIC7XXX=y
CONFIG_AIC7XXX_CMDS_PER_DEVICE=32
CONFIG_AIC7XXX_RESET_DELAY_MS=15000
CONFIG_AIC7XXX_DEBUG_ENABLE=y
CONFIG_AIC7XXX_DEBUG_MASK=0
CONFIG_AIC7XXX_REG_PRETTY_PRINT=y
CONFIG_SCSI_QLOGIC_ISP_NEW=y
CONFIG_SCSI_DEBUG=m
CONFIG_NET=y
CONFIG_PACKET=y
CONFIG_UNIX=y
CONFIG_INET=y
CONFIG_IP_MULTICAST=y
CONFIG_IPV6_SCTP__=y
CONFIG_NETDEVICES=y
CONFIG_DUMMY=m
CONFIG_NET_ETHERNET=y
CONFIG_NET_TULIP=y
CONFIG_TULIP=y
CONFIG_TULIP_MWI=y
CONFIG_TULIP_MMIO=y
CONFIG_NET_PCI=y
CONFIG_ADAPTEC_STARFIRE=y
CONFIG_INPUT=y
CONFIG_SOUND_GAMEPORT=y
CONFIG_SERIO=y
CONFIG_SERIO_I8042=y
CONFIG_SERIO_SERPORT=y
CONFIG_INPUT_KEYBOARD=y
CONFIG_KEYBOARD_ATKBD=y
CONFIG_INPUT_MOUSE=y
CONFIG_MOUSE_PS2=y
CONFIG_MOUSE_SERIAL=y
CONFIG_VT=y
CONFIG_VT_CONSOLE=y
CONFIG_HW_CONSOLE=y
CONFIG_SERIAL_8250=y
CONFIG_SERIAL_8250_CONSOLE=y
CONFIG_SERIAL_CORE=y
CONFIG_SERIAL_CORE_CONSOLE=y
CONFIG_UNIX98_PTYS=y
CONFIG_UNIX98_PTY_COUNT=256
CONFIG_RAW_DRIVER=y
CONFIG_EXT3_FS=y
CONFIG_EXT3_FS_XATTR=y
CONFIG_JBD=y
CONFIG_TMPFS=y
CONFIG_RAMFS=y
CONFIG_ISO9660_FS=y
CONFIG_PROC_FS=y
CONFIG_DEVPTS_FS=y
CONFIG_EXT2_FS=y
CONFIG_FS_MBCACHE=y
CONFIG_MSDOS_PARTITION=y
CONFIG_VIDEO_SELECT=y
CONFIG_VGA_CONSOLE=y
CONFIG_DUMMY_CONSOLE=y
CONFIG_PROFILING=y
CONFIG_OPROFILE=y
CONFIG_DEBUG_KERNEL=y
CONFIG_DEBUG_SLAB=y
CONFIG_MAGIC_SYSRQ=y
CONFIG_DEBUG_SPINLOCK=y
CONFIG_KALLSYMS=y
CONFIG_DEBUG_SPINLOCK_SLEEP=y
CONFIG_X86_EXTRA_IRQS=y
CONFIG_X86_FIND_SMP_CONFIG=y
CONFIG_X86_MPPARSE=y
CONFIG_CRC32=y
CONFIG_X86_SMP=y
CONFIG_X86_HT=y
CONFIG_X86_BIOS_REBOOT=y
CONFIG_X86_TRAMPOLINE=y

-- Patrick Mansfield


2003-02-24 21:46:15

by Andrew Morton

Subject: Re: 2.5.62-mm2 slow file system writes across multiple disks

Patrick Mansfield <[email protected]> wrote:
>
> Hi -
>
> Running 2.5.62-mm2, I was trying to get multiple commands queued to
> different scsi disks via writes to multiple file systems (each fs
> on its own disk), but got rather low performance.
>
> Are there any config options or settings I should change to improve the
> performance?
>
> Is this expected behaviour for now?
>
> I'm mounting 10 disks using ext2 with noatime, starting 10 dd's in
> parallel, with if=/dev/zero bs=128k count=1000, then umount-ing after each
> dd completes.

Could be that concurrent umount isn't a good way of getting scalable
writeout; I can't say that I've ever looked...

Could you try putting a `sync' in there somewhere?

Or even better, throw away dd and use write-and-fsync from ext3 CVS. Give it
the -f flag to force an fsync against each file as it is closed.

http://www.zip.com.au/~akpm/linux/ext3/
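
(In outline, the -f behaviour amounts to the sketch below - not the real
write-and-fsync source; the path and the 100 MB size just echo the examples
used in this thread.)

#include <fcntl.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
        const size_t bs = 128 * 1024;
        const size_t total = 100 * 1024 * 1024;         /* a 100 MB file */
        char *buf = malloc(bs);
        size_t done = 0;
        int fd;

        if (!buf)
                return 1;
        memset(buf, 0, bs);

        fd = open("/mnt/disk0/1", O_WRONLY | O_CREAT | O_TRUNC, 0644);
        if (fd < 0)
                return 1;

        while (done < total) {
                if (write(fd, buf, bs) != (ssize_t)bs)
                        return 1;
                done += bs;
        }

        fsync(fd);      /* what -f adds: flush the file before closing it */
        close(fd);
        free(buf);
        return 0;
}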


2003-02-25 01:47:12

by Patrick Mansfield

Subject: Re: 2.5.62-mm2 slow file system writes across multiple disks

On Mon, Feb 24, 2003 at 01:53:23PM -0800, Andrew Morton wrote:

> Could be that concurrent umount isn't a good way of getting scalable
> writeout; I can't say that I've ever looked...
>
> Could you try putting a `sync' in there somewhere?
>
> Or even better, throw away dd and use write-and-fsync from ext3 CVS. Give it
> the -f flag to force an fsync against each file as it is closed.
>
> http://www.zip.com.au/~akpm/linux/ext3/

Using fsync didn't seem to make any difference.

I moved to 2.5.62-mm3 [I had to drop back to qlogicisp for my boot disk,
and run the feral driver as a module in order to boot without hanging], and
ran write-and-fsync with -f, with and without -o (O_DIRECT).

What keeps pdflush running when get_request_wait sleeps? I thought there
was (past tense) some creation of multiple pdflushes to handle such cases.

Here are more details:

[patman@elm3b79 iostuff]$ cat /proc/cmdline
BOOT_IMAGE=2562-mm3-1 ro root=801 BOOT_FILE=/boot/vmlinuz-2562-mm3 console=tty0 console=ttyS0,38400 notsc elevator=deadline

Write 10 100 MB files to 10 different disks, write-and-fsync with -f and -o:

[patman@elm3b79 iostuff]$ vmstat 1 10000
procs memory swap io system cpu
r b w swpd free buff cache si so bi bo in cs us sy id
0 0 0 0 8070944 12756 21344 0 0 2 713 162 10 0 1 99
0 0 0 0 8070944 12756 21344 0 0 0 0 1123 6 0 0 100
0 0 0 0 8070944 12756 21344 0 0 0 0 1026 4 0 0 100
0 10 0 0 8069840 12756 21344 0 0 0 39936 1149 194 1 4 95
0 10 0 0 8069816 12772 21344 0 0 0 101500 1502 221 0 4 96
0 10 0 0 8069816 12772 21344 0 0 0 83968 1200 167 0 4 96
0 10 0 0 8069816 12772 21344 0 0 0 80896 1197 164 0 4 96
0 10 0 0 8069816 12772 21344 0 0 0 83968 1192 169 0 4 96
0 10 1 0 8069816 12776 21344 0 0 0 84996 1206 178 0 4 96
0 10 0 0 8069816 12776 21344 0 0 0 82944 1191 167 0 4 96
0 10 0 0 8069816 12776 21344 0 0 0 80896 1195 164 0 4 96
0 10 0 0 8069816 12776 21344 0 0 0 83968 1196 170 0 4 96
0 10 0 0 8069816 12776 21344 0 0 0 82944 1201 165 0 4 96
0 10 1 0 8069808 12776 21344 0 0 0 86144 1195 173 0 4 95
0 10 0 0 8069744 12776 21344 0 0 0 70896 1389 149 0 4 96
0 2 0 0 8070320 12776 21344 0 0 0 52420 1183 218 0 2 97
0 1 0 0 8070464 12776 21344 0 0 0 10244 1051 33 0 0 100
0 0 0 0 8071016 12776 21344 0 0 0 36 1044 42 0 0 100
0 0 1 0 8071016 12776 21344 0 0 0 4 1026 8 0 0 100


Total elapsed time of the 10 writers:

0.03user 1.83system 0:14.13elapsed 13%CPU (0avgtext+0avgdata 0maxresident)k
0inputs+0outputs (1986major+3228minor)pagefaults 0swaps


Write 10 100 MB files to 10 different disks, write-and-fsync with -f (not
O_DIRECT):

[patman@elm3b79 iostuff]$ vmstat 1 10000
procs memory swap io system cpu
r b w swpd free buff cache si so bi bo in cs us sy id
0 0 0 0 8070976 12920 21344 0 0 2 745 162 10 0 1 99
0 0 0 0 8070976 12920 21344 0 0 0 0 1153 4 0 0 100
0 0 0 0 8070912 12928 21344 0 0 0 28 1028 12 0 0 100
10 0 0 0 8054000 12928 37416 0 0 0 0 1063 114 0 5 95
10 0 0 0 7767944 12928 322948 0 0 0 0 1023 41 0 50 50
10 0 0 0 7482504 12928 604124 0 0 0 0 1025 31 0 50 50
6 4 0 0 7222992 12928 858380 0 0 0 342668 1056 80 0 55 45
0 10 1 0 7032160 12944 1045348 0 0 0 271940 1083 134 0 39 61
0 10 1 0 7032048 12944 1045348 0 0 0 22040 1079 42 0 2 97
0 10 1 0 7031832 12944 1045348 0 0 0 22544 1081 42 0 2 98
0 10 1 0 7031800 12944 1045348 0 0 0 8120 1075 32 0 1 99
0 10 1 0 7031800 12944 1045348 0 0 0 0 1073 92 0 1 99
0 10 1 0 7031800 12944 1045348 0 0 0 0 1072 96 0 1 99
0 10 1 0 7031800 12944 1045348 0 0 0 0 1071 88 0 1 99
0 10 1 0 7031816 12944 1045348 0 0 0 3588 1078 98 0 1 99
0 10 1 0 7031856 12944 1045348 0 0 0 18452 1093 96 0 2 98
0 9 1 0 7031960 12944 1045348 0 0 0 11788 1075 58 0 2 98
0 8 1 0 7032048 12944 1045348 0 0 0 36388 1082 55 0 2 97
0 8 1 0 7032000 12944 1045348 0 0 0 25624 1085 40 0 2 98
0 8 1 0 7031832 12944 1045348 0 0 0 25576 1098 42 0 2 98
0 8 1 0 7031760 12944 1045348 0 0 0 22040 1084 58 0 2 98
0 8 1 0 7031720 12944 1045348 0 0 0 23060 1084 60 0 2 98
0 8 1 0 7031664 12944 1045348 0 0 0 10244 1089 60 0 1 98
0 8 1 0 7031672 12944 1045348 0 0 0 18960 1095 70 0 2 98
0 8 1 0 7031656 12944 1045348 0 0 0 40124 1135 108 0 2 97
0 8 1 0 7031672 12944 1045348 0 0 0 37700 1136 92 0 3 97
0 7 1 0 7031672 12944 1045348 0 0 0 22972 1140 113 0 3 97
0 7 1 0 7031688 12944 1045348 0 0 0 17944 1144 163 0 3 97
0 7 1 0 7031696 12944 1045348 0 0 0 23008 1152 183 0 2 97
0 6 1 0 7031792 12944 1045348 0 0 0 18932 1146 177 0 3 97
0 6 1 0 7031816 12944 1045348 0 0 0 1496 1133 176 0 2 98
0 5 0 0 7032040 12944 1045348 0 0 0 0 1127 34 0 2 98
0 4 0 0 7032312 12944 1045348 0 0 0 44 1131 31 0 1 99
procs memory swap io system cpu
r b w swpd free buff cache si so bi bo in cs us sy id
0 3 0 0 7032688 12944 1045348 0 0 0 8 1098 26 0 1 99
0 3 0 0 7033096 12944 1045348 0 0 0 0 1085 18 0 1 99
0 1 0 0 7033808 12944 1045348 0 0 0 0 1066 24 0 1 99

Total elapsed time:

0.04user 14.80system 0:33.35elapsed 44%CPU (0avgtext+0avgdata 0maxresident)k
0inputs+0outputs (1986major+3228minor)pagefaults 0swaps

-- Patrick Mansfield

2003-02-25 04:32:53

by Andrew Morton

Subject: Re: 2.5.62-mm2 slow file system writes across multiple disks

Patrick Mansfield <[email protected]> wrote:
>
> On Mon, Feb 24, 2003 at 01:53:23PM -0800, Andrew Morton wrote:
>
> > Could be that concurrent umount isn't a good way of getting scalable
> > writeout; I can't say that I've ever looked...
> >
> > Could you try putting a `sync' in there somewhere?
> >
> > Or even better, throw away dd and use write-and-fsync from ext3 CVS. Give it
> > the -f flag to force an fsync against each file as it is closed.
> >
> > http://www.zip.com.au/~akpm/linux/ext3/
>
> Using fsync didn't seem to make any difference.

Something is up.

> I moved to 2.5.62-mm3 [I had to drop back to qlogicisp for my boot disk,
> and run the feral driver as a module in order to boot without hanging], and
> ran write-and-fsync with -f, with and without -o (O_DIRECT).

Is this using an enormous request queue or really deep TCQ or something?
I always turn TCQ off, stupid noxious thing it is.

> What keeps pdflush running when get_request_wait sleeps? I thought there
> was (past tense) some creation of multiple pdflushes to handle such cases.

When searching for pages to write out, pdflush will skip over superblocks which
are backed by queues that are currently under write congestion. Once pdflush
has walked all the superblocks it takes a little nap, waiting for some write
requests to be put back, and then it makes another pass over the dirty superblocks.

So pdflush can keep many queues busy, and should never wait in get_request_wait().
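
(Schematically, as a toy user-space model of that loop - the struct and
function names here are invented for illustration, and this is not the
actual mm writeback code:)

#include <stdio.h>
#include <unistd.h>

struct queue { int congested; };
struct super { const char *name; int dirty_pages; struct queue *q; };

static void writeback_some(struct super *sb)
{
        int chunk = sb->dirty_pages > 100 ? 100 : sb->dirty_pages;

        sb->dirty_pages -= chunk;       /* push some pages at this sb's queue */
        sb->q->congested = 1;           /* ...which fills the queue up */
}

static void background_writeout(struct super *sbs, int nr)
{
        int i, remaining = 1;

        while (remaining) {
                remaining = 0;
                for (i = 0; i < nr; i++) {
                        if (!sbs[i].q->congested)
                                writeback_some(&sbs[i]);        /* else skip it */
                        if (sbs[i].dirty_pages)
                                remaining = 1;
                }
                if (remaining) {
                        usleep(100 * 1000);     /* little nap */
                        for (i = 0; i < nr; i++)
                                sbs[i].q->congested = 0;        /* requests came back */
                }
        }
}

int main(void)
{
        struct queue q0 = { 0 }, q1 = { 0 };
        struct super sbs[2] = { { "sda1", 500, &q0 }, { "sdb1", 300, &q1 } };

        background_writeout(sbs, 2);
        printf("all superblocks clean\n");
        return 0;
}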

There are some little accident scenarios which could cause pdflush to get stuck,
usually in a read from disk. But having more than just a single instance is
probably overkill. I don't have enough disks to know, really - 17 in this box,
but not enough scsi/pci bandwidth to feed them all.

> Here are more details:

Something is really up.

When I do

for i in sda5 sdb5 sdf5 sdg5 hde5 hdg5
do
        time write-and-fsync -f -m 100 /mnt/$i/1 &
done

I get:

write-and-fsync -f -m 100 /mnt/$i/1 0.00s user 1.98s system 59% cpu 3.344 total
write-and-fsync -f -m 100 /mnt/$i/1 0.00s user 2.12s system 60% cpu 3.519 total
write-and-fsync -f -m 100 /mnt/$i/1 0.00s user 1.51s system 20% cpu 7.430 total
write-and-fsync -f -m 100 /mnt/$i/1 0.00s user 1.67s system 20% cpu 7.975 total
write-and-fsync -f -m 100 /mnt/$i/1 0.00s user 1.38s system 15% cpu 8.785 total
write-and-fsync -f -m 100 /mnt/$i/1 0.00s user 1.44s system 12% cpu 11.611 total

and

procs -----------memory---------- ---swap-- -----io---- --system-- ----cpu----
r b swpd free buff cache si so bi bo in cs us sy id wa
0 0 0 7196732 10668 8580 0 0 6 446 257 8 0 1 96 2
0 0 0 7196732 10668 8580 0 0 0 0 1160 6 0 0 100 0
0 0 0 7196732 10668 8580 0 0 0 0 1026 4 0 0 100 0
8 0 0 7089060 10796 115476 0 0 0 0 1044 57 0 34 65 0
10 0 0 6837660 11044 352880 0 0 0 39848 1250 32 0 100 0 0
9 0 0 6692308 11168 482548 0 0 0 134964 1517 45 0 100 0 0
0 7 0 6542812 11300 622980 0 0 8 294936 1560 89 0 79 0 21
0 5 0 6572796 11300 622980 0 0 0 74708 1484 556 0 7 5 88
0 5 0 6572860 11300 622980 0 0 0 38904 1170 101 0 1 25 74
0 5 0 6572892 11300 622980 0 0 0 20968 1170 203 0 1 24 75
0 4 0 6573324 11300 622980 0 0 0 7180 1178 277 0 1 24 74
0 2 0 6573580 11300 622980 0 0 0 3544 1143 174 0 0 30 69
0 1 0 6573948 11300 622980 0 0 0 4 1079 18 0 0 69 31
0 1 0 6573996 11300 622980 0 0 0 0 1058 5 0 0 75 25

100M isn't a lot. Using 1G lets things settle out better.


So sorry, don't know. Badari had a setup like that happily sustaining 180 MB/sec
a while back.


2003-02-25 19:24:44

by Patrick Mansfield

Subject: Re: 2.5.62-mm2 slow file system writes across multiple disks

On Mon, Feb 24, 2003 at 08:43:21PM -0800, Andrew Morton wrote:
> Patrick Mansfield <[email protected]> wrote:

> > I moved to 2.5.62-mm3 [I had to drop back to qlogicisp for my boot disk,
> > and run the feral driver as a module in order to boot without hanging], and
> > ran write-and-fsync with -f, with and without -o (O_DIRECT).

> Is this using an enormous request queue or really deep TCQ or something?
> I always turn TCQ off, stupid noxious thing it is.

Yes - pretty high, the feral driver is defaulting to 63 (comments there
say "FIX LATER"). Changing it to 8 gave much improved performance.

I'm on a slightly different kernel, anyway:

10 fsync writes of 200 MB to 10 separate disks, queue depth (TCQ) of 63 gave:

0.05user 37.14system 1:09.39elapsed 53%CPU (0avgtext+0avgdata 0maxresident)k
0inputs+0outputs (1986major+3228minor)pagefaults 0swaps

10 fsync writes of 200 MB to 10 separate disks, queue depth of 8 gave:

0.05user 38.56system 0:32.71elapsed 118%CPU (0avgtext+0avgdata 0maxresident)k
0inputs+0outputs (1986major+3228minor)pagefaults 0swaps

The total time dropped by almost half. vmstat numbers were much smoother.

It would be nice if a larger queue depth did not kill performance.

The larger queue depths can be nice for disk arrays with lots of cache and
(more) random IO patterns.

-- Patrick Mansfield

2003-02-26 08:34:11

by Andrew Morton

Subject: Re: 2.5.62-mm2 slow file system writes across multiple disks

Patrick Mansfield <[email protected]> wrote:
>
> On Mon, Feb 24, 2003 at 08:43:21PM -0800, Andrew Morton wrote:
> > Patrick Mansfield <[email protected]> wrote:
>
> > > I moved to 2.5.62-mm3 [I had to drop back to qlogicisp for my boot disk,
> > > and run the feral driver as a module in order to boot without hanging], and
> > > ran write-and-fsync with -f, with and without -o (O_DIRECT).
>
> > Is this using an enormous request queue or really deep TCQ or something?
> > I always turn TCQ off, stupid noxious thing it is.
>
> Yes - pretty high, the feral driver is defaulting to 63 (comments there
> say "FIX LATER"). Changing it to 8 gave much improved performance.
>
> I'm on a slightly different kernel, anyway:
>
> 10 fsync writes of 200 MB to 10 separate disks, queue depth (TCQ) of 63 gave:
>
> 0.05user 37.14system 1:09.39elapsed 53%CPU (0avgtext+0avgdata 0maxresident)k
> 0inputs+0outputs (1986major+3228minor)pagefaults 0swaps
>
> 10 fsync writes of 200 MB to 10 separate disks, queue depth of 8 gave:
>
> 0.05user 38.56system 0:32.71elapsed 118%CPU (0avgtext+0avgdata 0maxresident)k
> 0inputs+0outputs (1986major+3228minor)pagefaults 0swaps
>
> The total time dropped by almost half. vmstat numbers were much smoother.

Damn, I wonder what's up with that. I tried to reproduce this with an
Adaptec controller, but there did not appear to be a significant difference
between zero tags and 64 tags. Nor was there an appreciable difference between
pagecache writeout and O_DIRECT writeout.

> It would be nice if a larger queue depth did not kill performance.

Does the other qlogic driver exhibit the same thing?

Does writeout to a single disk exhibit the same thing?

> The larger queue depths can be nice for disk arrays with lots of cache and
> (more) random IO patterns.

So says the scsi lore ;) Have you observed this yourself? Have you
any numbers handy?

2003-02-26 12:05:43

by Helge Hafting

Subject: Re: 2.5.62-mm2 slow file system writes across multiple disks

Andrew Morton wrote:
> Patrick Mansfield <[email protected]> wrote:
[...]
>
>>The larger queue depths can be nice for disk arrays with lots of cache and
>>(more) random IO patterns.
>
>
> So says the scsi lore ;) Have you observed this yourself? Have you
> any numbers handy?

I believe deep queues might work well if the drives did their own
anticipatory scheduling. After all, they know both the true geometry and
the exact rotational position and latency.
But current drives aren't that good, so all a deep queue achieves is
extra seeks, which sometimes kill performance. The fix is simple -
shorten the queues. Long queues aren't really a goal in themselves -
performance is.

Helge Hafting

2003-02-26 17:07:52

by Mike Anderson

Subject: Re: 2.5.62-mm2 slow file system writes across multiple disks

Andrew Morton [[email protected]] wrote:
>
> > It would be nice if a larger queue depth did not kill performance.
>
> Does the other qlogic driver exhibit the same thing?

Well, the qlogic-provided driver should exhibit slightly different
behavior, as its per-device queue depth is 16 and the request ring count
is 128.

The feral driver is currently running a per-device queue depth of 63 and a
request ring size of 64 (if I am reading the driver correctly?). When
the request count is exceeded, I believe it should return 1 to the call
of queuecommand. Can you tell if scsi_queue_insert is being called?
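
(Schematically, a toy model of that busy-return path - not the feral
driver's code; only the 64-entry ring and the 10 x 63 figures come from
this thread:)

#include <stdio.h>

#define RING_SIZE 64            /* feral's request ring, per above */

struct ring { int in_flight; };

static int ring_queuecommand(struct ring *r)
{
        if (r->in_flight >= RING_SIZE)
                return 1;       /* "busy" - the midlayer then requeues via scsi_queue_insert() */
        r->in_flight++;
        return 0;
}

int main(void)
{
        struct ring r = { 0 };
        int i, busy = 0;

        /* 10 devices each allowed 63 tags easily overruns a 64-entry ring */
        for (i = 0; i < 10 * 63; i++)
                busy += ring_queuecommand(&r);

        printf("%d of %d submissions bounced as host-busy\n", busy, 10 * 63);
        return 0;
}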

> > The larger queue depths can be nice for disk arrays with lots of cache and
> > (more) random IO patterns.
>
> So says the scsi lore ;) Have you observed this yourself? Have you
> any numbers handy?

I do not have current numbers but you need to get the command on the
wire to get any benefit.

-andmike
--
Michael Anderson
[email protected]

2003-02-28 01:57:05

by Patrick Mansfield

Subject: Re: 2.5.62-mm2 slow file system writes across multiple disks

On Wed, Feb 26, 2003 at 12:44:54AM -0800, Andrew Morton wrote:

> Does the other qlogic driver exhibit the same thing?

OK I finally tried out the qlogic driver on the same 10 drives, actually
with scsi-misc-2.5 (2.5.63).

The qlogic driver is OK performance-wise - as Mike pointed out, it sets a lower
queue depth; and even though it sets can_queue higher than the feral driver,
the qlogic driver software-queues in the same cases where the feral driver
would give us a host busy (i.e. queuecommand returns 1).

andmike> Well, the qlogic-provided driver should exhibit slightly different
andmike> behavior, as its per-device queue depth is 16 and the request ring count
andmike> is 128.
andmike>
andmike> The feral driver is currently running a per-device queue depth of 63 and a
andmike> request ring size of 64 (if I am reading the driver correctly?). When
andmike> the request count is exceeded, I believe it should return 1 to the call
andmike> of queuecommand. Can you tell if scsi_queue_insert is being called?

As Mike implies, the feral driver is setting can_queue too high, so in
addition to large queue depth effects, I am also hitting the scsi host busy
code paths - yes, it is calling scsi_queue_insert. The host busy code is
not meant to be hit so often, and likely leads to lower performance.

So the feral driver needs lower can_queue (and/or queueing changes) and
lower queue_depth limits.

> Does writeout to a single disk exhibit the same thing?

No, single-disk IO performance is OK (with a queue depth/TCQ of 63 and
can_queue of 744), so the too-high can_queue with the host busy's is probably
hurting performance more than the high queue_depth.

> > The larger queue depths can be nice for disk arrays with lots of cache and
> > (more) random IO patterns.
>
> So says the scsi lore ;) Have you observed this yourself? Have you
> any numbers handy?

No and no :(

I'm not sure if the disk arrays we have available have enough memory to
show such effects (I assume standard disk caches are not large enough to
have much of an effect).

-- Patrick Mansfield