Hi,
Andries Brouwer recently submitted 32 bit dev_t patches,
which are in 2.5.65-mm2. This patch applies on top of those patches to support
more than 256 disks. This is for playing only.
I tested this with 4000 disks using scsi_debug. I attached my actual
disks (50) after 4000 scsi_debug disks. I am able to access my disks
fine and do IO on them.
Problems (so far):
1) sd.c - sd_index_bits[] arrays became big - need to be fixed.
2) 4000 disks eat up lots of low memory (~460 MB). Here is the
/proc/meminfo output before & after insmod.
Before:
MemTotal: 3883276 kB
MemFree: 3808028 kB
Buffers: 3240 kB
Cached: 41860 kB
SwapCached: 0 kB
Active: 45360 kB
Inactive: 7288 kB
HighTotal: 3014616 kB
HighFree: 2961856 kB
LowTotal: 868660 kB
LowFree: 846172 kB
SwapTotal: 2040244 kB
SwapFree: 2040244 kB
Dirty: 192 kB
Writeback: 0 kB
Mapped: 14916 kB
Slab: 7164 kB
Committed_AS: 12952 kB
PageTables: 312 kB
ReverseMaps: 1895
====
After:
MemTotal: 3883276 kB
MemFree: 3224140 kB
Buffers: 3880 kB
Cached: 140376 kB
SwapCached: 0 kB
Active: 47512 kB
Inactive: 105508 kB
HighTotal: 3014616 kB
HighFree: 2838144 kB
LowTotal: 868660 kB
LowFree: 385996 kB
SwapTotal: 2040244 kB
SwapFree: 2040244 kB
Dirty: 92 kB
Writeback: 0 kB
Mapped: 16172 kB
Slab: 464364 kB
Committed_AS: 14996 kB
PageTables: 412 kB
ReverseMaps: 2209
Badari Pulavarty wrote:
> Hi,
>
> Andries Brouwer recently submitted 32 bit dev_t patches,
> which are in 2.5.65-mm2. This patch applies on those patches to support
> more than 256 disks. This is for playing only.
>
> I tested this with 4000 disks using scsi_debug. I attached my actual
> disks (50) after 4000 scsi_debug disks. I am able to access my disks
> fine and do IO on them.
>
> Problems (so far):
>
> 1) sd.c - sd_index_bits[] arrays became big - need to be fixed.
>
> 2) 4000 disks eat up lots of low memory (~460 MB). Here is the
> /proc/meminfo output before & after insmod.
>
> Before:
> MemTotal: 3883276 kB
> MemFree: 3808028 kB
> Buffers: 3240 kB
> Cached: 41860 kB
> SwapCached: 0 kB
> Active: 45360 kB
> Inactive: 7288 kB
> HighTotal: 3014616 kB
> HighFree: 2961856 kB
> LowTotal: 868660 kB
> LowFree: 846172 kB
> SwapTotal: 2040244 kB
> SwapFree: 2040244 kB
> Dirty: 192 kB
> Writeback: 0 kB
> Mapped: 14916 kB
> Slab: 7164 kB
> Committed_AS: 12952 kB
> PageTables: 312 kB
> ReverseMaps: 1895
> ====
> After:
> MemTotal: 3883276 kB
> MemFree: 3224140 kB
> Buffers: 3880 kB
> Cached: 140376 kB
> SwapCached: 0 kB
> Active: 47512 kB
> Inactive: 105508 kB
> HighTotal: 3014616 kB
> HighFree: 2838144 kB
> LowTotal: 868660 kB
> LowFree: 385996 kB
> SwapTotal: 2040244 kB
> SwapFree: 2040244 kB
> Dirty: 92 kB
> Writeback: 0 kB
> Mapped: 16172 kB
> Slab: 464364 kB
> Committed_AS: 14996 kB
> PageTables: 412 kB
> ReverseMaps: 2209
Badari,
I poked around looking for data on the size issue.
Here are the byte sizes for the per device and per host
structures in scsi_debug and the scsi mid level for
i386, non-smp in lk 2.5.65:
sizeof(sdebug_dev_info)=60, sizeof(scsi_device)=376
sizeof(sdebug_host_info)=24, sizeof(Scsi_Host)=224
So for 4000 disks they should be responsible for about
2 MB.
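As a rough check of that figure, using the sizes above:
   4000 * (sizeof(scsi_device) + sizeof(sdebug_dev_info))
     = 4000 * (376 + 60) bytes
     ~= 1.7 MB
plus a handful of Scsi_Host/sdebug_host_info pairs, so on the order of
2 MB total, ignoring slab padding and per-object overhead.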
The scsi_cmd_cache slab info went from this before those
84 pseudo disks were added:
# cat slabinfo_pre.txt
scsi_cmd_cache 3 11 356 1 1 1 : 32 16 : 22 301 6 5 0 0 43 : 3235 31 3264 0
to this afterwards:
# cat slabinfo_post.txt
scsi_cmd_cache 44 55 356 5 5 1 : 32 16 : 66 398 12 7 0 0 43 : 5837 40 5833 0
I did notice a rather large growth of nodes
in sysfs. For 84 added scsi_debug pseudo disks the number
of sysfs nodes went from 686 to 3347.
Does anybody know what is the per node memory cost of sysfs?
Doug Gilbert
Douglas Gilbert <[email protected]> wrote:
>
> > Slab: 464364 kB
It's all in slab.
> I did notice a rather large growth of nodes
> in sysfs. For 84 added scsi_debug pseudo disks the number
> of sysfs nodes went from 686 to 3347.
>
> Does anybody know what is the per node memory cost of sysfs?
Let's see all of /proc/slabinfo please.
slabinfo - version: 1.2 (statistics)
hpsb_packet 0 0 132 0 0 1 : 32 16 : 16 16 1 1 0 0 61 : 46 1 47 0
rpc_buffers 8 8 2048 4 4 1 : 32 16 : 8 8 4 0 0 0 34 : 4 4 0 0
rpc_tasks 8 19 204 1 1 1 : 32 16 : 16 16 1 0 0 0 51 : 7 1 0 0
rpc_inode_cache 0 0 480 0 0 1 : 32 16 : 0 0 0 0 0 0 40 : 0 0 0 0
unix_sock 10 21 576 3 3 1 : 32 16 : 21 32 3 0 0 0 39 : 593 4 587 0
tcp_tw_bucket 0 0 100 0 0 1 : 32 16 : 0 0 0 0 0 0 71 : 0 0 0 0
tcp_bind_bucket 7 125 28 1 1 1 : 32 16 : 16 16 1 0 0 0 157 : 6 1 0 0
tcp_open_request 0 0 68 0 0 1 : 32 16 : 0 0 0 0 0 0 88 : 0 0 0 0
inet_peer_cache 0 0 52 0 0 1 : 32 16 : 0 0 0 0 0 0 104 : 0 0 0 0
secpath_cache 0 0 32 0 0 1 : 32 16 : 0 0 0 0 0 0 144 : 0 0 0 0
flow_cache 0 0 60 0 0 1 : 32 16 : 0 0 0 0 0 0 94 : 0 0 0 0
xfrm4_dst_cache 0 0 220 0 0 1 : 32 16 : 0 0 0 0 0 0 50 : 0 0 0 0
ip_fib_hash 10 125 28 1 1 1 : 32 16 : 16 16 1 0 0 0 157 : 10 1 1 0
ip_dst_cache 5 19 208 1 1 1 : 32 16 : 19 63 1 0 0 0 51 : 3 4 2 0
arp_cache 2 21 188 1 1 1 : 32 16 : 17 32 1 0 0 0 53 : 0 2 0 0
raw4_sock 0 0 576 0 0 1 : 32 16 : 7 14 2 2 0 0 39 : 4 2 6 0
udp_sock 7 7 576 1 1 1 : 32 16 : 14 17 2 1 0 0 39 : 37 3 33 0
tcp_sock 16 21 1088 3 3 2 : 32 16 : 21 30 3 0 0 0 39 : 49 6 39 0
scsi_cmd_cache 66 66 356 6 6 1 : 32 16 : 66 220 9 3 0 0 43 : 5735 22 5713 0
nfs_write_data 36 40 452 5 5 1 : 32 16 : 40 40 5 0 0 0 40 : 31 5 0 0
nfs_read_data 32 36 436 4 4 1 : 32 16 : 36 36 4 0 0 0 41 : 28 4 0 0
nfs_inode_cache 0 0 704 0 0 2 : 32 16 : 0 0 0 0 0 0 43 : 0 0 0 0
nfs_page 0 0 108 0 0 1 : 32 16 : 0 0 0 0 0 0 68 : 0 0 0 0
isofs_inode_cache 0 0 440 0 0 1 : 32 16 : 0 0 0 0 0 0 41 : 0 0 0 0
ext2_inode_cache 4 7 576 1 1 1 : 32 16 : 7 7 1 0 0 0 39 : 3 1 0 0
journal_head 21 62 60 1 1 1 : 32 16 : 420 1124 9 2 0 1 94 : 6048 73 6046 55
revoke_table 1 144 24 1 1 1 : 32 16 : 16 16 1 0 0 0 176 : 0 1 0 0
revoke_record 0 0 28 0 0 1 : 32 16 : 16 32 2 2 0 0 157 : 3 2 5 0
ext3_inode_cache 2275 2275 576 325 325 1 : 32 16 : 2275 2290 325 0 0 0 39 : 1969 331 28 0
ext3_xattr 0 0 60 0 0 1 : 32 16 : 0 0 0 0 0 0 94 : 0 0 0 0
eventpoll_pwq 0 0 48 0 0 1 : 32 16 : 0 0 0 0 0 0 109 : 0 0 0 0
eventpoll_epi 0 0 72 0 0 1 : 32 16 : 0 0 0 0 0 0 85 : 0 0 0 0
kioctx 0 0 268 0 0 1 : 32 16 : 0 0 0 0 0 0 46 : 0 0 0 0
kiocb 0 0 172 0 0 1 : 32 16 : 0 0 0 0 0 0 55 : 0 0 0 0
dnotify_cache 0 0 32 0 0 1 : 32 16 : 0 0 0 0 0 0 144 : 0 0 0 0
file_lock_cache 17 30 128 1 1 1 : 32 16 : 30 173 1 0 0 0 62 : 1093 13 1089 0
fasync_cache 0 0 28 0 0 1 : 32 16 : 0 0 0 0 0 0 157 : 0 0 0 0
shmem_inode_cache 2 7 576 1 1 1 : 32 16 : 7 13 1 0 0 0 39 : 0 2 0 0
idr_layer_cache 0 0 148 0 0 1 : 32 16 : 0 0 0 0 0 0 58 : 0 0 0 0
posix_timers_cache 0 0 136 0 0 1 : 32 16 : 0 0 0 0 0 0 60 : 0 0 0 0
uid_cache 5 101 36 1 1 1 : 32 16 : 18 32 1 0 0 0 133 : 3 2 0 0
sgpool-MAX_PHYS_SEGMENTS 32 32 2048 16 16 1 : 32 16 : 34 38 18 2 0 0 34 : 0 35 3 0
sgpool-64 32 32 1024 8 8 1 : 32 16 : 36 48 12 4 0 0 36 : 6 36 10 0
sgpool-32 32 32 512 4 4 1 : 32 16 : 40 72 9 5 0 0 40 : 33 37 38 0
sgpool-16 32 42 268 3 3 1 : 32 16 : 42 82 3 0 0 0 46 : 228 37 233 0
sgpool-8 50 54 140 2 2 1 : 32 16 : 54 227 2 0 0 0 59 : 4322 45 4335 0
deadline_drq 22274 22361 64 379 379 1 : 32 16 : 22538 32880 507 0 0 1 91 : 31317 2219 10606 658
blkdev_requests 22273 22300 156 892 892 1 : 32 16 : 22541 32829 1204 5 0 2 57 : 30944 2592 10609 655
biovec-BIO_MAX_PAGES 256 260 3072 52 52 4 : 32 16 : 256 256 52 0 0 0 37 : 0 256 0 0
biovec-128 256 260 1536 52 52 2 : 32 16 : 256 256 52 0 0 0 37 : 0 256 0 0
biovec-64 256 260 768 52 52 1 : 32 16 : 260 268 52 0 0 0 37 : 8 259 11 0
biovec-16 256 266 204 14 14 1 : 32 16 : 266 296 14 0 0 0 51 : 293 260 297 0
biovec-4 256 310 60 5 5 1 : 32 16 : 272 304 5 0 0 0 94 : 481 259 484 0
biovec-1 285 288 24 2 2 1 : 32 16 : 512 1277 6 3 0 0 176 : 7486 320 7503 47
bio 274 318 72 6 6 1 : 32 16 : 509 1391 14 3 0 1 85 : 8265 333 8287 55
sock_inode_cache 36 40 484 5 5 1 : 32 16 : 40 55 5 0 0 0 40 : 545 10 519 0
skbuff_head_cache 16 24 160 1 1 1 : 32 16 : 24 42 1 0 0 0 56 : 13 3 0 0
sock 2 9 444 1 1 1 : 32 16 : 9 16 1 0 0 0 41 : 14 3 15 0
proc_inode_cache 82 99 440 11 11 1 : 32 16 : 99 179 11 0 0 0 41 : 572 17 521 1
sigqueue 9 27 144 1 1 1 : 32 16 : 16 16 1 0 0 0 59 : 767 1 768 0
radix_tree_node 840 840 272 60 60 1 : 32 16 : 840 884 60 0 0 0 46 : 774 73 11 0
cdev_cache 143 250 28 2 2 1 : 32 16 : 158 173 2 0 0 0 157 : 131 12 0 0
bdev_cache 4 32 120 1 1 1 : 32 16 : 20 32 1 0 0 0 64 : 89 2 87 0
mnt_cache 19 53 72 1 1 1 : 32 16 : 35 54 1 0 0 0 85 : 24 9 14 0
inode_cache 3375 3375 424 375 375 1 : 32 16 : 3375 3418 375 0 0 0 41 : 3340 624 589 0
dentry_cache 6744 6745 204 355 355 1 : 32 16 : 6744 7379 368 2 0 1 51 : 7671 1004 1911 34
filp 286 300 156 12 12 1 : 32 16 : 291 307 12 0 0 0 57 : 261 25 0 0
names_cache 1 1 4096 1 1 1 : 32 16 : 8 47 23 22 0 0 33 : 15424 47 15471 0
buffer_head 2458 2478 64 42 42 1 : 32 16 : 2458 3060 42 0 0 1 91 : 5662 206 3396 28
mm_struct 35 35 576 5 5 1 : 32 16 : 35 93 5 0 0 0 39 : 1494 11 1479 0
vm_area_struct 500 500 76 10 10 1 : 32 16 : 527 8855 13 1 0 1 82 : 29247 562 28815 513
fs_cache 40 84 44 1 1 1 : 32 16 : 40 128 2 1 0 0 116 : 946 8 929 0
files_cache 27 27 424 3 3 1 : 32 16 : 27 59 4 1 0 0 41 : 944 10 929 0
signal_cache 107 144 52 2 2 1 : 32 16 : 107 205 2 0 0 0 104 : 950 19 877 0
sighand_cache 87 87 1344 29 29 1 : 32 16 : 87 106 30 1 0 0 35 : 920 39 873 0
task_struct 95 95 1632 19 19 2 : 32 16 : 100 117 20 1 0 0 37 : 937 32 874 0
pte_chain 1650 1650 76 33 33 1 : 32 16 : 1650 6684 42 1 0 1 82 : 10041 455 8545 310
pgd 27 27 4096 27 27 1 : 32 16 : 33 51 48 21 0 0 33 : 1455 50 1479 0
size-131072(DMA) 0 0 131072 0 0 32 : 8 4 : 0 0 0 0 0 0 9 : 0 0 0 0
size-131072 0 0 131072 0 0 32 : 8 4 : 0 0 0 0 0 0 9 : 0 0 0 0
size-65536(DMA) 0 0 65536 0 0 16 : 8 4 : 0 0 0 0 0 0 9 : 0 0 0 0
size-65536 0 0 65536 0 0 16 : 8 4 : 0 0 0 0 0 0 9 : 0 0 0 0
size-32768(DMA) 0 0 32768 0 0 8 : 8 4 : 0 0 0 0 0 0 9 : 0 0 0 0
size-32768 0 0 32768 0 0 8 : 8 4 : 1 1 1 1 0 0 9 : 0 1 1 0
size-16384(DMA) 0 0 16384 0 0 4 : 8 4 : 0 0 0 0 0 0 9 : 0 0 0 0
size-16384 1 1 16384 1 1 4 : 8 4 : 2 4 4 3 0 0 9 : 2 4 5 0
size-8192(DMA) 0 0 8192 0 0 2 : 8 4 : 0 0 0 0 0 0 9 : 0 0 0 0
size-8192 91 91 8192 91 91 2 : 8 4 : 92 96 95 4 0 0 9 : 149 96 154 0
size-4096(DMA) 0 0 4096 0 0 1 : 32 16 : 0 0 0 0 0 0 33 : 0 0 0 0
size-4096 27 27 4096 27 27 1 : 32 16 : 34 57 55 28 0 0 33 : 489 57 519 0
size-2048(DMA) 0 0 2048 0 0 1 : 32 16 : 0 0 0 0 0 0 34 : 0 0 0 0
size-2048 94 94 2048 47 47 1 : 32 16 : 94 96 47 0 0 0 34 : 88 49 43 0
size-1024(DMA) 0 0 1024 0 0 1 : 32 16 : 0 0 0 0 0 0 36 : 0 0 0 0
size-1024 40 40 1024 10 10 1 : 32 16 : 40 67 11 1 0 0 36 : 702 39 703 0
size-512(DMA) 0 0 512 0 0 1 : 32 16 : 8 16 2 2 0 0 40 : 83 2 85 0
size-512 234 248 512 31 31 1 : 32 16 : 248 282 32 1 0 0 40 : 1721 46 1537 0
size-256(DMA) 0 0 268 0 0 1 : 32 16 : 14 28 2 2 0 0 46 : 534 2 536 0
size-256 94 98 268 7 7 1 : 32 16 : 98 149 8 1 0 0 46 : 736 14 657 0
size-192(DMA) 0 0 204 0 0 1 : 32 16 : 0 0 0 0 0 0 51 : 0 0 0 0
size-192 209 209 204 11 11 1 : 32 16 : 209 274 11 0 0 0 51 : 725 75 592 0
size-128(DMA) 0 0 140 0 0 1 : 32 16 : 0 0 0 0 0 0 59 : 0 0 0 0
size-128 681 702 140 26 26 1 : 32 16 : 691 955 26 0 0 0 59 : 2740 297 2351 6
size-64(DMA) 0 0 76 0 0 1 : 32 16 : 0 0 0 0 0 0 82 : 0 0 0 0
size-64 750 750 76 15 15 1 : 32 16 : 750 1041 15 0 0 0 82 : 1141 247 623 27
size-32(DMA) 0 0 44 0 0 1 : 32 16 : 0 0 0 0 0 0 116 : 0 0 0 0
size-32 452 504 44 6 6 1 : 32 16 : 452 619 6 0 0 0 116 : 18166 190 17772 142
kmem_cache 110 110 180 5 5 1 : 32 16 : 110 110 5 0 0 0 54 : 34 72 0 0
Douglas Gilbert <[email protected]> wrote:
>
> Andrew Morton wrote:
> > Douglas Gilbert <[email protected]> wrote:
> >
> >>>Slab: 464364 kB
> >>
> >
> > It's all in slab.
> >
> >
> >>I did notice a rather large growth of nodes
> >>in sysfs. For 84 added scsi_debug pseudo disks the number
> >>of sysfs nodes went from 686 to 3347.
> >>
> >>Does anybody know what is the per node memory cost of sysfs?
> >
> >
> > Let's see all of /proc/slabinfo please.
>
> Andrew,
> Attachments are /proc/slabinfo pre and post:
> $ modprobe scsi_debug add_host=42 num_devs=2
> which adds 84 pseudo disks.
>
OK, thanks. So with 48 disks you've lost five megabytes to blkdev_requests
and deadline_drq objects. With 4000 disks, you're toast. That's enough
request structures to put 200 gigabytes of memory under I/O ;)
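(Back-of-the-envelope from the attached post-modprobe slabinfo: 22,274
deadline_drq objects at 64 bytes plus 22,273 blkdev_requests at 156 bytes
is roughly 1.4 MB + 3.5 MB, i.e. just under 5 MB of low memory.)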
We need to make the request structures dynamically allocated for other
reasons (which I cannot immediately remember), but it didn't happen. I guess
we have some motivation now.
On Saturday 22 March 2003 04:05 am, Andrew Morton wrote:
> OK, thanks. So with 48 disks you've lost five megabytes to blkdev_requests
> and deadline_drq objects. With 4000 disks, you're toast. That's enough
> request structures to put 200 gigabytes of memory under I/O ;)
>
> We need to make the request structures dynamically allocated for other
> reasons (which I cannot immediately remember) but it didn't happen. I
> guess we have some motivation now.
Here is the list of slab caches which consumed more than 1 MB
in the process of inserting 4000 disks.
#insmod scsi_debug.ko add_host=4 num_devs=1000
deadline_drq before:1280 after:1025420 diff:1024140 size:64 incr:65544960
blkdev_requests before:1280 after:1025400 diff:1024120 size:156 incr:159762720
* deadline_drq, blkdev_requests consumed almost 80 MB. We need to fix this.
inode_cache before:700 after:140770 diff:140070 size:364 incr:50985480
dentry_cache before:4977 after:145061 diff:140084 size:172 incr:24094448
* inode cache increased by 50 MB, dentry cache 24 MB.
It looks like we cached 140,000 inodes. I wonder why?
size-8192 before:8 after:4010 diff:4002 size:8192 incr:32784384
* 32 MB is for the 4 hosts' RAM disks for scsi_debug (4 * 8 MB)
size-2048 before:112 after:4102 diff:3990 size:2060 incr:8219400
size-512 before:87 after:8085 diff:7998 size:524 incr:4190952
size-192 before:907 after:16910 diff:16003 size:204 incr:3264612
size-64 before:459 after:76500 diff:76041 size:76 incr:5779116
size-32 before:523 after:24528 diff:24005 size:44 incr:1056220
* 30 MB for all other structures. I need to look more closely at what these are.
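Going back to the deadline_drq/blkdev_requests numbers: the growth of
roughly 1,024,000 objects in each cache lines up with the current static
allocation scheme - queue_nr_requests is capped at 128 and every queue
pre-allocates queue_nr_requests*2 = 256 requests (plus one deadline_rq
each), so 4000 new queues give about 4000 * 256 = 1,024,000 of each,
whether or not the disks ever see any IO.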
Thanks,
Badari
Badari Pulavarty <[email protected]> wrote:
>
> Here is the list of slab caches which consumed more than 1 MB
> in the process of inserting 4000 disks.
Thanks for doing this.
> #insmod scsi_debug.ko add_host=4 num_devs=1000
>
> deadline_drq before:1280 after:1025420 diff:1024140 size:64 incr:65544960
> blkdev_requests before:1280 after:1025400 diff:1024120 size:156 incr:159762720
>
> * deadline_drq, blkdev_requests consumed almost 80 MB. We need to fix this.
Yes, we do. But.
> inode_cache before:700 after:140770 diff:140070 size:364 incr:50985480
> dentry_cache before:4977 after:145061 diff:140084 size:172 incr:24094448
>
> * inode cache increased by 50 MB, dentry cache 24 MB.
> It looks like we cached 140,000 inodes. I wonder why ?
# find /sys/block/hda | wc -l
43
Oh shit, we're toast.
How many partitions did these "disks" have?
> size-2048 before:112 after:4102 diff:3990 size:2060 incr:8219400
> size-512 before:87 after:8085 diff:7998 size:524 incr:4190952
> size-192 before:907 after:16910 diff:16003 size:204 incr:3264612
> size-64 before:459 after:76500 diff:76041 size:76 incr:5779116
> size-32 before:523 after:24528 diff:24005 size:44 incr:1056220
>
> * 30 MB for all other structures. I need to look more closely at what these are.
Yes, that will take some work. What I would do is to change kmalloc:
+ int foo;
kmalloc(...)
{
+ if (foo)
+ dump_stack();
and then set `foo' in gdb across the registration of one disk.
Probably you can set foo by hand in scsi_register_device() or wherever it
happens.
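For anyone who wants to see what that sort of trace looks like before
hacking mm/slab.c, here is a minimal user-space analogue of the same trick
(a flag-guarded stack dump inside an allocation wrapper). The names are
made up for illustration; the kernel version would simply call dump_stack()
inside kmalloc() as sketched above.

/* Illustrative user-space analogue only - not the kernel change itself. */
#include <execinfo.h>
#include <stdlib.h>

static int trace_allocs;	/* the "foo" flag: flip it (e.g. from gdb)
				   around the interesting registration call */

static void *traced_malloc(size_t size)
{
	if (trace_allocs) {
		void *frames[16];
		int n = backtrace(frames, 16);
		backtrace_symbols_fd(frames, n, 2);	/* dump to stderr */
	}
	return malloc(size);
}

int main(void)
{
	void *a = traced_malloc(64);	/* silent - flag not set */

	trace_allocs = 1;
	void *b = traced_malloc(128);	/* prints a backtrace, like dump_stack() */
	trace_allocs = 0;

	free(a);
	free(b);
	return 0;
}

(Build with -rdynamic so the backtrace shows symbol names.)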
Badari Pulavarty wrote:
> On Saturday 22 March 2003 04:05 am, Andrew Morton wrote:
>
>>OK, thanks. So with 48 disks you've lost five megabytes to blkdev_requests
>>and deadline_drq objects. With 4000 disks, you're toast. That's enough
>>request structures to put 200 gigabytes of memory under I/O ;)
>>
>>We need to make the request structures dynamically allocated for other
>>reasons (which I cannot immediately remember) but it didn't happen. I
>>guess we have some motivation now.
>
>
> Here is the list of slab caches which consumed more than 1 MB
> in the process of inserting 4000 disks.
>
> #insmod scsi_debug.ko add_host=4 num_devs=1000
>
> deadline_drq before:1280 after:1025420 diff:1024140 size:64 incr:65544960
> blkdev_requests before:1280 after:1025400 diff:1024120 size:156 incr:159762720
>
> * deadline_drq, blkdev_requests consumed almost 80 MB. We need to fix this.
>
> inode_cache before:700 after:140770 diff:140070 size:364 incr:50985480
> dentry_cache before:4977 after:145061 diff:140084 size:172 incr:24094448
>
> * inode cache increased by 50 MB, dentry cache 24 MB.
> It looks like we cached 140,000 inodes. I wonder why ?
> <snip/>
Badari,
What number do you get from
# cd /sys; du -a | wc
when you have 4000 disks?
With two disks the count on my system is 528.
Also scsi_debug should use only 8 MB (default) for
simulated storage shared between all pseudo disks
(i.e. not 8 MB per simulated host).
Doug Gilbert
On Monday 24 March 2003 02:22 pm, Douglas Gilbert wrote:
> Badari Pulavarty wrote:
> > On Saturday 22 March 2003 04:05 am, Andrew Morton wrote:
> >>OK, thanks. So with 48 disks you've lost five megabytes to
> >> blkdev_requests and deadline_drq objects. With 4000 disks, you're
> >> toast. That's enough request structures to put 200 gigabytes of memory
> >> under I/O ;)
> >>
> >>We need to make the request structures dynamically allocated for other
> >>reasons (which I cannot immediately remember) but it didn't happen. I
> >>guess we have some motivation now.
> >
> > Here is the list of slab caches which consumed more than 1 MB
> > in the process of inserting 4000 disks.
> >
> > #insmod scsi_debug.ko add_host=4 num_devs=1000
> >
> > deadline_drq before:1280 after:1025420 diff:1024140 size:64
> > incr:65544960 blkdev_requests before:1280 after:1025400 diff:1024120
> > size:156 incr:159762720
> >
> > * deadline_drq, blkdev_requests consumed almost 80 MB. We need to fix
> > this.
> >
> > inode_cache before:700 after:140770 diff:140070 size:364
> > incr:50985480 dentry_cache before:4977 after:145061 diff:140084
> > size:172 incr:24094448
> >
> > * inode cache increased by 50 MB, dentry cache 24 MB.
> > It looks like we cached 140,000 inodes. I wonder why ?
> >
> > <snip/>
>
> Badari,
> What number do you get from
> # cd /sys; du -a | wc
> when you have 4000 disks?
[root@elm3b78 sysfs]# du -a | wc -l
140735
Okay. That explains the inodes.
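(That is roughly 140,735 / 4000 = ~35 sysfs entries per disk, after
subtracting the few hundred baseline entries. Only about a dozen of those
live under /sys/block/sdXX - see the find output further down - so
presumably the rest come from the matching scsi device/host directories
elsewhere in sysfs, e.g. under /sys/devices and /sys/bus/scsi.)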
>
> Also scsi_debug should use only 8 MB (default) for
> simulated storage shared between all pseudo disks
> (i.e. not 8 MB per simulated host).
Hmm. When I did
insmod scsi_debug.o add_host=1 num_devs=4000
it used 8MB.
But when I did
insmod scsi_debug.o add_host=4 num_devs=1000
it used 32 MB. So I assumed it is per host.
Before:
size-8192 8 8 8192 8 8 2 : 8 4 : 9 12 10 2 0 0 37 : 3 12 7 0
After:
size-8192 4010 4010 8192 4010 4010 2 : 8 4 : 4010 4014 4012 2 0 0 37 : 4006 4014 4011 0
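(That matches: 4 hosts x 8 MB of simulated storage, apparently carved into
8 KB slab allocations, is 4 * 1024 = 4096 objects - close to the 4002 new
size-8192 entries above.)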
Thanks,
Badari
On Monday 24 March 2003 04:10 pm, Andrew Morton wrote:
> Badari Pulavarty <[email protected]> wrote:
> > Here is the list of slab caches which consumed more than 1 MB
> > in the process of inserting 4000 disks.
>
> Thanks for doing this.
>
> > #insmod scsi_debug.ko add_host=4 num_devs=1000
> >
> > deadline_drq before:1280 after:1025420 diff:1024140 size:64
> > incr:65544960 blkdev_requests before:1280 after:1025400 diff:1024120
> > size:156 incr:159762720
> >
> > * deadline_drq, blkdev_requests consumed almost 80 MB. We need to fix
> > this.
>
> Yes, we do. But.
>
> > inode_cache before:700 after:140770 diff:140070 size:364
> > incr:50985480 dentry_cache before:4977 after:145061 diff:140084
> > size:172 incr:24094448
> >
> > * inode cache increased by 50 MB, dentry cache 24 MB.
> > It looks like we cached 140,000 inodes. I wonder why ?
>
> # find /sys/block/hda | wc -l
> 43
>
> Oh shit, we're toast.
>
> How many partitions did these "disks" have?
These are all scsi_debug disks. No partitions.
Yeah!! I know what you mean; with partitions we
are going to get 5 more inodes per partition.
[root@elm3b78 sdaaa]# find /sysfs/block/sdaa
/sysfs/block/sdaa
/sysfs/block/sdaa/iosched
/sysfs/block/sdaa/iosched/fifo_batch
/sysfs/block/sdaa/iosched/front_merges
/sysfs/block/sdaa/iosched/writes_starved
/sysfs/block/sdaa/iosched/write_expire
/sysfs/block/sdaa/iosched/read_expire
/sysfs/block/sdaa/device
/sysfs/block/sdaa/stat
/sysfs/block/sdaa/size
/sysfs/block/sdaa/range
/sysfs/block/sdaa/dev
Thanks,
Badari
On Sat, Mar 22 2003, Andrew Morton wrote:
> Douglas Gilbert <[email protected]> wrote:
> >
> > Andrew Morton wrote:
> > > Douglas Gilbert <[email protected]> wrote:
> > >
> > >>>Slab: 464364 kB
> > >>
> > >
> > > It's all in slab.
> > >
> > >
> > >>I did notice a rather large growth of nodes
> > >>in sysfs. For 84 added scsi_debug pseudo disks the number
> > >>of sysfs nodes went from 686 to 3347.
> > >>
> > >>Does anybody know what is the per node memory cost of sysfs?
> > >
> > >
> > > Let's see all of /proc/slabinfo please.
> >
> > Andrew,
> > Attachments are /proc/slabinfo pre and post:
> > $ modprobe scsi_debug add_host=42 num_devs=2
> > which adds 84 pseudo disks.
> >
>
> OK, thanks. So with 48 disks you've lost five megabytes to blkdev_requests
> and deadline_drq objects. With 4000 disks, you're toast. That's enough
> request structures to put 200 gigabytes of memory under I/O ;)
>
> We need to make the request structures dynamically allocated for other
> reasons (which I cannot immediately remember) but it didn't happen. I guess
> we have some motivation now.
Here's a patch that makes the request allocation (and io scheduler
private data) dynamic, with lower and upper bounds of 4 and 256
respectively. The numbers are a bit random - the 4 will allow us to make
progress, but it might be a smidgen too low. Perhaps 8 would be good.
256 is twice as much as before, but that should be alright as long as
the io scheduler copes. BLKDEV_MAX_RQ and BLKDEV_MIN_RQ control these
two variables.
We lose the old batching functionality, for now. I can resurrect that
if needed. It's a rough fit with the mempool; it doesn't _quite_ fit our
needs here. I'll probably end up doing a specialised block pool scheme
for this.
Hasn't been tested all that much, it boots though :-)
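For reference, the mempool contract the patch leans on is roughly this
(a sketch of the 2.5-era API as I understand it, error handling omitted):

	mempool_t *pool;
	struct request *rq;

	/* pre-allocate min_nr objects from the slab cache and keep them
	 * as an emergency reserve */
	pool = mempool_create(BLKDEV_MIN_RQ, mempool_alloc_slab,
			      mempool_free_slab, request_cachep);

	/* try the slab cache first, fall back to the reserve; with a
	 * __GFP_WAIT mask (e.g. GFP_NOIO) this sleeps instead of failing,
	 * which is what lets the explicit free lists and wait queues go */
	rq = mempool_alloc(pool, GFP_NOIO);

	/* top the reserve back up if it is short, otherwise hand the
	 * object back to the slab cache */
	mempool_free(rq, pool);

	mempool_destroy(pool);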
drivers/block/deadline-iosched.c | 86 ++++++++---------
drivers/block/elevator.c | 18 +++
drivers/block/ll_rw_blk.c | 197 +++++++++++++--------------------------
include/linux/blkdev.h | 11 +-
include/linux/elevator.h | 7 +
5 files changed, 141 insertions(+), 178 deletions(-)
===== drivers/block/deadline-iosched.c 1.17 vs edited =====
--- 1.17/drivers/block/deadline-iosched.c Fri Mar 14 16:55:04 2003
+++ edited/drivers/block/deadline-iosched.c Tue Mar 25 11:05:37 2003
@@ -71,6 +71,8 @@
int fifo_batch;
int writes_starved;
int front_merges;
+
+ mempool_t *drq_pool;
};
/*
@@ -673,28 +675,11 @@
static void deadline_exit(request_queue_t *q, elevator_t *e)
{
struct deadline_data *dd = e->elevator_data;
- struct deadline_rq *drq;
- struct request *rq;
- int i;
BUG_ON(!list_empty(&dd->fifo_list[READ]));
BUG_ON(!list_empty(&dd->fifo_list[WRITE]));
- for (i = READ; i <= WRITE; i++) {
- struct request_list *rl = &q->rq[i];
- struct list_head *entry;
-
- list_for_each(entry, &rl->free) {
- rq = list_entry_rq(entry);
-
- if ((drq = RQ_DATA(rq)) == NULL)
- continue;
-
- rq->elevator_private = NULL;
- kmem_cache_free(drq_pool, drq);
- }
- }
-
+ mempool_destroy(dd->drq_pool);
kfree(dd->hash);
kfree(dd);
}
@@ -706,9 +691,7 @@
static int deadline_init(request_queue_t *q, elevator_t *e)
{
struct deadline_data *dd;
- struct deadline_rq *drq;
- struct request *rq;
- int i, ret = 0;
+ int i;
if (!drq_pool)
return -ENOMEM;
@@ -724,6 +707,13 @@
return -ENOMEM;
}
+ dd->drq_pool = mempool_create(BLKDEV_MIN_RQ, mempool_alloc_slab, mempool_free_slab, drq_pool);
+ if (!dd->drq_pool) {
+ kfree(dd->hash);
+ kfree(dd);
+ return -ENOMEM;
+ }
+
for (i = 0; i < DL_HASH_ENTRIES; i++)
INIT_LIST_HEAD(&dd->hash[i]);
@@ -739,33 +729,41 @@
dd->front_merges = 1;
dd->fifo_batch = fifo_batch;
e->elevator_data = dd;
+ return 0;
+}
- for (i = READ; i <= WRITE; i++) {
- struct request_list *rl = &q->rq[i];
- struct list_head *entry;
-
- list_for_each(entry, &rl->free) {
- rq = list_entry_rq(entry);
-
- drq = kmem_cache_alloc(drq_pool, GFP_KERNEL);
- if (!drq) {
- ret = -ENOMEM;
- break;
- }
+static void deadline_put_request(request_queue_t *q, struct request *rq)
+{
+ struct deadline_data *dd = q->elevator.elevator_data;
+ struct deadline_rq *drq = RQ_DATA(rq);
- memset(drq, 0, sizeof(*drq));
- INIT_LIST_HEAD(&drq->fifo);
- INIT_LIST_HEAD(&drq->hash);
- RB_CLEAR(&drq->rb_node);
- drq->request = rq;
- rq->elevator_private = drq;
- }
+ if (drq) {
+ mempool_free(drq, dd->drq_pool);
+ rq->elevator_private = NULL;
}
+}
+
+static int
+deadline_set_request(request_queue_t *q, struct request *rq, int gfp_mask)
+{
+ struct deadline_data *dd = q->elevator.elevator_data;
+ struct deadline_rq *drq;
- if (ret)
- deadline_exit(q, e);
+ drq = mempool_alloc(dd->drq_pool, gfp_mask);
+ if (drq) {
+ RB_CLEAR(&drq->rb_node);
+ drq->request = rq;
- return ret;
+ INIT_LIST_HEAD(&drq->hash);
+ drq->hash_valid_count = 0;
+
+ INIT_LIST_HEAD(&drq->fifo);
+
+ rq->elevator_private = drq;
+ return 0;
+ }
+
+ return 1;
}
/*
@@ -916,6 +914,8 @@
.elevator_queue_empty_fn = deadline_queue_empty,
.elevator_former_req_fn = deadline_former_request,
.elevator_latter_req_fn = deadline_latter_request,
+ .elevator_set_req_fn = deadline_set_request,
+ .elevator_put_req_fn = deadline_put_request,
.elevator_init_fn = deadline_init,
.elevator_exit_fn = deadline_exit,
===== drivers/block/elevator.c 1.40 vs edited =====
--- 1.40/drivers/block/elevator.c Sun Feb 16 12:32:35 2003
+++ edited/drivers/block/elevator.c Tue Mar 25 11:04:01 2003
@@ -408,6 +408,24 @@
return NULL;
}
+int elv_set_request(request_queue_t *q, struct request *rq, int gfp_mask)
+{
+ elevator_t *e = &q->elevator;
+
+ if (e->elevator_set_req_fn)
+ return e->elevator_set_req_fn(q, rq, gfp_mask);
+
+ return 0;
+}
+
+void elv_put_request(request_queue_t *q, struct request *rq)
+{
+ elevator_t *e = &q->elevator;
+
+ if (e->elevator_put_req_fn)
+ e->elevator_put_req_fn(q, rq);
+}
+
int elv_register_queue(struct gendisk *disk)
{
request_queue_t *q = disk->queue;
===== drivers/block/ll_rw_blk.c 1.161 vs edited =====
--- 1.161/drivers/block/ll_rw_blk.c Mon Mar 24 02:56:03 2003
+++ edited/drivers/block/ll_rw_blk.c Tue Mar 25 11:53:37 2003
@@ -48,12 +48,6 @@
*/
static int queue_nr_requests;
-/*
- * How many free requests must be available before we wake a process which
- * is waiting for a request?
- */
-static int batch_requests;
-
unsigned long blk_max_low_pfn, blk_max_pfn;
int blk_nohighio = 0;
@@ -1118,26 +1112,6 @@
spin_unlock_irq(&blk_plug_lock);
}
-static int __blk_cleanup_queue(struct request_list *list)
-{
- struct list_head *head = &list->free;
- struct request *rq;
- int i = 0;
-
- while (!list_empty(head)) {
- rq = list_entry(head->next, struct request, queuelist);
- list_del_init(&rq->queuelist);
- kmem_cache_free(request_cachep, rq);
- i++;
- }
-
- if (i != list->count)
- printk("request list leak!\n");
-
- list->count = 0;
- return i;
-}
-
/**
* blk_cleanup_queue: - release a &request_queue_t when it is no longer needed
* @q: the request queue to be released
@@ -1154,18 +1128,14 @@
**/
void blk_cleanup_queue(request_queue_t * q)
{
- int count = (queue_nr_requests*2);
+ struct request_list *rl = &q->rq;
elevator_exit(q);
- count -= __blk_cleanup_queue(&q->rq[READ]);
- count -= __blk_cleanup_queue(&q->rq[WRITE]);
-
del_timer_sync(&q->unplug_timer);
flush_scheduled_work();
- if (count)
- printk("blk_cleanup_queue: leaked requests (%d)\n", count);
+ mempool_destroy(rl->rq_pool);
if (blk_queue_tagged(q))
blk_queue_free_tags(q);
@@ -1175,42 +1145,17 @@
static int blk_init_free_list(request_queue_t *q)
{
- struct request_list *rl;
- struct request *rq;
- int i;
+ struct request_list *rl = &q->rq;
- INIT_LIST_HEAD(&q->rq[READ].free);
- INIT_LIST_HEAD(&q->rq[WRITE].free);
- q->rq[READ].count = 0;
- q->rq[WRITE].count = 0;
+ rl->count[READ] = BLKDEV_MAX_RQ;
+ rl->count[WRITE] = BLKDEV_MAX_RQ;
- /*
- * Divide requests in half between read and write
- */
- rl = &q->rq[READ];
- for (i = 0; i < (queue_nr_requests*2); i++) {
- rq = kmem_cache_alloc(request_cachep, SLAB_KERNEL);
- if (!rq)
- goto nomem;
-
- /*
- * half way through, switch to WRITE list
- */
- if (i == queue_nr_requests)
- rl = &q->rq[WRITE];
+ rl->rq_pool = mempool_create(BLKDEV_MIN_RQ, mempool_alloc_slab, mempool_free_slab, request_cachep);
- memset(rq, 0, sizeof(struct request));
- rq->rq_status = RQ_INACTIVE;
- list_add(&rq->queuelist, &rl->free);
- rl->count++;
- }
+ if (!rl->rq_pool)
+ return -ENOMEM;
- init_waitqueue_head(&q->rq[READ].wait);
- init_waitqueue_head(&q->rq[WRITE].wait);
return 0;
-nomem:
- blk_cleanup_queue(q);
- return 1;
}
static int __make_request(request_queue_t *, struct bio *);
@@ -1277,34 +1222,57 @@
return 0;
}
+static inline void blk_free_request(request_queue_t *q, struct request *rq)
+{
+ elv_put_request(q, rq);
+ mempool_free(rq, q->rq.rq_pool);
+}
+
+static inline struct request *blk_alloc_request(request_queue_t *q,int gfp_mask)
+{
+ struct request *rq = mempool_alloc(q->rq.rq_pool, gfp_mask);
+
+ if (rq) {
+ if (!elv_set_request(q, rq, gfp_mask))
+ return rq;
+
+ mempool_free(rq, q->rq.rq_pool);
+ }
+
+ return NULL;
+}
+
#define blkdev_free_rq(list) list_entry((list)->next, struct request, queuelist)
/*
* Get a free request. queue lock must be held and interrupts
* disabled on the way in.
*/
-static struct request *get_request(request_queue_t *q, int rw)
+static struct request *get_request(request_queue_t *q, int rw, int gfp_mask)
{
struct request *rq = NULL;
- struct request_list *rl = q->rq + rw;
+ struct request_list *rl = &q->rq;
- if (!list_empty(&rl->free)) {
- rq = blkdev_free_rq(&rl->free);
- list_del_init(&rq->queuelist);
- rq->ref_count = 1;
- rl->count--;
- if (rl->count < queue_congestion_on_threshold())
- set_queue_congested(q, rw);
+ if (!rl->count[rw])
+ return NULL;
+
+ rq = blk_alloc_request(q, gfp_mask);
+ if (rq) {
+ INIT_LIST_HEAD(&rq->queuelist);
rq->flags = 0;
- rq->rq_status = RQ_ACTIVE;
rq->errors = 0;
- rq->special = NULL;
- rq->buffer = NULL;
- rq->data = NULL;
- rq->sense = NULL;
- rq->waiting = NULL;
+ rq->rq_status = RQ_ACTIVE;
+ rl->count[rw]--;
+ if (rl->count[rw] < queue_congestion_on_threshold())
+ set_queue_congested(q, rw);
rq->bio = rq->biotail = NULL;
+ rq->buffer = NULL;
+ rq->ref_count = 1;
rq->q = q;
rq->rl = rl;
+ rq->special = NULL;
+ rq->data = NULL;
+ rq->waiting = NULL;
+ rq->sense = NULL;
}
return rq;
@@ -1316,7 +1284,7 @@
static struct request *get_request_wait(request_queue_t *q, int rw)
{
DEFINE_WAIT(wait);
- struct request_list *rl = &q->rq[rw];
+ struct request_list *rl = &q->rq;
struct request *rq;
spin_lock_prefetch(q->queue_lock);
@@ -1325,20 +1293,15 @@
do {
int block = 0;
- prepare_to_wait_exclusive(&rl->wait, &wait,
- TASK_UNINTERRUPTIBLE);
spin_lock_irq(q->queue_lock);
- if (!rl->count)
+ if (!rl->count[rw])
block = 1;
spin_unlock_irq(q->queue_lock);
if (block)
io_schedule();
- finish_wait(&rl->wait, &wait);
- spin_lock_irq(q->queue_lock);
- rq = get_request(q, rw);
- spin_unlock_irq(q->queue_lock);
+ rq = get_request(q, rw, GFP_NOIO);
} while (rq == NULL);
return rq;
}
@@ -1349,13 +1312,7 @@
BUG_ON(rw != READ && rw != WRITE);
- spin_lock_irq(q->queue_lock);
- rq = get_request(q, rw);
- spin_unlock_irq(q->queue_lock);
-
- if (!rq && (gfp_mask & __GFP_WAIT))
- rq = get_request_wait(q, rw);
-
+ rq = get_request(q, rw, gfp_mask);
if (rq) {
rq->flags = 0;
rq->buffer = NULL;
@@ -1374,7 +1331,7 @@
BUG_ON(rw != READ && rw != WRITE);
- rq = get_request(q, rw);
+ rq = get_request(q, rw, GFP_ATOMIC);
if (rq) {
rq->flags = 0;
@@ -1508,34 +1465,26 @@
if (unlikely(!q))
return;
- req->rq_status = RQ_INACTIVE;
- req->q = NULL;
- req->rl = NULL;
-
/*
* Request may not have originated from ll_rw_blk. if not,
* it didn't come out of our reserved rq pools
*/
if (rl) {
- int rw = 0;
+ int rw = rq_data_dir(req);
BUG_ON(!list_empty(&req->queuelist));
- list_add(&req->queuelist, &rl->free);
+ blk_free_request(q, req);
- if (rl == &q->rq[WRITE])
- rw = WRITE;
- else if (rl == &q->rq[READ])
- rw = READ;
- else
- BUG();
-
- rl->count++;
- if (rl->count >= queue_congestion_off_threshold())
+ rl->count[rw]++;
+ if (rl->count[rw] >= queue_congestion_off_threshold())
clear_queue_congested(q, rw);
- if (rl->count >= batch_requests && waitqueue_active(&rl->wait))
- wake_up(&rl->wait);
}
+
+ req->rq_status = RQ_INACTIVE;
+ req->q = NULL;
+ req->rl = NULL;
+ req->elevator_private = NULL;
}
void blk_put_request(struct request *req)
@@ -1772,7 +1721,7 @@
if (freereq) {
req = freereq;
freereq = NULL;
- } else if ((req = get_request(q, rw)) == NULL) {
+ } else if ((req = get_request(q, rw, GFP_ATOMIC)) == NULL) {
spin_unlock_irq(q->queue_lock);
/*
@@ -1815,8 +1764,8 @@
__blk_put_request(q, freereq);
if (blk_queue_plugged(q)) {
- int nr_queued = (queue_nr_requests - q->rq[0].count) +
- (queue_nr_requests - q->rq[1].count);
+ int nr_queued = (queue_nr_requests - q->rq.count[0]) +
+ (queue_nr_requests - q->rq.count[1]);
if (nr_queued == q->unplug_thresh)
__generic_unplug_device(q);
}
@@ -2183,7 +2132,6 @@
int __init blk_dev_init(void)
{
- int total_ram = nr_free_pages() << (PAGE_SHIFT - 10);
int i;
request_cachep = kmem_cache_create("blkdev_requests",
@@ -2191,24 +2139,11 @@
if (!request_cachep)
panic("Can't create request pool slab cache\n");
- /*
- * Free request slots per queue. One per quarter-megabyte.
- * We use this many requests for reads, and this many for writes.
- */
- queue_nr_requests = (total_ram >> 9) & ~7;
- if (queue_nr_requests < 16)
- queue_nr_requests = 16;
- if (queue_nr_requests > 128)
- queue_nr_requests = 128;
-
- batch_requests = queue_nr_requests / 8;
- if (batch_requests > 8)
- batch_requests = 8;
+ queue_nr_requests = BLKDEV_MAX_RQ;
printk("block request queues:\n");
- printk(" %d requests per read queue\n", queue_nr_requests);
- printk(" %d requests per write queue\n", queue_nr_requests);
- printk(" %d requests per batch\n", batch_requests);
+ printk(" %d/%d requests per read queue\n", BLKDEV_MIN_RQ, queue_nr_requests);
+ printk(" %d/%d requests per write queue\n", BLKDEV_MIN_RQ, queue_nr_requests);
printk(" enter congestion at %d\n", queue_congestion_on_threshold());
printk(" exit congestion at %d\n", queue_congestion_off_threshold());
===== include/linux/blkdev.h 1.98 vs edited =====
--- 1.98/include/linux/blkdev.h Tue Feb 18 11:29:00 2003
+++ edited/include/linux/blkdev.h Tue Mar 25 10:24:29 2003
@@ -10,6 +10,7 @@
#include <linux/pagemap.h>
#include <linux/backing-dev.h>
#include <linux/wait.h>
+#include <linux/mempool.h>
#include <asm/scatterlist.h>
@@ -18,10 +19,12 @@
struct elevator_s;
typedef struct elevator_s elevator_t;
+#define BLKDEV_MIN_RQ 4
+#define BLKDEV_MAX_RQ 256
+
struct request_list {
- unsigned int count;
- struct list_head free;
- wait_queue_head_t wait;
+ unsigned int count[2];
+ mempool_t *rq_pool;
};
/*
@@ -180,7 +183,7 @@
/*
* the queue request freelist, one for reads and one for writes
*/
- struct request_list rq[2];
+ struct request_list rq;
request_fn_proc *request_fn;
merge_request_fn *back_merge_fn;
===== include/linux/elevator.h 1.18 vs edited =====
--- 1.18/include/linux/elevator.h Sun Jan 12 15:10:40 2003
+++ edited/include/linux/elevator.h Tue Mar 25 11:03:42 2003
@@ -15,6 +15,8 @@
typedef void (elevator_remove_req_fn) (request_queue_t *, struct request *);
typedef struct request *(elevator_request_list_fn) (request_queue_t *, struct request *);
typedef struct list_head *(elevator_get_sort_head_fn) (request_queue_t *, struct request *);
+typedef int (elevator_set_req_fn) (request_queue_t *, struct request *, int);
+typedef void (elevator_put_req_fn) (request_queue_t *, struct request *);
typedef int (elevator_init_fn) (request_queue_t *, elevator_t *);
typedef void (elevator_exit_fn) (request_queue_t *, elevator_t *);
@@ -34,6 +36,9 @@
elevator_request_list_fn *elevator_former_req_fn;
elevator_request_list_fn *elevator_latter_req_fn;
+ elevator_set_req_fn *elevator_set_req_fn;
+ elevator_put_req_fn *elevator_put_req_fn;
+
elevator_init_fn *elevator_init_fn;
elevator_exit_fn *elevator_exit_fn;
@@ -58,6 +63,8 @@
extern struct request *elv_latter_request(request_queue_t *, struct request *);
extern int elv_register_queue(struct gendisk *);
extern void elv_unregister_queue(struct gendisk *);
+extern int elv_set_request(request_queue_t *, struct request *, int);
+extern void elv_put_request(request_queue_t *, struct request *);
#define __elv_add_request_pos(q, rq, pos) \
(q)->elevator.elevator_add_req_fn((q), (rq), (pos))
--
Jens Axboe
On Tue, Mar 25 2003, Jens Axboe wrote:
> Here's a patch that makes the request allocation (and io scheduler
> private data) dynamic, with upper and lower bounds of 4 and 256
> respectively. The numbers are a bit random - the 4 will allow us to make
> progress, but it might be a smidgen too low. Perhaps 8 would be good.
> 256 is twice as much as before, but that should be alright as long as
> the io scheduler copes. BLKDEV_MAX_RQ and BLKDEV_MIN_RQ control these
> two variables.
>
> We lose the old batching functionality, for now. I can resurrect that
> if needed. It's a rough fit with the mempool, it doesn't _quite_ fit our
> needs here. I'll probably end up doing a specialised block pool scheme
> for this.
>
> Hasn't been tested all that much, it boots though :-)
Here's a version with better lock handling. We drop the queue_lock for
most parts of __make_request(), except the actual io scheduler calls.
Patch is against 2.5.66-BK as of today.
drivers/block/deadline-iosched.c | 86 ++++++-------
drivers/block/elevator.c | 18 ++
drivers/block/ll_rw_blk.c | 245 ++++++++++++---------------------------
include/linux/blkdev.h | 12 +
include/linux/elevator.h | 7 +
5 files changed, 154 insertions(+), 214 deletions(-)
===== drivers/block/deadline-iosched.c 1.17 vs edited =====
--- 1.17/drivers/block/deadline-iosched.c Fri Mar 14 16:55:04 2003
+++ edited/drivers/block/deadline-iosched.c Tue Mar 25 11:05:37 2003
@@ -71,6 +71,8 @@
int fifo_batch;
int writes_starved;
int front_merges;
+
+ mempool_t *drq_pool;
};
/*
@@ -673,28 +675,11 @@
static void deadline_exit(request_queue_t *q, elevator_t *e)
{
struct deadline_data *dd = e->elevator_data;
- struct deadline_rq *drq;
- struct request *rq;
- int i;
BUG_ON(!list_empty(&dd->fifo_list[READ]));
BUG_ON(!list_empty(&dd->fifo_list[WRITE]));
- for (i = READ; i <= WRITE; i++) {
- struct request_list *rl = &q->rq[i];
- struct list_head *entry;
-
- list_for_each(entry, &rl->free) {
- rq = list_entry_rq(entry);
-
- if ((drq = RQ_DATA(rq)) == NULL)
- continue;
-
- rq->elevator_private = NULL;
- kmem_cache_free(drq_pool, drq);
- }
- }
-
+ mempool_destroy(dd->drq_pool);
kfree(dd->hash);
kfree(dd);
}
@@ -706,9 +691,7 @@
static int deadline_init(request_queue_t *q, elevator_t *e)
{
struct deadline_data *dd;
- struct deadline_rq *drq;
- struct request *rq;
- int i, ret = 0;
+ int i;
if (!drq_pool)
return -ENOMEM;
@@ -724,6 +707,13 @@
return -ENOMEM;
}
+ dd->drq_pool = mempool_create(BLKDEV_MIN_RQ, mempool_alloc_slab, mempool_free_slab, drq_pool);
+ if (!dd->drq_pool) {
+ kfree(dd->hash);
+ kfree(dd);
+ return -ENOMEM;
+ }
+
for (i = 0; i < DL_HASH_ENTRIES; i++)
INIT_LIST_HEAD(&dd->hash[i]);
@@ -739,33 +729,41 @@
dd->front_merges = 1;
dd->fifo_batch = fifo_batch;
e->elevator_data = dd;
+ return 0;
+}
- for (i = READ; i <= WRITE; i++) {
- struct request_list *rl = &q->rq[i];
- struct list_head *entry;
-
- list_for_each(entry, &rl->free) {
- rq = list_entry_rq(entry);
-
- drq = kmem_cache_alloc(drq_pool, GFP_KERNEL);
- if (!drq) {
- ret = -ENOMEM;
- break;
- }
+static void deadline_put_request(request_queue_t *q, struct request *rq)
+{
+ struct deadline_data *dd = q->elevator.elevator_data;
+ struct deadline_rq *drq = RQ_DATA(rq);
- memset(drq, 0, sizeof(*drq));
- INIT_LIST_HEAD(&drq->fifo);
- INIT_LIST_HEAD(&drq->hash);
- RB_CLEAR(&drq->rb_node);
- drq->request = rq;
- rq->elevator_private = drq;
- }
+ if (drq) {
+ mempool_free(drq, dd->drq_pool);
+ rq->elevator_private = NULL;
}
+}
+
+static int
+deadline_set_request(request_queue_t *q, struct request *rq, int gfp_mask)
+{
+ struct deadline_data *dd = q->elevator.elevator_data;
+ struct deadline_rq *drq;
- if (ret)
- deadline_exit(q, e);
+ drq = mempool_alloc(dd->drq_pool, gfp_mask);
+ if (drq) {
+ RB_CLEAR(&drq->rb_node);
+ drq->request = rq;
- return ret;
+ INIT_LIST_HEAD(&drq->hash);
+ drq->hash_valid_count = 0;
+
+ INIT_LIST_HEAD(&drq->fifo);
+
+ rq->elevator_private = drq;
+ return 0;
+ }
+
+ return 1;
}
/*
@@ -916,6 +914,8 @@
.elevator_queue_empty_fn = deadline_queue_empty,
.elevator_former_req_fn = deadline_former_request,
.elevator_latter_req_fn = deadline_latter_request,
+ .elevator_set_req_fn = deadline_set_request,
+ .elevator_put_req_fn = deadline_put_request,
.elevator_init_fn = deadline_init,
.elevator_exit_fn = deadline_exit,
===== drivers/block/elevator.c 1.40 vs edited =====
--- 1.40/drivers/block/elevator.c Sun Feb 16 12:32:35 2003
+++ edited/drivers/block/elevator.c Tue Mar 25 11:04:01 2003
@@ -408,6 +408,24 @@
return NULL;
}
+int elv_set_request(request_queue_t *q, struct request *rq, int gfp_mask)
+{
+ elevator_t *e = &q->elevator;
+
+ if (e->elevator_set_req_fn)
+ return e->elevator_set_req_fn(q, rq, gfp_mask);
+
+ return 0;
+}
+
+void elv_put_request(request_queue_t *q, struct request *rq)
+{
+ elevator_t *e = &q->elevator;
+
+ if (e->elevator_put_req_fn)
+ e->elevator_put_req_fn(q, rq);
+}
+
int elv_register_queue(struct gendisk *disk)
{
request_queue_t *q = disk->queue;
===== drivers/block/ll_rw_blk.c 1.161 vs edited =====
--- 1.161/drivers/block/ll_rw_blk.c Mon Mar 24 02:56:03 2003
+++ edited/drivers/block/ll_rw_blk.c Tue Mar 25 12:20:16 2003
@@ -48,12 +48,6 @@
*/
static int queue_nr_requests;
-/*
- * How many free requests must be available before we wake a process which
- * is waiting for a request?
- */
-static int batch_requests;
-
unsigned long blk_max_low_pfn, blk_max_pfn;
int blk_nohighio = 0;
@@ -1118,26 +1112,6 @@
spin_unlock_irq(&blk_plug_lock);
}
-static int __blk_cleanup_queue(struct request_list *list)
-{
- struct list_head *head = &list->free;
- struct request *rq;
- int i = 0;
-
- while (!list_empty(head)) {
- rq = list_entry(head->next, struct request, queuelist);
- list_del_init(&rq->queuelist);
- kmem_cache_free(request_cachep, rq);
- i++;
- }
-
- if (i != list->count)
- printk("request list leak!\n");
-
- list->count = 0;
- return i;
-}
-
/**
* blk_cleanup_queue: - release a &request_queue_t when it is no longer needed
* @q: the request queue to be released
@@ -1154,18 +1128,14 @@
**/
void blk_cleanup_queue(request_queue_t * q)
{
- int count = (queue_nr_requests*2);
+ struct request_list *rl = &q->rq;
elevator_exit(q);
- count -= __blk_cleanup_queue(&q->rq[READ]);
- count -= __blk_cleanup_queue(&q->rq[WRITE]);
-
del_timer_sync(&q->unplug_timer);
flush_scheduled_work();
- if (count)
- printk("blk_cleanup_queue: leaked requests (%d)\n", count);
+ mempool_destroy(rl->rq_pool);
if (blk_queue_tagged(q))
blk_queue_free_tags(q);
@@ -1175,42 +1145,17 @@
static int blk_init_free_list(request_queue_t *q)
{
- struct request_list *rl;
- struct request *rq;
- int i;
+ struct request_list *rl = &q->rq;
- INIT_LIST_HEAD(&q->rq[READ].free);
- INIT_LIST_HEAD(&q->rq[WRITE].free);
- q->rq[READ].count = 0;
- q->rq[WRITE].count = 0;
+ rl->count[READ] = BLKDEV_MAX_RQ;
+ rl->count[WRITE] = BLKDEV_MAX_RQ;
- /*
- * Divide requests in half between read and write
- */
- rl = &q->rq[READ];
- for (i = 0; i < (queue_nr_requests*2); i++) {
- rq = kmem_cache_alloc(request_cachep, SLAB_KERNEL);
- if (!rq)
- goto nomem;
-
- /*
- * half way through, switch to WRITE list
- */
- if (i == queue_nr_requests)
- rl = &q->rq[WRITE];
+ rl->rq_pool = mempool_create(BLKDEV_MIN_RQ, mempool_alloc_slab, mempool_free_slab, request_cachep);
- memset(rq, 0, sizeof(struct request));
- rq->rq_status = RQ_INACTIVE;
- list_add(&rq->queuelist, &rl->free);
- rl->count++;
- }
+ if (!rl->rq_pool)
+ return -ENOMEM;
- init_waitqueue_head(&q->rq[READ].wait);
- init_waitqueue_head(&q->rq[WRITE].wait);
return 0;
-nomem:
- blk_cleanup_queue(q);
- return 1;
}
static int __make_request(request_queue_t *, struct bio *);
@@ -1277,34 +1222,62 @@
return 0;
}
+static inline void blk_free_request(request_queue_t *q, struct request *rq)
+{
+ elv_put_request(q, rq);
+ mempool_free(rq, q->rq.rq_pool);
+}
+
+static inline struct request *blk_alloc_request(request_queue_t *q,int gfp_mask)
+{
+ struct request *rq = mempool_alloc(q->rq.rq_pool, gfp_mask);
+
+ if (rq) {
+ if (!elv_set_request(q, rq, gfp_mask))
+ return rq;
+
+ mempool_free(rq, q->rq.rq_pool);
+ }
+
+ return NULL;
+}
+
#define blkdev_free_rq(list) list_entry((list)->next, struct request, queuelist)
/*
- * Get a free request. queue lock must be held and interrupts
- * disabled on the way in.
+ * Get a free request, queue_lock must not be held
*/
-static struct request *get_request(request_queue_t *q, int rw)
+static struct request *get_request(request_queue_t *q, int rw, int gfp_mask)
{
struct request *rq = NULL;
- struct request_list *rl = q->rq + rw;
+ struct request_list *rl = &q->rq;
- if (!list_empty(&rl->free)) {
- rq = blkdev_free_rq(&rl->free);
- list_del_init(&rq->queuelist);
- rq->ref_count = 1;
- rl->count--;
- if (rl->count < queue_congestion_on_threshold())
+ /*
+ * known racy, ok though
+ */
+ if (!rl->count[rw])
+ return NULL;
+
+ rq = blk_alloc_request(q, gfp_mask);
+ if (rq) {
+ spin_lock_irq(q->queue_lock);
+ rl->count[rw]--;
+ if (rl->count[rw] < queue_congestion_on_threshold())
set_queue_congested(q, rw);
+ spin_unlock_irq(q->queue_lock);
+
+ INIT_LIST_HEAD(&rq->queuelist);
rq->flags = 0;
- rq->rq_status = RQ_ACTIVE;
rq->errors = 0;
- rq->special = NULL;
- rq->buffer = NULL;
- rq->data = NULL;
- rq->sense = NULL;
- rq->waiting = NULL;
+ rq->rq_status = RQ_ACTIVE;
rq->bio = rq->biotail = NULL;
+ rq->buffer = NULL;
+ rq->ref_count = 1;
rq->q = q;
rq->rl = rl;
+ rq->special = NULL;
+ rq->data = NULL;
+ rq->waiting = NULL;
+ rq->sense = NULL;
}
return rq;
@@ -1316,7 +1289,7 @@
static struct request *get_request_wait(request_queue_t *q, int rw)
{
DEFINE_WAIT(wait);
- struct request_list *rl = &q->rq[rw];
+ struct request_list *rl = &q->rq;
struct request *rq;
spin_lock_prefetch(q->queue_lock);
@@ -1325,64 +1298,24 @@
do {
int block = 0;
- prepare_to_wait_exclusive(&rl->wait, &wait,
- TASK_UNINTERRUPTIBLE);
spin_lock_irq(q->queue_lock);
- if (!rl->count)
+ if (!rl->count[rw])
block = 1;
spin_unlock_irq(q->queue_lock);
if (block)
io_schedule();
- finish_wait(&rl->wait, &wait);
- spin_lock_irq(q->queue_lock);
- rq = get_request(q, rw);
- spin_unlock_irq(q->queue_lock);
+ rq = get_request(q, rw, GFP_NOIO);
} while (rq == NULL);
return rq;
}
struct request *blk_get_request(request_queue_t *q, int rw, int gfp_mask)
{
- struct request *rq;
-
- BUG_ON(rw != READ && rw != WRITE);
-
- spin_lock_irq(q->queue_lock);
- rq = get_request(q, rw);
- spin_unlock_irq(q->queue_lock);
-
- if (!rq && (gfp_mask & __GFP_WAIT))
- rq = get_request_wait(q, rw);
-
- if (rq) {
- rq->flags = 0;
- rq->buffer = NULL;
- rq->bio = rq->biotail = NULL;
- rq->waiting = NULL;
- }
- return rq;
-}
-
-/*
- * Non-locking blk_get_request variant, for special requests from drivers.
- */
-struct request *__blk_get_request(request_queue_t *q, int rw)
-{
- struct request *rq;
-
BUG_ON(rw != READ && rw != WRITE);
- rq = get_request(q, rw);
-
- if (rq) {
- rq->flags = 0;
- rq->buffer = NULL;
- rq->bio = rq->biotail = NULL;
- rq->waiting = NULL;
- }
- return rq;
+ return get_request(q, rw, gfp_mask);
}
/**
@@ -1499,6 +1432,9 @@
disk->stamp_idle = now;
}
+/*
+ * queue lock must be held
+ */
void __blk_put_request(request_queue_t *q, struct request *req)
{
struct request_list *rl = req->rl;
@@ -1508,34 +1444,26 @@
if (unlikely(!q))
return;
- req->rq_status = RQ_INACTIVE;
- req->q = NULL;
- req->rl = NULL;
-
/*
* Request may not have originated from ll_rw_blk. if not,
* it didn't come out of our reserved rq pools
*/
if (rl) {
- int rw = 0;
+ int rw = rq_data_dir(req);
BUG_ON(!list_empty(&req->queuelist));
- list_add(&req->queuelist, &rl->free);
+ blk_free_request(q, req);
- if (rl == &q->rq[WRITE])
- rw = WRITE;
- else if (rl == &q->rq[READ])
- rw = READ;
- else
- BUG();
-
- rl->count++;
- if (rl->count >= queue_congestion_off_threshold())
+ rl->count[rw]++;
+ if (rl->count[rw] >= queue_congestion_off_threshold())
clear_queue_congested(q, rw);
- if (rl->count >= batch_requests && waitqueue_active(&rl->wait))
- wake_up(&rl->wait);
}
+
+ req->rq_status = RQ_INACTIVE;
+ req->q = NULL;
+ req->rl = NULL;
+ req->elevator_private = NULL;
}
void blk_put_request(struct request *req)
@@ -1694,9 +1622,9 @@
barrier = test_bit(BIO_RW_BARRIER, &bio->bi_rw);
- spin_lock_irq(q->queue_lock);
again:
insert_here = NULL;
+ spin_lock_irq(q->queue_lock);
if (blk_queue_empty(q)) {
blk_plug_device(q);
@@ -1769,12 +1697,12 @@
* a free slot.
*/
get_rq:
+ spin_unlock_irq(q->queue_lock);
+
if (freereq) {
req = freereq;
freereq = NULL;
- } else if ((req = get_request(q, rw)) == NULL) {
- spin_unlock_irq(q->queue_lock);
-
+ } else if ((req = get_request(q, rw, GFP_NOIO)) == NULL) {
/*
* READA bit set
*/
@@ -1782,7 +1710,6 @@
goto end_io;
freereq = get_request_wait(q, rw);
- spin_lock_irq(q->queue_lock);
goto again;
}
@@ -1809,19 +1736,20 @@
req->bio = req->biotail = bio;
req->rq_disk = bio->bi_bdev->bd_disk;
req->start_time = jiffies;
+
+ spin_lock_irq(q->queue_lock);
add_request(q, req, insert_here);
out:
if (freereq)
__blk_put_request(q, freereq);
if (blk_queue_plugged(q)) {
- int nr_queued = (queue_nr_requests - q->rq[0].count) +
- (queue_nr_requests - q->rq[1].count);
+ int nr_queued = (queue_nr_requests - q->rq.count[0]) +
+ (queue_nr_requests - q->rq.count[1]);
if (nr_queued == q->unplug_thresh)
__generic_unplug_device(q);
}
spin_unlock_irq(q->queue_lock);
-
return 0;
end_io:
@@ -2183,7 +2111,6 @@
int __init blk_dev_init(void)
{
- int total_ram = nr_free_pages() << (PAGE_SHIFT - 10);
int i;
request_cachep = kmem_cache_create("blkdev_requests",
@@ -2191,24 +2118,11 @@
if (!request_cachep)
panic("Can't create request pool slab cache\n");
- /*
- * Free request slots per queue. One per quarter-megabyte.
- * We use this many requests for reads, and this many for writes.
- */
- queue_nr_requests = (total_ram >> 9) & ~7;
- if (queue_nr_requests < 16)
- queue_nr_requests = 16;
- if (queue_nr_requests > 128)
- queue_nr_requests = 128;
-
- batch_requests = queue_nr_requests / 8;
- if (batch_requests > 8)
- batch_requests = 8;
+ queue_nr_requests = BLKDEV_MAX_RQ;
printk("block request queues:\n");
- printk(" %d requests per read queue\n", queue_nr_requests);
- printk(" %d requests per write queue\n", queue_nr_requests);
- printk(" %d requests per batch\n", batch_requests);
+ printk(" %d/%d requests per read queue\n", BLKDEV_MIN_RQ, queue_nr_requests);
+ printk(" %d/%d requests per write queue\n", BLKDEV_MIN_RQ, queue_nr_requests);
printk(" enter congestion at %d\n", queue_congestion_on_threshold());
printk(" exit congestion at %d\n", queue_congestion_off_threshold());
@@ -2250,7 +2164,6 @@
EXPORT_SYMBOL(blk_phys_contig_segment);
EXPORT_SYMBOL(blk_hw_contig_segment);
EXPORT_SYMBOL(blk_get_request);
-EXPORT_SYMBOL(__blk_get_request);
EXPORT_SYMBOL(blk_put_request);
EXPORT_SYMBOL(blk_insert_request);
===== include/linux/blkdev.h 1.98 vs edited =====
--- 1.98/include/linux/blkdev.h Tue Feb 18 11:29:00 2003
+++ edited/include/linux/blkdev.h Tue Mar 25 12:12:20 2003
@@ -10,6 +10,7 @@
#include <linux/pagemap.h>
#include <linux/backing-dev.h>
#include <linux/wait.h>
+#include <linux/mempool.h>
#include <asm/scatterlist.h>
@@ -18,10 +19,12 @@
struct elevator_s;
typedef struct elevator_s elevator_t;
+#define BLKDEV_MIN_RQ 4
+#define BLKDEV_MAX_RQ 256
+
struct request_list {
- unsigned int count;
- struct list_head free;
- wait_queue_head_t wait;
+ int count[2];
+ mempool_t *rq_pool;
};
/*
@@ -180,7 +183,7 @@
/*
* the queue request freelist, one for reads and one for writes
*/
- struct request_list rq[2];
+ struct request_list rq;
request_fn_proc *request_fn;
merge_request_fn *back_merge_fn;
@@ -330,7 +333,6 @@
extern void blk_attempt_remerge(request_queue_t *, struct request *);
extern void __blk_attempt_remerge(request_queue_t *, struct request *);
extern struct request *blk_get_request(request_queue_t *, int, int);
-extern struct request *__blk_get_request(request_queue_t *, int);
extern void blk_put_request(struct request *);
extern void blk_insert_request(request_queue_t *, struct request *, int, void *);
extern void blk_plug_device(request_queue_t *);
===== include/linux/elevator.h 1.18 vs edited =====
--- 1.18/include/linux/elevator.h Sun Jan 12 15:10:40 2003
+++ edited/include/linux/elevator.h Tue Mar 25 11:03:42 2003
@@ -15,6 +15,8 @@
typedef void (elevator_remove_req_fn) (request_queue_t *, struct request *);
typedef struct request *(elevator_request_list_fn) (request_queue_t *, struct request *);
typedef struct list_head *(elevator_get_sort_head_fn) (request_queue_t *, struct request *);
+typedef int (elevator_set_req_fn) (request_queue_t *, struct request *, int);
+typedef void (elevator_put_req_fn) (request_queue_t *, struct request *);
typedef int (elevator_init_fn) (request_queue_t *, elevator_t *);
typedef void (elevator_exit_fn) (request_queue_t *, elevator_t *);
@@ -34,6 +36,9 @@
elevator_request_list_fn *elevator_former_req_fn;
elevator_request_list_fn *elevator_latter_req_fn;
+ elevator_set_req_fn *elevator_set_req_fn;
+ elevator_put_req_fn *elevator_put_req_fn;
+
elevator_init_fn *elevator_init_fn;
elevator_exit_fn *elevator_exit_fn;
@@ -58,6 +63,8 @@
extern struct request *elv_latter_request(request_queue_t *, struct request *);
extern int elv_register_queue(struct gendisk *);
extern void elv_unregister_queue(struct gendisk *);
+extern int elv_set_request(request_queue_t *, struct request *, int);
+extern void elv_put_request(request_queue_t *, struct request *);
#define __elv_add_request_pos(q, rq, pos) \
(q)->elevator.elevator_add_req_fn((q), (rq), (pos))
--
Jens Axboe
Jens Axboe wrote:
>On Sat, Mar 22 2003, Andrew Morton wrote:
>
>>Douglas Gilbert <[email protected]> wrote:
>>
>>>Andrew Morton wrote:
>>>
>>>>Douglas Gilbert <[email protected]> wrote:
>>>>
>>>>
>>>>>>Slab: 464364 kB
>>>>>>
>>>>It's all in slab.
>>>>
>>>>
>>>>
>>>>>I did notice a rather large growth of nodes
>>>>>in sysfs. For 84 added scsi_debug pseudo disks the number
>>>>>of sysfs nodes went from 686 to 3347.
>>>>>
>>>>>Does anybody know what is the per node memory cost of sysfs?
>>>>>
>>>>
>>>>Let's see all of /proc/slabinfo please.
>>>>
>>>Andrew,
>>>Attachments are /proc/slabinfo pre and post:
>>> $ modprobe scsi_debug add_host=42 num_devs=2
>>>which adds 84 pseudo disks.
>>>
>>>
>>OK, thanks. So with 48 disks you've lost five megabytes to blkdev_requests
>>and deadline_drq objects. With 4000 disks, you're toast. That's enough
>>request structures to put 200 gigabytes of memory under I/O ;)
>>
>>We need to make the request structures dynamically allocated for other
>>reasons (which I cannot immediately remember) but it didn't happen. I guess
>>we have some motivation now.
>>
>
>Here's a patch that makes the request allocation (and io scheduler
>private data) dynamic, with upper and lower bounds of 4 and 256
>respectively. The numbers are a bit random - the 4 will allow us to make
>progress, but it might be a smidgen too low. Perhaps 8 would be good.
>256 is twice as much as before, but that should be alright as long as
>the io scheduler copes. BLKDEV_MAX_RQ and BLKDEV_MIN_RQ control these
>two variables.
>
>We lose the old batching functionality, for now. I can resurrect that
>if needed. It's a rough fit with the mempool, it doesn't _quite_ fit our
>needs here. I'll probably end up doing a specialised block pool scheme
>for this.
>
>Hasn't been tested all that much, it boots though :-)
>
Nice Jens. Very good in theory but I haven't looked at the
code too much yet.
Would it be possible to have all queues allocate out of
the one global pool of free requests? This way you could
have a big minimum (say 128) and a big maximum
(say min(Mbytes, spindles)).
This way memory usage is decoupled from the number of
queues, and busy spindles could make use of more
available free requests.
Oh and the max value can easily be runtime tunable, right?
On Tue, Mar 25 2003, Jens Axboe wrote:
> On Tue, Mar 25 2003, Jens Axboe wrote:
> > Here's a patch that makes the request allocation (and io scheduler
> > private data) dynamic, with upper and lower bounds of 4 and 256
> > respectively. The numbers are a bit random - the 4 will allow us to make
> > progress, but it might be a smidgen too low. Perhaps 8 would be good.
> > 256 is twice as much as before, but that should be alright as long as
> > the io scheduler copes. BLKDEV_MAX_RQ and BLKDEV_MIN_RQ control these
> > two variables.
> >
> > We lose the old batching functionality, for now. I can resurrect that
> > if needed. It's a rough fit with the mempool; it doesn't _quite_ fit our
> > needs here. I'll probably end up doing a specialised block pool scheme
> > for this.
> >
> > Hasn't been tested all that much, it boots though :-)
>
> Here's a version with better lock handling. We drop the queue_lock for
> most parts of __make_request(), except the actual io scheduler calls.
That was buggy, fixing... Back later.
--
Jens Axboe
On Tue, Mar 25 2003, Nick Piggin wrote:
>
>
> Jens Axboe wrote:
>
> >On Sat, Mar 22 2003, Andrew Morton wrote:
> >
> >>Douglas Gilbert <[email protected]> wrote:
> >>
> >>>Andrew Morton wrote:
> >>>
> >>>>Douglas Gilbert <[email protected]> wrote:
> >>>>
> >>>>
> >>>>>>Slab: 464364 kB
> >>>>>>
> >>>>It's all in slab.
> >>>>
> >>>>
> >>>>
> >>>>>I did notice a rather large growth of nodes
> >>>>>in sysfs. For 84 added scsi_debug pseudo disks the number
> >>>>>of sysfs nodes went from 686 to 3347.
> >>>>>
> >>>>>Does anybody know what is the per node memory cost of sysfs?
> >>>>>
> >>>>
> >>>>Let's see all of /proc/slabinfo please.
> >>>>
> >>>Andrew,
> >>>Attachments are /proc/slabinfo pre and post:
> >>> $ modprobe scsi_debug add_host=42 num_devs=2
> >>>which adds 84 pseudo disks.
> >>>
> >>>
> >>OK, thanks. So with 48 disks you've lost five megabytes to
> >>blkdev_requests
> >>and deadline_drq objects. With 4000 disks, you're toast. That's enough
> >>request structures to put 200 gigabytes of memory under I/O ;)
> >>
> >>We need to make the request structures dynamically allocated for other
> >>reasons (which I cannot immediately remember) but it didn't happen. I
> >>guess
> >>we have some motivation now.
> >>
> >
> >Here's a patch that makes the request allocation (and io scheduler
> >private data) dynamic, with upper and lower bounds of 4 and 256
> >respectively. The numbers are a bit random - the 4 will allow us to make
> >progress, but it might be a smidgen too low. Perhaps 8 would be good.
> >256 is twice as much as before, but that should be alright as long as
> >the io scheduler copes. BLKDEV_MAX_RQ and BLKDEV_MIN_RQ control these
> >two variables.
> >
> >We lose the old batching functionality, for now. I can resurrect that
> >if needed. It's a rough fit with the mempool; it doesn't _quite_ fit our
> >needs here. I'll probably end up doing a specialised block pool scheme
> >for this.
> >
> >Hasn't been tested all that much, it boots though :-)
> >
> Nice Jens. Very good in theory but I haven't looked at the
> code too much yet.
>
> Would it be possible to have all queues allocate out of
> the one global pool of free requests? This way you could
> have a big minimum (say 128) and a big maximum
> (say min(Mbytes, spindles)).
Well, not really; as far as I can see we _need_ a pool per queue. Imagine
a bio handed to raid that needs to be split to 6 different queues, but our
minimum is 4: deadlock possibility. It could probably be made to work;
however, I greatly prefer a per-queue reserve.
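To make the per-queue reserve concrete, a rough sketch (not the actual patch; request_cachep, the q->rq_pool field and blk_init_free_list() are placeholder names):

#define BLKDEV_MIN_RQ	4	/* per-queue reserve, guards forward progress */
#define BLKDEV_MAX_RQ	256	/* soft cap so one queue can't go nuts */

static kmem_cache_t *request_cachep;

static void *rq_pool_alloc(int gfp_mask, void *data)
{
	return kmem_cache_alloc(request_cachep, gfp_mask);
}

static void rq_pool_free(void *rq, void *data)
{
	kmem_cache_free(request_cachep, rq);
}

/*
 * every queue gets its own reserve, so a bio split across several
 * raid member queues can still make progress on each of them
 */
int blk_init_free_list(request_queue_t *q)
{
	q->rq_pool = mempool_create(BLKDEV_MIN_RQ, rq_pool_alloc,
				    rq_pool_free, NULL);
	return q->rq_pool ? 0 : -ENOMEM;
}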
> This way memory usage is decoupled from the number of
> queues, and busy spindles could make use of more
> available free requests.
>
> Oh and the max value can easily be runtime tunable, right?
Sure. However, they don't really mean _anything_. Max is just some
random number to prevent one queue going nuts, and could be completely
removed if the vm works perfectly. Beyond some limit there's little
benefit to doing that, though. But MAX could be runtime tunable. Min is
basically just to make sure we don't kill ourselves, I don't see any
point in making that runtime tunable. It's not really a tunable.
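A sketch of where the two numbers would bite on allocation (the request_list count bookkeeping and field names are assumptions for illustration, not the patch):

static struct request *get_request(request_queue_t *q, int rw, int gfp_mask)
{
	struct request *rq;
	struct request_list *rl = &q->rq;

	/* MAX is only a sanity cap on a runaway queue */
	if (rl->count[rw] >= BLKDEV_MAX_RQ)
		return NULL;

	/* mempool falls back on the BLKDEV_MIN_RQ reserve when slab fails */
	rq = mempool_alloc(q->rq_pool, gfp_mask);
	if (rq) {
		rl->count[rw]++;
		rq->q = q;
		rq->rl = rl;
	}
	return rq;
}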
--
Jens Axboe
Jens Axboe wrote:
>On Tue, Mar 25 2003, Nick Piggin wrote:
>
>>
>>Nice Jens. Very good in theory but I haven't looked at the
>>code too much yet.
>>
>>Would it be possible to have all queues allocate out of
>>the one global pool of free requests? This way you could
>>have a big minimum (say 128) and a big maximum
>>(say min(Mbytes, spindles)).
>>
>
>Well, not really; as far as I can see we _need_ a pool per queue. Imagine
>a bio handed to raid that needs to be split to 6 different queues, but our
>minimum is 4: deadlock possibility. It could probably be made to work;
>however, I greatly prefer a per-queue reserve.
>
OK yeah you are right there. In light of your comment below
I'm happy with that. I was mostly worried about queues being
restricted to a small maximum.
>
>
>>This way memory usage is decoupled from the number of
>>queues, and busy spindles could make use of more
>>available free requests.
>>
>>Oh and the max value can easily be runtime tunable, right?
>>
>
>Sure. However, they don't really mean _anything_. Max is just some
>random number to prevent one queue going nuts, and could be completely
>removed if the vm works perfectly. Beyond some limit there's little
>benefit to doing that, though. But MAX could be runtime tunable. Min is
>basically just to make sure we don't kill ourselves, I don't see any
>point in making that runtime tunable. It's not really a tunable.
>
OK, that's good then. I would like to see max removed, however perhaps
the VM isn't up to that yet. I'll be testing this when your code
solidifies!
On Tue, Mar 25 2003, Nick Piggin wrote:
> Jens Axboe wrote:
>
> >On Tue, Mar 25 2003, Nick Piggin wrote:
> >
> >>
> >>Nice Jens. Very good in theory but I haven't looked at the
> >>code too much yet.
> >>
> >>Would it be possible to have all queues allocate out of
> >>the one global pool of free requests? This way you could
> >>have a big minimum (say 128) and a big maximum
> >>(say min(Mbytes, spindles)).
> >>
> >
> >Well, not really; as far as I can see we _need_ a pool per queue. Imagine
> >a bio handed to raid that needs to be split to 6 different queues, but our
> >minimum is 4: deadlock possibility. It could probably be made to work;
> >however, I greatly prefer a per-queue reserve.
> >
> OK yeah you are right there. In light of your comment below
> I'm happy with that. I was mostly worried about queues being
> restricted to a small maximum.
Understandable, we'll make max tunable.
> >>This way memory usage is decoupled from the number of
> >>queues, and busy spindles could make use of more
> >>available free requests.
> >>
> >>Oh and the max value can easily be runtime tunable, right?
> >>
> >
> >Sure. However, they don't really mean _anything_. Max is just some
> >random number to prevent one queue going nuts, and could be completely
> >removed if the vm works perfectly. Beyond some limit there's little
> >benefit to doing that, though. But MAX could be runtime tunable. Min is
> >basically just to make sure we don't kill ourselves, I don't see any
> >point in making that runtime tunable. It's not really a tunable.
> >
> OK, that's good then. I would like to see max removed, however perhaps
> the VM isn't up to that yet. I'll be testing this when your code
> solidifies!
Only testing will tell, so yes you are very welcome to give it a shot.
Let me release a known working version first :)
--
Jens Axboe
On Tuesday 25 March 2003 04:35 am, Jens Axboe wrote:
> Only testing will tell, so yes you are very welcome to give it a shot.
> Let me release a known working version first :)
Jens,
I found what's using 32MB of the 8192-byte slab.
size-8192 before:10 after:4012 diff:4002 size:8192 incr:32784384
It is deadline_init():
dd->hash = kmalloc(sizeof(struct list_head)*DL_HASH_ENTRIES,GFP_KERNEL);
It is creating an 8K hash table for each queue. Since we have 4000 queues,
it used 32MB. I wonder why the current code needs 1024 hash buckets,
when the maximum number of requests is only 256. Also, since you are making
request allocation dynamic, can you change this too? Any issues here?
Thanks,
Badari
On Wed, Mar 26 2003, Badari Pulavarty wrote:
> On Tuesday 25 March 2003 04:35 am, Jens Axboe wrote:
>
> > Only testing will tell, so yes you are very welcome to give it a shot.
> > Let me release a known working version first :)
>
> Jens,
>
> I found what's using 32MB of the 8192-byte slab.
>
> size-8192 before:10 after:4012 diff:4002 size:8192 incr:32784384
>
> It is deadline_init():
>
> dd->hash = kmalloc(sizeof(struct list_head)*DL_HASH_ENTRIES,GFP_KERNEL);
>
> It is creating an 8K hash table for each queue. Since we have 4000 queues,
Yes
> it used 32MB. I wonder why the current code needs 1024 hash buckets,
Hmm, actually that's a leftover from when we played with bigger queue
sizes; I inadvertently forgot to change it back when pushing the rbtree
deadline update to Linus. It used to be 256. We can shrink this to 2^7
or 2^8 instead, which will then only eat 1-2K.
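In deadline-iosched.c terms the shrink would be on the order of (the exact shift is illustrative):

#define DL_HASH_SHIFT	8			/* was 10: 1024 buckets * 8 bytes = 8K per queue */
#define DL_HASH_ENTRIES	(1 << DL_HASH_SHIFT)	/* 256 buckets, ~2K per queue */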
> when the maximum number of requests is only 256. Also, since you are making
> request allocation dynamic, can you change this too? Any issues here?
No real issues with shrinking it; the bigger problem is if we move to larger
queues. With the rq-dyn-alloc patch, we can make the max number of
requests ceiling a lot higher, and then the hash needs to be bigger too.
But for now, a 256-entry hash should be a good default and suffice for the
future; I'll push that change.
--
Jens Axboe
Hi All,
I found the heavy users of the 2048-byte slab. 8MB doesn't seem like much,
but it all adds up.
size-2048 before:98 after:4095 diff:3997 size:2060 incr:8233820
The problem is with
sd.c: sd_attach()
alloc_disk(16)
alloc_disk() allocates the "hd_struct" structures for 15 minors.
So it is an 84*15 = 1260 byte allocation per disk, and they all come from
the 2048-byte slab. Since I have 4000 simulated disks, that uses up 8MB.
Proposed fixes:
1) Make the allocations come from their own slab, instead of the
2048-byte slab (~40% saving).
2) Instead of allocating hd_struct structures for all possible partitions,
why not allocate them dynamically as we see a partition? This
way we could (in theory) support more than 16 partitions, if needed.
Are there any issues with doing (2)?
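A rough sketch of what (2) could look like (add_partition()/delete_partition() are hypothetical helpers, and turning disk->part into an array of hd_struct pointers is an assumption of the sketch, not the actual patch):

/* allocate the hd_struct only when a partition is registered,
 * instead of 15 of them up front in alloc_disk() */
void add_partition(struct gendisk *disk, int part, sector_t start, sector_t len)
{
	struct hd_struct *p;

	p = kmalloc(sizeof(*p), GFP_KERNEL);
	if (!p)
		return;
	memset(p, 0, sizeof(*p));

	p->start_sect = start;
	p->nr_sects = len;
	disk->part[part - 1] = p;
}

void delete_partition(struct gendisk *disk, int part)
{
	struct hd_struct *p = disk->part[part - 1];

	if (p) {
		disk->part[part - 1] = NULL;
		kfree(p);
	}
}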
Thanks,
Badari
On Fri, Mar 28, 2003 at 09:04:41AM -0800, Badari Pulavarty wrote:
> 2) Instead of allocating hd_struct structures for all possible partitions,
> why not allocate them dynamically as we see a partition? This
> way we could (in theory) support more than 16 partitions, if needed.
This is what I plan to do.
Of course you are welcome to do it first.
Andries
On Friday 28 March 2003 10:41 am, Andries Brouwer wrote:
> On Fri, Mar 28, 2003 at 09:04:41AM -0800, Badari Pulavarty wrote:
> > 2) Instead of allocating hd_struct structures for all possible partitions,
> > why not allocate them dynamically as we see a partition? This
> > way we could (in theory) support more than 16 partitions, if needed.
>
> This is what I plan to do.
> Of course you are welcome to do it first.
>
> Andries
Okay!! Here is my patch to add hd_structs dynamically as we add partitions.
The machine boots fine, and I was able to add/delete partitions.
It is not polished yet, but any comments?
Thanks,
Badari