Hello!
I've upgraded a while ago to 2.4.19 and my box has been happy for the last 52
days (it's a dual PIII). Tonight while going through my logs, I've found
these:
Sep 25 22:18:41 bigip kernel: Warning - running *really* short on DMA buffers
Sep 25 22:18:47 bigip last message repeated 55 times
Sep 25 22:19:41 bigip last message repeated 71 times
I know where it's coming from (drivers/scsi/scsi_merge.c):
	/* scsi_malloc can only allocate in chunks of 512 bytes so
	 * round it up.
	 */
	SCpnt->sglist_len = (SCpnt->sglist_len + 511) & ~511;
	sgpnt = (struct scatterlist *) scsi_malloc(SCpnt->sglist_len);
	/*
	 * Now fill the scatter-gather table.
	 */
	if (!sgpnt) {
		/*
		 * If we cannot allocate the scatter-gather table, then
		 * simply write the first buffer all by itself.
		 */
		printk("Warning - running *really* short on DMA buffers\n");
		this_count = SCpnt->request.current_nr_sectors;
		goto single_segment;
	}
So I know that scsi_malloc failed; why, I don't know. I guess either the size
check if (len % SECTOR_SIZE != 0 || len > PAGE_SIZE) rejected the request, or
the free-run scan if ((dma_malloc_freelist[i] & (mask << j)) == 0) never found
a slot for any i, j (both tests come from scsi_dma.c).
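For reference, scsi_malloc() itself looks roughly like this (reconstructed from
memory of 2.4's scsi_dma.c, with the locking and free-sector accounting left
out, so treat the details as approximate). It returns NULL either when the size
check trips or when no free run of 512-byte sectors is left in the DMA pool,
which is what then triggers the warning in scsi_merge.c:

/* rough sketch of drivers/scsi/scsi_dma.c (2.4), details approximate */
extern unsigned int dma_sectors;          /* size of the pool, in 512-byte sectors */
extern unsigned int *dma_malloc_freelist; /* one bitmap word per pool page */
extern unsigned char **dma_malloc_pages;  /* the pages backing the pool */

void *scsi_malloc(unsigned int len)
{
	unsigned int nbits, mask;
	int i, j;

	/* only multiples of 512 bytes, and at most one page */
	if (len % SECTOR_SIZE != 0 || len > PAGE_SIZE)
		return NULL;

	nbits = len >> 9;
	mask = (1 << nbits) - 1;

	/* look for a run of nbits free sectors in any page of the pool */
	for (i = 0; i < dma_sectors / SECTORS_PER_PAGE; i++)
		for (j = 0; j <= SECTORS_PER_PAGE - nbits; j++)
			if ((dma_malloc_freelist[i] & (mask << j)) == 0) {
				dma_malloc_freelist[i] |= (mask << j);
				return (void *) (dma_malloc_pages[i] + (j << 9));
			}

	/* pool exhausted (or fragmented): the caller prints the warning */
	return NULL;
}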
It is easily reproducible though: I just have to start a huge file transfer
from a crappy ide drive (ie low throughput) to a scsi one and the result is
almost guaranteed.
I'm running a 2.4.19 + freeswan but I can easily get rid of freeswan if you
want me to (the kernel was compiled with gcc 2.96 20000731). The kernel has
nothing weird enabled (it runs netfilter and nfs).
I've got both ide and scsi drives with ext3 as the only fs, the scsi card
is an Adaptec AHA-2940U2/U2W (oem version) and has only 3 devices
connected:
Attached devices:
Host: scsi0 Channel: 00 Id: 00 Lun: 00
Vendor: QUANTUM Model: ATLAS 10K 18SCA Rev: UCIE
Type: Direct-Access ANSI SCSI revision: 03
Host: scsi0 Channel: 00 Id: 05 Lun: 00
Vendor: TEAC Model: CD-R55S Rev: 1.0R
Type: CD-ROM ANSI SCSI revision: 02
Host: scsi0 Channel: 00 Id: 06 Lun: 00
Vendor: HP Model: C1537A Rev: L706
Type: Sequential-Access ANSI SCSI revision: 02
The ide chip is the regular Intel Corp. 82801AA IDE (rev 02).
Here is the output of /proc/slabinfo a few minutes after the last line in the
logs:
slabinfo - version: 1.1 (SMP)
kmem_cache 80 80 244 5 5 1 : 252 126
fib6_nodes 9 226 32 2 2 1 : 252 126
ip6_dst_cache 16 40 192 2 2 1 : 252 126
ndisc_cache 1 30 128 1 1 1 : 252 126
ip_conntrack 84 132 352 12 12 1 : 124 62
tcp_tw_bucket 1 30 128 1 1 1 : 252 126
tcp_bind_bucket 22 226 32 2 2 1 : 252 126
tcp_open_request 0 0 96 0 0 1 : 252 126
inet_peer_cache 3 59 64 1 1 1 : 252 126
ip_fib_hash 23 226 32 2 2 1 : 252 126
ip_dst_cache 150 216 160 9 9 1 : 252 126
arp_cache 6 60 128 2 2 1 : 252 126
blkdev_requests 896 920 96 23 23 1 : 252 126
journal_head 293 390 48 5 5 1 : 252 126
revoke_table 8 253 12 1 1 1 : 252 126
revoke_record 0 0 32 0 0 1 : 252 126
dnotify cache 0 0 20 0 0 1 : 252 126
file lock cache 168 168 92 4 4 1 : 252 126
fasync cache 0 0 16 0 0 1 : 252 126
uid_cache 10 226 32 2 2 1 : 252 126
skbuff_head_cache 392 720 160 30 30 1 : 252 126
sock 97 148 928 37 37 1 : 124 62
sigqueue 29 29 132 1 1 1 : 252 126
cdev_cache 17 295 64 5 5 1 : 252 126
bdev_cache 12 118 64 2 2 1 : 252 126
mnt_cache 22 118 64 2 2 1 : 252 126
inode_cache 1390 2176 480 272 272 1 : 124 62
dentry_cache 1440 4620 128 154 154 1 : 252 126
filp 1455 1560 128 52 52 1 : 252 126
names_cache 2 2 4096 2 2 1 : 60 30
buffer_head 76691 77160 96 1929 1929 1 : 252 126
mm_struct 194 264 160 11 11 1 : 252 126
vm_area_struct 4303 4840 96 121 121 1 : 252 126
fs_cache 194 236 64 4 4 1 : 252 126
files_cache 130 153 416 17 17 1 : 124 62
signal_act 114 114 1312 38 38 1 : 60 30
size-131072(DMA) 0 0 131072 0 0 32 : 0 0
size-131072 0 0 131072 0 0 32 : 0 0
size-65536(DMA) 0 0 65536 0 0 16 : 0 0
size-65536 1 1 65536 1 1 16 : 0 0
size-32768(DMA) 0 0 32768 0 0 8 : 0 0
size-32768 3 3 32768 3 3 8 : 0 0
size-16384(DMA) 0 0 16384 0 0 4 : 0 0
size-16384 11 12 16384 11 12 4 : 0 0
size-8192(DMA) 0 0 8192 0 0 2 : 0 0
size-8192 9 10 8192 9 10 2 : 0 0
size-4096(DMA) 0 0 4096 0 0 1 : 60 30
size-4096 63 63 4096 63 63 1 : 60 30
size-2048(DMA) 0 0 2048 0 0 1 : 60 30
size-2048 252 282 2048 139 141 1 : 60 30
size-1024(DMA) 0 0 1024 0 0 1 : 124 62
size-1024 175 176 1024 44 44 1 : 124 62
size-512(DMA) 0 0 512 0 0 1 : 124 62
size-512 448 448 512 56 56 1 : 124 62
size-256(DMA) 0 0 256 0 0 1 : 252 126
size-256 263 270 256 18 18 1 : 252 126
size-128(DMA) 0 0 128 0 0 1 : 252 126
size-128 2128 2670 128 89 89 1 : 252 126
size-64(DMA) 0 0 64 0 0 1 : 252 126
size-64 718 2537 64 43 43 1 : 252 126
size-32(DMA) 0 0 32 0 0 1 : 252 126
size-32 1495 11413 32 101 101 1 : 252 126
And /proc/meminfo
total: used: free: shared: buffers: cached:
Mem: 394948608 389390336 5558272 0 15466496 310026240
Swap: 806068224 33927168 772141056
MemTotal: 385692 kB
MemFree: 5428 kB
MemShared: 0 kB
Buffers: 15104 kB
Cached: 298916 kB
SwapCached: 3844 kB
Active: 22192 kB
Inactive: 335208 kB
HighTotal: 0 kB
HighFree: 0 kB
LowTotal: 385692 kB
LowFree: 5428 kB
SwapTotal: 787176 kB
SwapFree: 754044 kB
So my question: known issue (like memory fragmentation) or bug (in which case
I would be glad to test any patches you want me to, or to give you
anything missing from this email)?
Regards, Mathieu.
Oh BTW, just one thing, I wanted to give the throughput of the ide drive
but it failed:
Sep 25 23:18:32 bigip kernel: hdb: dma_intr: status=0x51 { DriveReady SeekComplete Error }
Sep 25 23:18:32 bigip kernel: hdb: dma_intr: error=0x40 { UncorrectableError }, LBAsect=102882, sector=102784
I read my logs every day so I know for sure these messages are new (damn it
doesn't look good)...
--
Mathieu Chouquet-Stringer E-Mail : [email protected]
It is exactly because a man cannot do a thing that he is a
proper judge of it.
-- Oscar Wilde
On Wed, Sep 25 2002, Mathieu Chouquet-Stringer wrote:
> Hello!
>
> I've upgraded a while ago to 2.4.19 and my box has been happy for the last 52
> days (it's a dual PIII). Tonight while going through my logs, I've found
> these:
>
> Sep 25 22:18:41 bigip kernel: Warning - running *really* short on DMA buffers
> Sep 25 22:18:47 bigip last message repeated 55 times
> Sep 25 22:19:41 bigip last message repeated 71 times
This is fixed in 2.4.20-pre
> Oh BTW, just one thing, I wanted to give the throughput of the ide drive
> but it failed:
> Sep 25 23:18:32 bigip kernel: hdb: dma_intr: status=0x51 { DriveReady SeekComplete Error }
> Sep 25 23:18:32 bigip kernel: hdb: dma_intr: error=0x40 { UncorrectableError }, LBAsect=102882, sector=102784
Yep, looks like the end of the road for that drive.
--
Jens Axboe
Jens Axboe wrote:
> On Wed, Sep 25 2002, Mathieu Chouquet-Stringer wrote:
>
>> Hello!
>>
>>I've upgraded a while ago to 2.4.19 and my box has been happy for the last 52
>>days (it's a dual PIII). Tonight while going through my logs, I've found
>>these:
>>
>>Sep 25 22:18:41 bigip kernel: Warning - running *really* short on DMA buffers
>>Sep 25 22:18:47 bigip last message repeated 55 times
>>Sep 25 22:19:41 bigip last message repeated 71 times
>
>
> This is fixed in 2.4.20-pre
>
>
I reported this same problem some weeks ago -
http://marc.theaimsgroup.com/?l=linux-kernel&m=103069116227685&w=2 .
2.4.20pre kernels solved the error messages flooding the console, and
improved things a bit, but system load still got very high and disk read
and write performance was lousy. Adding more memory and using a
completely different machine didn't help. What did? Changing the Adaptec
scsi driver to aic7xxx_old. The performance was up 50% for writes and
90% for reads, and the system load was acceptable. And I didn't even have
to change the RedHat kernel (2.4.18-10) for a custom one. The storage
was two external Arena raid boxes, btw.
Regards,
Pedro
> I reported this same problem some weeks ago -
> http://marc.theaimsgroup.com/?l=linux-kernel&m=103069116227685&w=2 .
> 2.4.20pre kernels solved the error messages flooding the console, and
> improved things a bit, but system load still got very high and disk read
> and write performance was lousy. Adding more memory and using a
> completely different machine didn't help. What did? Changing the Adaptec
> scsi driver to aic7xxx_old. The performance was up 50% for writes and
> 90% for reads, and the system load was acceptable. And I didn't even have
> to change the RedHat kernel (2.4.18-10) for a custom one. The storage was
> two external Arena raid boxes, btw.
I would be interested in knowing if reducing the maximum tag depth for
the driver improves things for you. There is a large difference in the
defaults between the two drivers. It has only recently come to my
attention that the SCSI layer per-transaction overhead is so high that
you can completely starve the kernel of resources if this setting is too
high. For example, a 4GB system installing RedHat 7.3 could not even
complete an install on a 20 drive system with the default of 253 commands.
The latest version of the aic7xxx driver already sent to Marcelo drops the
default to 32.
--
Justin
On Thu, Sep 26 2002, Justin T. Gibbs wrote:
> > I reported this same problem some weeks ago -
> > http://marc.theaimsgroup.com/?l=linux-kernel&m=103069116227685&w=2 .
> > 2.4.20pre kernels solved the error messages flooding the console, and
> > improved things a bit, but system load still got very high and disk read
> > and write performance was lousy. Adding more memory and using a
> > completely different machine didn't help. What did? Changing the Adaptec
> > scsi driver to aic7xxx_old. The performance was up 50% for writes and
> > 90% for reads, and the system load was acceptable. And I didn't even have
> > to change the RedHat kernel (2.4.18-10) for a custom one. The storage was
> > two external Arena raid boxes, btw.
>
> I would be interested in knowing if reducing the maximum tag depth for
> the driver improves things for you. There is a large difference in the
> defaults between the two drivers. It has only recently come to my
> attention that the SCSI layer per-transaction overhead is so high that
> you can completely starve the kernel of resources if this setting is too
> high. For example, a 4GB system installing RedHat 7.3 could not even
> complete an install on a 20 drive system with the default of 253 commands.
> The latest version of the aic7xxx driver already sent to Marcelo drops the
> default to 32.
2.4 layer is most horrible there, 2.5 at least gets rid of the old
scsi_dma crap. That said, 253 default depth is a bit over the top, no?
--
Jens Axboe, who always uses 4
On Thu, Sep 26 2002, Matthew Jacob wrote:
>
> > 2.4 layer is most horrible there, 2.5 at least gets rid of the old
> > scsi_dma crap. That said, 253 default depth is a bit over the top, no?
>
> Why? Something like a large Hitachi 9*** storage system can take ~1000
> tags w/o wincing.
Yeah, I bet that most of the devices attached to aic7xxx controllers are
exactly such beasts.
I didn't say that 253 is a silly default for _everything_, I think it's
a silly default for most users.
--
Jens Axboe
> 2.4 layer is most horrible there, 2.5 at least gets rid of the old
> scsi_dma crap. That said, 253 default depth is a bit over the top, no?
Why? Something like a large Hitachi 9*** storage system can take ~1000
tags w/o wincing.
> > > scsi_dma crap. That said, 253 default depth is a bit over the top, no?
> >
> > Why? Something like a large Hitachi 9*** storage system can take ~1000
> > tags w/o wincing.
>
> Yeah, I bet that most of the devices attached to aic7xxx controllers are
> exactly such beasts.
>
> I didn't say that 253 is a silly default for _everything_, I think it's
> a silly default for most users.
>
Well, no, I'm not sure I agree. In the expected lifetime of this
particular set of software that gets shipped out, the next generation of
100GB or better disk drives will be attached, and they'll likely eat all
of that many tags too, and usefully, considering the speed and bit
density of drives. For example, the current U160 Fujitsu drives will
take ~130 tags before sending back a QFULL.
On the other hand, we can also find a large class of existing devices
and situations where anything over 4 tags is overload too.
With some perspective on this, I'd have to say that in the last 25 years
I've seen more errors on the side of 'too conservative' for limits
rather than the opposite.
That said, the only problem with allowing such generous limits is the
impact on a system that lets itself be saturated the way this one does.
Getting that fixed is more important than saying a driver writer's choice
of limits is 'over the top'.
On Thu, Sep 26 2002, Matthew Jacob wrote:
>
> > > > scsi_dma crap. That said, 253 default depth is a bit over the top, no?
> > >
> > > Why? Something like a large Hitachi 9*** storage system can take ~1000
> > > tags w/o wincing.
> >
> > Yeah, I bet that most of the devices attached to aic7xxx controllers are
> > exactly such beasts.
> >
> > I didn't say that 253 is a silly default for _everything_, I think it's
> > a silly default for most users.
> >
>
> Well, no, I'm not sure I agree. In the expected lifetime of this
> particular set of software that gets shipped out, the next generation of
> 100GB or better disk drives will be attached, and they'll likely eat all
> of that many tags too, and usefully, considering the speed and bit
> density of drives. For example, the current U160 Fujitsu drives will
> take ~130 tags before sending back a QFULL.
Just because a device can eat XXX number of tags does definitely _not_
make it a good idea. At least not if you care the slightest bit about
latency.
> On the other hand, we can also find a large class of existing devices
> and situations where anything over 4 tags is overload too.
>
> With some perspective on this, I'd have to say that in the last 25 years
> I've seen more errors on the side of 'too conservative' for limits
> rather than the opposite.
At least for this tagging discussion, I'm of the exact opposite
opinion. What's the worst that can happen with a tag setting that is
too low? Theoretical loss of disk bandwidth. I say theoretical, because
it's not even given that tags are that much faster than the Linux io
scheduler. More tags might even give you _worse_ throughput, because you
end up leaving the io scheduler with much less to work on (if you have a
253 depth to your device, you have 3 requests left for the queue...).
So I think the 'more tags the better!' belief is very much bogus, at
least for the common case.
--
Jens Axboe
>
> So I think the 'more tags the better!' belief is very much bogus, at
> least for the common case.
Well, that's one theory.
On Fri, Sep 27 2002, Matthew Jacob wrote:
> >
> > So I think the 'more tags the better!' belief is very much bogus, at
> > least for the common case.
>
> Well, that's one theory.
Numbers talk, theory spinning walks
Both Andrew and I did latency numbers for even small depths of tagging,
and the result was not pretty. Sure this is just your regular plain old
SCSI drives, however that's also what I care most about. People with
big-ass hardware tend to find a way to tweak them as well, I'd like the
typical systems to run fine out of the box though.
--
Jens Axboe
The issue here is not whether it's appropriate to oversaturate the
'standard' SCSI drive- it isn't- I never suggested it was.
I'd just suggest that it's asinine to criticise an HBA for running up to
reasonable limits when it's the non-toy OS that will do sensible I/O
scheduling. So point your gums elsewhere.
On Fri, 27 Sep 2002, Jens Axboe wrote:
> On Fri, Sep 27 2002, Matthew Jacob wrote:
> > >
> > > So I think the 'more tags the better!' belief is very much bogus, at
> > > least for the common case.
> >
> > Well, that's one theory.
>
> Numbers talk, theory spinning walks
>
> Both Andrew and I did latency numbers for even small depths of tagging,
> and the result was not pretty. Sure this is just your regular plain old
> SCSI drives, however that's also what I care most about. People with
> big-ass hardware tend to find a way to tweak them as well, I'd like the
> typical systems to run fine out of the box though.
>
Fair enough.
On Fri, Sep 27 2002, Matthew Jacob wrote:
>
> The issue here is not whether it's appropriate to oversaturate the
> 'standard' SCSI drive- it isn't- I never suggested it was.
Ok so we agree. I think our oversaturation thresholds are different,
though.
> I'd just suggest that it's asinine to criticise an HBA for running up to
> reasonable limits when it's the non-toy OS that will do sensible I/O
> scheduling. So point your gums elsewhere.
Well I don't think 253 is a reasonable limit, that was the whole point.
How can sane io scheduling ever prevent starvation in that case? I can't
point my gums elsewhere, this is where I'm seeing starvation.
--
Jens Axboe
> On Fri, Sep 27 2002, Matthew Jacob wrote:
> >
> > The issue here is not whether it's appropriate to oversaturate the
> > 'standard' SCSI drive- it isn't- I never suggested it was.
>
> Ok so we agree. I think our oversaturation thresholds are different,
> though.
I think we simply disagree as to where to put them. See below.
>
> > I'd just suggest that it's asinine to criticise an HBA for running up to
> > reasonable limits when it's the non-toy OS that will do sensible I/O
> > scheduling. So point your gums elsewhere.
>
> Well I don't think 253 is a reasonable limit, that was the whole point.
> How can sane io scheduling ever prevent starvation in that case? I can't
> point my gums elsewhere, this is where I'm seeing starvation.
You're in starvation because the I/O midlayer and buffer cache are
allowing you to build enough transactions on one bus to impact system
response times. This is an old problem with Linux that comes and goes
(as it has come and gone for most systems). There are a number of
possible solutions to this problem- but because this is in 2.4 perhaps
the most sensible one is to limit how much you *ask* from an HBA,
perhaps based upon even as vague a set of parameters as CPU speed and
available memory divided by the number of (n-scsibus/total spindles).
It's the job of the HBA driver to manage resources on the HBA and on
the bus the HBA interfaces to. If the HBA and its driver can efficiently
manage 1000 concurrent commands per lun and 16384 luns per target and
500 'targets' in a fabric, let it.
Let oversaturation of a *spindle* be informed by device quirks and the
rate of QFULLs received, or even, if you will, by finding the knee in
the per-command latency curve (if you can and you think that it's
meaningful). Let oversaturation of the system be done elsewhere- let the
buffer cache manager and system policy knobs decide whether it matters
that the AIC driver is so busy moving I/O that the user can't get window
focus onto the window in NP-complete time to kill the runaway tar.
Sorry- an overlong response. It *is* easier to just say "well, 'fix' the
HBA driver so it doesn't allow the system to get too busy or
overloaded". But it seems to me that this is even best solved in the
midlayer which should, in fact, know best (better than a single HBA).
-matt
On Fri, Sep 27 2002, Matthew Jacob wrote:
> > > I'd just suggest that it's asinine to criticise an HBA for running up to
> > > reasonable limits when it's the non-toy OS that will do sensible I/O
> > > scheduling. So point your gums elsewhere.
> >
> > Well I don't think 253 is a reasonable limit, that was the whole point.
> > How can sane io scheduling ever prevent starvation in that case? I can't
> > point my gums elsewhere, this is where I'm seeing starvation.
>
> You're in starvation because the I/O midlayer and buffer cache are
> allowing you to build enough transactions on one bus to impact system
> response times. This is an old problem with Linux that comes and goes
That's one type of starvation, but that's one that can be easily
prevented by the os. Not a problem.
The starvation I'm talking about is the drive starving requests. Just
keep it moderately busy (10-30 outstanding tags), and a read can take a
long time to complete. The hba can try to prevent this from happening by
issuing every Xth request as an ordered tag, however the fact that we
even need to consider doing this suggests to me that something is
broken. That a drive can starve a single request for that long is _bad_.
Issuing every 1024th request as ordered helps absolutely zip for good
interactive behaviour. It might help on a single request basis, but good
latency feel typically requires more than that.
We _can_ try and prevent this type of starvation. If we encounter a
read, don't queue any more writes to the drive before it completes.
That's some simple logic that will probably help a lot. This is the type
of thing the deadline io scheduler tries to do. This is working around
broken drive firmware in my opinion, the drive shouldn't be starving
requests like that.
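To make the idea concrete, the logic I have in mind is something like this
(just a sketch with made-up names and types, not the actual deadline io
scheduler code):

/* hypothetical, simplified types just to illustrate the idea */
struct req { int is_read; struct req *next; };
struct queue { struct req *pending; int reads_in_flight; };

/* pick the next request to send to the drive; the caller unlinks it */
static struct req *next_to_dispatch(struct queue *q)
{
	struct req *r;

	/* a pending read always goes first */
	for (r = q->pending; r; r = r->next)
		if (r->is_read) {
			q->reads_in_flight++;	/* dropped again in the completion handler */
			return r;
		}

	/* a read is still out at the drive: hold writes back until it completes */
	if (q->reads_in_flight)
		return NULL;

	/* nothing to favour, writes may go */
	return q->pending;
}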
However, it's stupid to try and work around a problem if we can simply
prevent the problem in the first place. What is the problem? It's lots
of tags causing the drive to starve requests. Do we need lots of tags?
In my experience 4 tags is more than plenty for a typical drive, there's
simply _no_ benefit from going beyond that. It doesn't buy you extra
throughput, it doesn't buy you better io scheduling (au contraire). All
it gets you is lots of extra latency. So why would I want lots of tags
on a typical scsi drive? I don't.
> (as it has come and gone for most systems). There are a number of
> possible solutions to this problem- but because this is in 2.4 perhaps
> the most sensible one is to limit how much you *ask* from an HBA,
> perhaps based upon even as vague a set of parameters as CPU speed and
> available memory divided by the number of (n-scsibus/total spindles).
This doesn't make much sense to me. Why would the CPU speed and
available memory impact this at all? We don't want to deplete system
resources (the case Justin mentioned), of course, but beyond that I
don't think it makes much sense.
> It's the job of the HBA driver to manage resources on the HBA and on
> the bus the HBA interfaces to. If the HBA and its driver can efficiently
> manage 1000 concurrent commands per lun and 16384 luns per target and
> 500 'targets' in a fabric, let it.
Yes, if the hba and its driver _and_ the target can _efficiently_ handle
it, I'm all for it. Again you seem to be comparing the typical scsi hard
drive to more esoteric devices. I'll note again that I'm not talking
about such devices.
> Let oversaturation of a *spindle* be informed by device quirks and the
> rate of QFULLs received, or even, if you will, by finding the knee in
If device quirks are that 90% (pulling this number out of my ass) of
scsi drives use pure internal sptf scheduling and thus heavily starve
requests, then why bother? Queue full contains no latency information.
> the per-command latency curve (if you can and you think that it's
I can try to get a decent default. 253 clearly isn't it, far from it.
> meaningful). Let oversaturation of the system be done elsewhere- let the
> buffer cache manager and system policy knobs decide whether it matters
> that the AIC driver is so busy moving I/O that the user can't get window
> focus onto the window in NP-complete time to kill the runaway tar.
This is not a problem with the vm flooding a spindle. We want it to be
flooded, the more we can shove into the io scheduler to work with, the
better chance it has of doing a good job.
> Sorry- an overlong response. It *is* easier to just say "well, 'fix' the
> HBA driver so it doesn't allow the system to get too busy or
> overloaded". But it seems to me that this is even best solved in the
> midlayer which should, in fact, know best (better than a single HBA).
Agrh. Who's saying 'fix' the hba driver? Either I'm not expressing
myself very clearly, or you are simply not reading what I write.
--
Jens Axboe
[ .. all sorts of nice discussion, but not on our argument point ]
>
> Agrh. Who's saying 'fix' the hba driver? Either I'm not expressing
> myself very clearly, or you are simply not reading what I write.
I (foolishly) leapt in when you said "253 is 'over the top'". You seemed
to imply that the aic7xxx driver was at fault and should be limiting the
amount it is sending out. My (mostly) only beef with what you've written
is with that implication- mainly as "don't send so many damned commands
if you think they're too many". If the finger pointing at aic7xxx is not
what you're implying, then this has been a waste of email bandwidth-
sorry.
-matt
Justin T. Gibbs wrote:
>> I reported this same problem some weeks ago -
>>http://marc.theaimsgroup.com/?l=linux-kernel&m=103069116227685&w=2 .
>>2.4.20pre kernels solved the error messages flooding the console, and
>>improved things a bit, but system load still got very high and disk read
>>and write performance was lousy. Adding more memory and using a
>>completely different machine didn't help. What did? Changing the Adaptec
>>scsi driver to aic7xxx_old. The performance was up 50% for writes and
>>90% for reads, and the system load was acceptable. And I didn't even have
>>to change the RedHat kernel (2.4.18-10) for a custom one. The storage was
>>two external Arena raid boxes, btw.
>
>
> I would be interested in knowing if reducing the maximum tag depth for
> the driver improves things for you. There is a large difference in the
> defaults between the two drivers. It has only recently come to my
> attention that the SCSI layer per-transaction overhead is so high that
> you can completely starve the kernel of resources if this setting is too
> high. For example, a 4GB system installing RedHat 7.3 could not even
> complete an install on a 20 drive system with the default of 253 commands.
> The latest version of the aic7xxx driver already sent to Marcelo drops the
> default to 32.
>
> --
> Justin
I have a server available to test it, but the storage in question is
already deployed. Yet, by luck (irony aside) I have a maintenance window
this weekend for tuning and other matters, so I can decrease the maximum
number of TCQ commands per device in the proper aic7xxx driver to 32 and
report on the results. While trying to solve this problem I browsed
RedHat's bugzilla, and there were several people burned by this
problem. Hope this sorts it out for them.
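(For anyone else wanting to try the same thing without patching: if I remember
the aic7xxx documentation correctly, the depth can also be capped from the
kernel command line or as a module option with something along these lines,
one value per target ID per controller. The exact syntax is from memory, so
check the driver's README before relying on it.)

aic7xxx=tag_info:{{32,32,32,32,32,32,32,32}}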
/Pedro
On Fri, Sep 27 2002, Matthew Jacob wrote:
>
> [ .. all sorts of nice discussion, but not on our argument point ]
> >
> > Agrh. Who's saying 'fix' the hba driver? Either I'm not expressing
> > myself very clearly, or you are simply not reading what I write.
>
> I (foolishly) leapt in when you said "253 is 'over the top'". You seemed
> to imply that the aic7xxx driver was at fault and should be limiting the
> amount it is sending out. My (mostly) only beef with what you've written
> is with that implication- mainly as "don't send so many damned commands
> if you think they're too many". If the finger pointing at aic7xxx is not
> what you're implying, then this has been a waste of email bandwidth-
> sorry.
It's not aimed at any specific hba driver, it could be any. 253 would be
over the top for any of them, it just so happens that aic7xxx has this
as the default :-)
So while it is definitely not the aic7xxx driver doing the starvation
(it's the device), the aic7xxx driver is (in my opinion) somewhat at
fault for setting it so high _as a default_.
Hopefully that's the end of this thread :)
--
Jens Axboe
> The starvation I'm talking about is the drive starving requests. Just
> keep it moderately busy (10-30 outstanding tags), and a read can take a
> long time to complete.
As I tried to explain to Andrew just the other day, this is neither a
drive nor HBA problem. You've essentially constructed a benchmark where
a single process can get so far ahead of the I/O subsystem in terms of
buffered writes that there is no choice but for there to be a long delay
for the device to handle your read. Consider that because you are queuing,
the drive will completely fill its cache with write data that is pending
to hit the platters. The number of transactions in the cache is marginally
dependant on the number of tags in use since that will affect the ability
of the controller to saturate the drive cache with write data. Depending
on your drive, mode page settings, etc, the drive may allow your read to
pass the write, but many do not. So you have to wait for the cache to
at least have space to handle your read and perhaps have even additional
write data flush before your read can even be started. If you don't like
this behavior, which actually maximizes the throughput of the device, have
the I/O scheduler hold back a single processes from creating such a large
backlog. You can also read the SCSI spec and tune your disk to behave
differently.
Now consider the read case. I maintain that any reasonable drive will
*always* outperform the OS's transaction reordering/elevator algorithms
for seek reduction. This is the whole point of having high tag depths.
In all I/O studies that have been performed to date, reads far outnumber
writes *unless* you are creating an ISO image on your disk. In my opinion
it is much more important to optimize for the more common, concurrent
read case, than it is for the sequential write case with intermittent
reads. Of course, you can fix the latter case too without any change to
the driver's queue depth as outlined above. Why not have your cake and
eat it too?
--
Justin
On Fri, Sep 27 2002, Justin T. Gibbs wrote:
> > The starvation I'm talking about is the drive starving requests. Just
> > keep it moderately busy (10-30 outstanding tags), and a read can take a
> > long time to complete.
>
> As I tried to explain to Andrew just the other day, this is neither a
> drive nor HBA problem. You've essentially constructed a benchmark where
> a single process can get so far ahead of the I/O subsystem in terms of
> buffered writes that there is no choice but for there to be a long delay
> for the device to handle your read. Consider that because you are queuing,
> the drive will completely fill its cache with write data that is pending
> to hit the platters. The number of transactions in the cache is marginally
> dependent on the number of tags in use since that will affect the ability
> of the controller to saturate the drive cache with write data. Depending
> on your drive, mode page settings, etc, the drive may allow your read to
> pass the write, but many do not. So you have to wait for the cache to
> at least have space to handle your read and perhaps have even additional
> write data flush before your read can even be started. If you don't like
> this behavior, which actually maximizes the throughput of the device, have
> the I/O scheduler hold back a single processes from creating such a large
> backlog. You can also read the SCSI spec and tune your disk to behave
> differently.
If the above is what has been observed in the real world, then there
would be no problem. Lets say I have 32 tags pending, all writes. Now I
issue a read. Then I go ahead and throw my writes at the drive,
basically keeping it at 32 tags all the time. When will this read
complete? The answer is, well it might not within any reasonable time,
because the drive happily starves the read to get the best write
throughput.
The size of the dirty cache back log, or whatever you want to call it,
does not matter _at all_. I don't know why both you and Matt keep
bringing that point up. The 'back log' is just that, it will be
processed in due time. If a read comes in, the io scheduler will decide
it's the most important thing on earth. So I may have 1 gig of dirty
cache waiting to be flushed to disk, that _does not_ mean that the read
that now comes in has to wait for the 1 gig to be flushed first.
> Now consider the read case. I maintain that any reasonable drive will
> *always* outperform the OS's transaction reordering/elevator algorithms
> for seek reduction. This is the whole point of having high tag depths.
Well given that the drive has intimate knowledge of itself, then yes of
course it is the only one that can order any number of pending requests
most optimally. So the drive might provide the best layout of requests
when it comes to total number of seek time spent, and throughput. But
often at the cost of increased (sometimes much, see the trivial
examples given) latency.
However, I maintain that going beyond any reasonable number of tags for
a standard drive is *stupid*. The Linux io scheduler gets very good
performance without any queueing at all. Going from 4 to 64 tags gets
you very very little increase in performance, if any at all.
> In all I/O studies that have been performed to date, reads far outnumber
> writes *unless* you are creating an ISO image on your disk. In my opinion
Well it's my experience that it's pretty balanced, at least for my own
workload. atime updates and compiles etc put a nice load on writes.
> it is much more important to optimize for the more common, concurrent
> read case, than it is for the sequential write case with intermittent
> reads. Of course, you can fix the latter case too without any change to
> the driver's queue depth as outlined above. Why not have your cake and
> eat it too?
If you care to show me this cake, I'd be happy to devour it. I see
nothing even resembling a solution to this problem in your email, except
for you saying above that I should ignore it and optimize for 'the common'
concurrent read case.
It's pointless to argue that tagging is oh so great and always
outperforms the os io scheduler, and that we should just use 253 tags
because the drive knows best, when several examples have shown that this
is _not the case_.
--
Jens Axboe
On Fri, 27 Sep 2002, Justin T. Gibbs wrote:
> writes *unless* you are creating an ISO image on your disk. In my opinion
> it is much more important to optimize for the more common, concurrent
> read case, than it is for the sequential write case with intermittent
> reads.
You're missing the point. The only reason the reads are
intermittent is that the application can't proceed until
the read is done and the read is being starved by writes.
If the read was serviced immediately, the next read could
get scheduled quickly and they wouldn't be intermittent.
Rik
--
Bravely reimplemented by the knights who say "NIH".
http://www.surriel.com/ http://distro.conectiva.com/
Spamtraps of the month: [email protected] [email protected]
> If you don't like this behavior, which actually maximizes the
> throughput of the device, have the I/O scheduler hold back a single
> processes from creating such a large backlog.
Justin and I are (for once) in 100% agreement.
On Fri, Sep 27 2002, Matthew Jacob wrote:
>
> > If you don't like this behavior, which actually maximizes the
> > throughput of the device, have the I/O scheduler hold back a single
> > processes from creating such a large backlog.
>
>
> Justin and I are (for once) in 100% agreement.
Well Justin and you are both, it seems, missing the point.
I'm now saying for the 3rd time, that there's zero problem in having a
huge dirty cache backlog. This is not the problem, please disregard any
reference to that. Count only the time spent for servicing a read
request, _from when it enters the drive_ and until it completes. IO
scheduler is _not_ involved.
--
Jens Axboe
> If the above is what has been observed in the real world, then there
> would be no problem. Lets say I have 32 tags pending, all writes. Now I
> issue a read. Then I go ahead and throw my writes at the drive,
> basically keeping it at 32 tags all the time. When will this read
> complete? The answer is, well it might not within any reasonable time,
> because the drive happily starves the read to get the best write
> throughput.
Just because you use 32 or 4 or 8 or whatever tags you cannot know the
number of commands still in the drive's cache. Have you disabled
(turned off) the WCE bit on your drive and retested your latency numbers?
> The size of the dirty cache back log, or whatever you want to call it,
> does not matter _at all_. I don't know why both you and Matt keep
> bringing that point up. The 'back log' is just that, it will be
> processed in due time. If a read comes in, the io scheduler will decide
> it's the most important thing on earth. So I may have 1 gig of dirty
> cache waiting to be flushed to disk, that _does not_ mean that the read
> that now comes in has to wait for the 1 gig to be flushed first.
But it does matter. If a single process can fill the drive's or array's
cache with silly write data as well as have all outstanding tags busy
on its writes, you will incur a significant delay. No single process
should be allowed to do that. It doesn't matter that the read becomes
the most important thing on earth to the OS, you can't take back what
you've already issued to the device. Sorry. It doesn't work that
way.
> However, I maintain that going beyond any reasonable number of tags for
> a standard drive is *stupid*. The Linux io scheduler gets very good
> performance without any queueing at all. Going from 4 to 64 tags gets
> you very very little increase in performance, if any at all.
Under what benchmarks? http load? Squid, News, or mail simulations?
All I've seen are examples crafted to prove your point that I don't
think mirror real world workloads.
>> In all I/O studies that have been performed to date, reads far outnumber
>> writes *unless* you are creating an ISO image on your disk. In my
>> opinion
>
> Well it's my experience that it's pretty balanced, at least for my own
> workload. atime updates and compiles etc put a nice load on writes.
These are very different from the "benchmark" I've seen used in this
discussion:
dd if=/dev/zero of=somefile bs=1M &
cat somefile.
Have you actually timed some of your common activities (say a full
build of the Linux kernel w/modules) at different tag depths, with
or without write caching enabled, etc?
> If you care to show me this cake, I'd be happy to devour it. I see
> nothing even resembling a solution to this problem in your email, except
> from you above saying I should ignore it and optimize for 'the common'
> concurrent read case.
Take a look inside Tru64 (I believe there are a few papers about this)
to see how to use command response times to modulate device workload.
FreeBSD has several algorithms in its VM to prevent a single process
from holding onto too many dirty buffers. FreeBSD, Solaris, Tru64,
even WindowsNT have effective algorithms for sanely retiring dirty
buffers without saturating the system. All of these algorithms have
been discussed at length in conference papers. You just need to go
do a google search. None of these issues are new and the solutions
are not novel.
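The common shape of those algorithms is easy to sketch (hypothetical names,
not code lifted from any of the systems above): a writer that has dirtied more
than its share of buffers is simply made to wait for writeback before it may
dirty anything else, so no single process can build the kind of backlog being
discussed here.

/* hypothetical sketch of per-process dirty-buffer throttling */
struct task { int ndirty; };	/* buffers dirtied by this process, not yet written back */

static void account_dirtied_buffer(struct task *t, int per_task_limit)
{
	t->ndirty++;	/* the flusher decrements this as it retires the buffers */

	/* a writer that gets too far ahead of the disk sleeps until the
	 * flusher has caught up, instead of dirtying ever more buffers */
	while (t->ndirty > per_task_limit)
		wait_for_writeback_progress();	/* hypothetical: blocks until writeback advances */
}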
> It's pointless to argue that tagging is oh so great and always
> outperforms the os io scheduler, and that we should just use 253 tags
> because the drive knwos best, when several examples have shown that this
> is _not the case_.
You are trying to solve these problems at the wrong level.
--
Justin
> I'm now saying for the 3rd time, that there's zero problem in having a
> huge dirty cache backlog. This is not the problem, please disregard any
> reference to that. Count only the time spent for servicing a read
> request, _from when it enters the drive_ and until it completes. IO
> scheduler is _not_ involved.
On the drive? That's all I've been saying.
--
Justin
On Fri, 27 Sep 2002, Justin T. Gibbs wrote:
> FreeBSD has several algorithms in its VM to prevent a single process
> from holding onto too many dirty buffers. FreeBSD, Solaris, Tru64,
> even WindowsNT have effective algorithms for sanely retiring dirty
> buffers without saturating the system.
I guess those must be bad for dbench, bonnie or other critical
server applications ;)
*runs like hell*
Rik
--
Bravely reimplemented by the knights who say "NIH".
http://www.surriel.com/ http://distro.conectiva.com/
Spamtraps of the month: [email protected] [email protected]