FUJITA Tomonori wrote:
[...]
> Do we really need this static array? Allocating struct ctlr_info
> dynamically is fine?
Should be no problem to fix that.
[...]
> > + .this_id = -1,
> > + .sg_tablesize = MAXSGENTRIES,
>
> MAXSGENTRIES (32) is the limitation of hardware? If not, it might be
> better to enlarge this for better performance?
Yes, definitely, though this value varies from controller to controller,
so this is just a default value that needs to be overridden, probably
in hpsa_scsi_detect().
[...]
> > + .cmd_per_lun = 512,
> > + .use_clustering = DISABLE_CLUSTERING,
>
> Why can we use ENABLE_CLUSTERING here? We would get the better
> performance with ENABLE_CLUSTERING.
Yes, we should do that. BTW, the comments in include/linux/scsi_host.h
don't do a very good job of describing exactly what use_clustering is for,
they say:
/*
* True if this host adapter can make good use of clustering.
* I originally thought that if the tablesize was large that it
* was a waste of CPU cycles to prepare a cluster list, but
* it works out that the Buslogic is faster if you use a smaller
* number of segments (i.e. use clustering). I guess it is
* inefficient.
*/
It never actually tells you what is meant by "clustering"
[...]
> > +static inline void set_bit_in_array(__u8 bitarray[], int bit)
> > +{
> > + bitarray[bit >> 3] |= (1 << (bit & 0x07));
> > +}
>
> Can not we use the standard bit operation functions instead?
Yeah, that should be no problem. I was thinking as I typed that
code in, "there's probably some way already defined to do this",
meaning to fix it later, but then I forgot to fix it.
[...]
> > + use_sg = scsi_dma_map(cmd);
> > + if (!use_sg)
> > + goto sglist_finished;
>
> We need to handle dma mapping failure here; scsi_dma_map could fail.
Grepping around a bit in drivers scsi I see some drivers do this:
SCpnt->result = DID_ERROR << 16;
then call the scsi done function,
some drivers call BUG_ON() when scsi_dma_map() returns -1,
and some do nothing.
I'm guessing setting result = DID_ERROR << 16 and calling
the done function is the way to go, right?
[...]
> > + /* Get the ptr to our adapter structure (hba[i]) out of cmd->host. */
> > + h = (struct ctlr_info *) cmd->device->host->hostdata[0];
>
> Let's use shost_priv().
Oh, ok. I think maybe that didn't exist when I first wrote that code.
[...]
> > + /* Need a lock as this is being allocated from the pool */
> > + spin_lock_irqsave(&h->lock, flags);
> > + cp = cmd_alloc(h);
> > + spin_unlock_irqrestore(&h->lock, flags);
> > + if (cp == NULL) { /* trouble... */
>
> We run out of commands here. Returning SCSI_MLQUEUE_HOST_BUSY is
> appropriate here, I think.
Ok.
[...]
>
> But if we allocate shost->can_queue at startup, we can't run out of
> commands.
Yeah, we shouldn't run out. That check is kind of paranoia.
[...]
> > +}
> > +#endif /* CONFIG_PROC_FS */
>
> We really need this? Creating something under /proc is not good. Using
> /sys/class/scsi_host/ is the proper way. If we remove the overlap
> between hpsa and cciss, we can do the proper way, I think.
We can take it out. We figured we'd take it out when
someone complained, which we figured would probably
happen pretty much immediately.
[...]
> > +/*
> > + * For operations that cannot sleep, a command block is allocated at init,
> > + * and managed by cmd_alloc() and cmd_free() using a simple bitmap to track
> > + * which ones are free or in use. Lock must be held when calling this.
> > + * cmd_free() is the complement.
> > + */
> > +static struct CommandList_struct *cmd_alloc(struct ctlr_info *h)
> > +{
> > + struct CommandList_struct *c;
> > + int i;
> > + union u64bit temp64;
> > + dma_addr_t cmd_dma_handle, err_dma_handle;
> > +
> > + do {
> > + i = find_first_zero_bit(h->cmd_pool_bits, h->nr_cmds);
> > + if (i == h->nr_cmds)
> > + return NULL;
> > + } while (test_and_set_bit
> > + (i & (BITS_PER_LONG - 1),
> > + h->cmd_pool_bits + (i / BITS_PER_LONG)) != 0);
>
> Using bitmap to manage free commands looks too complicated a bit to
> me. Can we just use lists for command management?
Hmm, this doesn't seem all that complicated to me, and this code snippet
has been pretty stable for about 10 years. it's nearly identical to what's in
cpqarray in the 2.2.13 kernel from 1999:
do {
i = find_first_zero_bit(h->cmd_pool_bits, NR_CMDS);
if (i == NR_CMDS)
return NULL;
} while(test_and_set_bit(i%32, h->cmd_pool_bits+(i/32)) != 0)
It's fast, works well, and has needed very little maintenance over the
years. Without knowing what you have in mind specifically, I don't see a
big need to change this.
-- steve
On Mon, 2 Mar 2009 08:56:50 -0600
[email protected] wrote:
> [...]
> > > + .this_id = -1,
> > > + .sg_tablesize = MAXSGENTRIES,
> >
> > MAXSGENTRIES (32) is the limitation of hardware? If not, it might be
> > better to enlarge this for better performance?
>
> Yes, definitely, though this value varies from controller to controller,
> so this is just a default value that needs to be overridden, probably
> in hpsa_scsi_detect().
I see. If we override this in hpsa_scsi_detect(), we need a trick for
SG in CommandList_struct, I guess.
> [...]
> > > + .cmd_per_lun = 512,
> > > + .use_clustering = DISABLE_CLUSTERING,
> >
> > Why can we use ENABLE_CLUSTERING here? We would get the better
> > performance with ENABLE_CLUSTERING.
>
> Yes, we should do that. BTW, the comments in include/linux/scsi_host.h
> don't do a very good job of describing exactly what use_clustering is for,
> they say:
> /*
> * True if this host adapter can make good use of clustering.
> * I originally thought that if the tablesize was large that it
> * was a waste of CPU cycles to prepare a cluster list, but
> * it works out that the Buslogic is faster if you use a smaller
> * number of segments (i.e. use clustering). I guess it is
> * inefficient.
> */
>
> It never actually tells you what is meant by "clustering"
Yeah, looks like it needs a fix.
> > > + use_sg = scsi_dma_map(cmd);
> > > + if (!use_sg)
> > > + goto sglist_finished;
> >
> > We need to handle dma mapping failure here; scsi_dma_map could fail.
>
> Grepping around a bit in drivers scsi I see some drivers do this:
>
> SCpnt->result = DID_ERROR << 16;
>
> then call the scsi done function,
>
> some drivers call BUG_ON() when scsi_dma_map() returns -1,
> and some do nothing.
These drivers are bad. Well, in ancient times dma_map_sg never failed
on X86. So BUG_ON or ignoring is acceptable for drivers for ancient
systems.
But nowadays dma_map_sg can fail (e.g. with Intel VT-D IOMMU).
> I'm guessing setting result = DID_ERROR << 16 and calling
> the done function is the way to go, right?
Not. It's a temporary error, kinda out-of-memory. So we want to retry.
returning SCSI_MLQUEUE_HOST_BUSY is appropriate here.
> > We really need this? Creating something under /proc is not good. Using
> > /sys/class/scsi_host/ is the proper way. If we remove the overlap
> > between hpsa and cciss, we can do the proper way, I think.
>
> We can take it out. We figured we'd take it out when
> someone complained, which we figured would probably
> happen pretty much immediately.
I see, please drop this. This is an issue that we need to take care
about before mainline merging.
> > > + * For operations that cannot sleep, a command block is allocated at init,
> > > + * and managed by cmd_alloc() and cmd_free() using a simple bitmap to track
> > > + * which ones are free or in use. Lock must be held when calling this.
> > > + * cmd_free() is the complement.
> > > + */
> > > +static struct CommandList_struct *cmd_alloc(struct ctlr_info *h)
> > > +{
> > > + struct CommandList_struct *c;
> > > + int i;
> > > + union u64bit temp64;
> > > + dma_addr_t cmd_dma_handle, err_dma_handle;
> > > +
> > > + do {
> > > + i = find_first_zero_bit(h->cmd_pool_bits, h->nr_cmds);
> > > + if (i == h->nr_cmds)
> > > + return NULL;
> > > + } while (test_and_set_bit
> > > + (i & (BITS_PER_LONG - 1),
> > > + h->cmd_pool_bits + (i / BITS_PER_LONG)) != 0);
> >
> > Using bitmap to manage free commands looks too complicated a bit to
> > me. Can we just use lists for command management?
>
> Hmm, this doesn't seem all that complicated to me, and this code snippet
> has been pretty stable for about 10 years. it's nearly identical to what's in
> cpqarray in the 2.2.13 kernel from 1999:
>
> do {
> i = find_first_zero_bit(h->cmd_pool_bits, NR_CMDS);
> if (i == NR_CMDS)
> return NULL;
> } while(test_and_set_bit(i%32, h->cmd_pool_bits+(i/32)) != 0)
>
> It's fast, works well, and has needed very little maintenance over the
> years. Without knowing what you have in mind specifically, I don't see a
> big need to change this.
I see. Seems that some drivers want something similar. I might come
back later on with a patch to replace this with library
functions.
On Tue, Mar 03, 2009 at 03:35:26PM +0900, FUJITA Tomonori wrote:
> On Mon, 2 Mar 2009 08:56:50 -0600
> [email protected] wrote:
>
> > [...]
> > > > + .this_id = -1,
> > > > + .sg_tablesize = MAXSGENTRIES,
> > >
> > > MAXSGENTRIES (32) is the limitation of hardware? If not, it might be
> > > better to enlarge this for better performance?
> >
> > Yes, definitely, though this value varies from controller to controller,
> > so this is just a default value that needs to be overridden, probably
> > in hpsa_scsi_detect().
>
> I see. If we override this in hpsa_scsi_detect(), we need a trick for
> SG in CommandList_struct, I guess.
Yes. There are some limits to what can be put into CommandList_struct
directly, but there is also scatter gather chaining, in which we use
the last element in the CommandList_struct to point to another buffer
of SG entries.
If you have a system with a lot of controllers, having a large number of
scatter gathers can be a bit of a memory hog, and since this memory is all
via pci_alloc_consistent, that can be a concern. It would be nice if
there was a way for the user to specify differing amounts of scatter
gathers for different controller instances so for instance the controller
which he's running his big oracle database, or webserver or whatever on
gets lots, while the controller he's booted from that's mostly idle
gets not so many. I don't know what a good way for a user to identify
what controller he's talking about in a module parameter would be
though. Maybe by pci domain/bus/device/function? Maybe something along
the lines of:
modprobe hpsa dev1=0:0e:00.0 sg1=1000 dev2=0:0b:00.0 sg2=31
to say that one controller gets 1000 scatter gather elements, but
another gets only 31. But PCI busses can change if hardware
configuration changes, and this isn't exactly obvious, so seems less
than ideal. Any bright ideas on that front?
We have some specialized versions of cciss around that have variable
sized SG arrays in CommandList_struct as well as doing scatter gather
chaining (not to be confused with the scatter gather chaining concept
in the scsi mid layer.)
[...snip...]
>
>
> > > > + * For operations that cannot sleep, a command block is allocated at init,
> > > > + * and managed by cmd_alloc() and cmd_free() using a simple bitmap to track
> > > > + * which ones are free or in use. Lock must be held when calling this.
> > > > + * cmd_free() is the complement.
> > > > + */
> > > > +static struct CommandList_struct *cmd_alloc(struct ctlr_info *h)
> > > > +{
> > > > + struct CommandList_struct *c;
> > > > + int i;
> > > > + union u64bit temp64;
> > > > + dma_addr_t cmd_dma_handle, err_dma_handle;
> > > > +
> > > > + do {
> > > > + i = find_first_zero_bit(h->cmd_pool_bits, h->nr_cmds);
> > > > + if (i == h->nr_cmds)
> > > > + return NULL;
> > > > + } while (test_and_set_bit
> > > > + (i & (BITS_PER_LONG - 1),
> > > > + h->cmd_pool_bits + (i / BITS_PER_LONG)) != 0);
> > >
> > > Using bitmap to manage free commands looks too complicated a bit to
> > > me. Can we just use lists for command management?
> >
> > Hmm, this doesn't seem all that complicated to me, and this code snippet
> > has been pretty stable for about 10 years. it's nearly identical to what's in
> > cpqarray in the 2.2.13 kernel from 1999:
> >
> > do {
> > i = find_first_zero_bit(h->cmd_pool_bits, NR_CMDS);
> > if (i == NR_CMDS)
> > return NULL;
> > } while(test_and_set_bit(i%32, h->cmd_pool_bits+(i/32)) != 0)
> >
> > It's fast, works well, and has needed very little maintenance over the
> > years. Without knowing what you have in mind specifically, I don't see a
> > big need to change this.
>
> I see. Seems that some drivers want something similar. I might come
> back later on with a patch to replace this with library
> functions.
There was some other discussion about pushing this sort of thing to
upper layers, using a tag generated in the scsi layer as a means
of allocating driver command buffers, since, presumably there's a
one to one mapping. (I didn't completely grok it all though.)
-- steve
[email protected] wrote:
> [...]
>>> +/*
>>> + * For operations that cannot sleep, a command block is allocated at init,
>>> + * and managed by cmd_alloc() and cmd_free() using a simple bitmap to track
>>> + * which ones are free or in use. Lock must be held when calling this.
>>> + * cmd_free() is the complement.
>>> + */
>>> +static struct CommandList_struct *cmd_alloc(struct ctlr_info *h)
>>> +{
>>> + struct CommandList_struct *c;
>>> + int i;
>>> + union u64bit temp64;
>>> + dma_addr_t cmd_dma_handle, err_dma_handle;
>>> +
>>> + do {
>>> + i = find_first_zero_bit(h->cmd_pool_bits, h->nr_cmds);
>>> + if (i == h->nr_cmds)
>>> + return NULL;
>>> + } while (test_and_set_bit
>>> + (i & (BITS_PER_LONG - 1),
>>> + h->cmd_pool_bits + (i / BITS_PER_LONG)) != 0);
>> Using bitmap to manage free commands looks too complicated a bit to
>> me. Can we just use lists for command management?
>
> Hmm, this doesn't seem all that complicated to me, and this code snippet
> has been pretty stable for about 10 years. it's nearly identical to what's in
> cpqarray in the 2.2.13 kernel from 1999:
>
> do {
> i = find_first_zero_bit(h->cmd_pool_bits, NR_CMDS);
> if (i == NR_CMDS)
> return NULL;
> } while(test_and_set_bit(i%32, h->cmd_pool_bits+(i/32)) != 0)
>
> It's fast, works well, and has needed very little maintenance over the
> years. Without knowing what you have in mind specifically, I don't see a
> big need to change this.
>
Other drivers have had to convert to and modify the host tagging to get
merged. They too had stable and fast code, and we complained and fought
against changing it :) I have had to convert or help convert libfc and
qla4xxx and I will now convert iscsi, so I feel your pain :)
To create the map call scsi_init_shared_tag_map after you allocate the
scsi host and before you add it. Then in slave_alloc set the
sdev->tag_supported = 1 and call scsi_activate_tcq. Then replace your
bitmap with:
for scsi commands:
c = h->cmd_pool + scsi_cmnd->request->tag
And for the reset path I was thinking you could use my patch in the
other mail and do
tag = blk_map_start_tag(scsi_host->bqt, NULL, 0);
if (tag < 0)
goto fail;
c = h->cmd_pool + tag;
c->cmdindex = tag;
In the completion path you need to do a blk_map_end_tag(scsi_host->bqt,
c->cmdindex);. The scsi/block layer will free the tag for scsi commadns.
On Tue, Mar 03, 2009 at 10:49:15AM -0600, Mike Christie wrote:
> [email protected] wrote:
[...]
> >It's fast, works well, and has needed very little maintenance over the
> >years. Without knowing what you have in mind specifically, I don't see a
> >big need to change this.
> >
>
> Other drivers have had to convert to and modify the host tagging to get
> merged. They too had stable and fast code, and we complained and fought
> against changing it :) I have had to convert or help convert libfc and
> qla4xxx and I will now convert iscsi, so I feel your pain :)
I wasn't complaining, or even resisting actually. What I was replying to
there suggested we use "lists", with no more detail than that specified,
and not anything like what you describe below, so it wasn't clear to me
that anything concrete was being proposed instead of what we had.
Given what was written, it seemed to be just complaining about
some code that looked a little bit complicated. So *that* I was
resisting a bit, or at least pushing for some justification, but if
there's an already established way to share the command allocation logic
between the scsi layer and low level driver as you describe I've got no
problem with that.
>
> To create the map call scsi_init_shared_tag_map after you allocate the
> scsi host and before you add it. Then in slave_alloc set the
> sdev->tag_supported = 1 and call scsi_activate_tcq. Then replace your
> bitmap with:
>
> for scsi commands:
>
> c = h->cmd_pool + scsi_cmnd->request->tag
>
> And for the reset path I was thinking you could use my patch in the
> other mail and do
>
> tag = blk_map_start_tag(scsi_host->bqt, NULL, 0);
> if (tag < 0)
> goto fail;
> c = h->cmd_pool + tag;
> c->cmdindex = tag;
>
> In the completion path you need to do a blk_map_end_tag(scsi_host->bqt,
> c->cmdindex);. The scsi/block layer will free the tag for scsi commadns.
That makes sense. Thanks.
-- steve
On Tue, 3 Mar 2009 10:28:21 -0600
[email protected] wrote:
> On Tue, Mar 03, 2009 at 03:35:26PM +0900, FUJITA Tomonori wrote:
> > On Mon, 2 Mar 2009 08:56:50 -0600
> > [email protected] wrote:
> >
> > > [...]
> > > > > + .this_id = -1,
> > > > > + .sg_tablesize = MAXSGENTRIES,
> > > >
> > > > MAXSGENTRIES (32) is the limitation of hardware? If not, it might be
> > > > better to enlarge this for better performance?
> > >
> > > Yes, definitely, though this value varies from controller to controller,
> > > so this is just a default value that needs to be overridden, probably
> > > in hpsa_scsi_detect().
> >
> > I see. If we override this in hpsa_scsi_detect(), we need a trick for
> > SG in CommandList_struct, I guess.
>
> Yes. There are some limits to what can be put into CommandList_struct
> directly, but there is also scatter gather chaining, in which we use
> the last element in the CommandList_struct to point to another buffer
> of SG entries.
>
> If you have a system with a lot of controllers, having a large number of
> scatter gathers can be a bit of a memory hog, and since this memory is all
> via pci_alloc_consistent, that can be a concern. It would be nice if
> there was a way for the user to specify differing amounts of scatter
> gathers for different controller instances so for instance the controller
> which he's running his big oracle database, or webserver or whatever on
> gets lots, while the controller he's booted from that's mostly idle
> gets not so many. I don't know what a good way for a user to identify
> what controller he's talking about in a module parameter would be
> though. Maybe by pci domain/bus/device/function? Maybe something along
> the lines of:
>
> modprobe hpsa dev1=0:0e:00.0 sg1=1000 dev2=0:0b:00.0 sg2=31
>
> to say that one controller gets 1000 scatter gather elements, but
> another gets only 31. But PCI busses can change if hardware
> configuration changes, and this isn't exactly obvious, so seems less
> than ideal. Any bright ideas on that front?
We have /sys/class/scsi_host/host*/sg_tablesize:
How about modifying this value on the fly?
fujita@clover:/sys/class/scsi_host/host3$ echo 1000 > sg_tablesize
Well, this needs more changes (to both the block layer and the scsi
mid layer) but is it nice to change this value dynamically?
Anyway, I think that it's better to address this fancy feature later
on (after the mainline inclusion). Let's put hpsa driver into mainline
first.
> > > Hmm, this doesn't seem all that complicated to me, and this code snippet
> > > has been pretty stable for about 10 years. it's nearly identical to what's in
> > > cpqarray in the 2.2.13 kernel from 1999:
> > >
> > > do {
> > > i = find_first_zero_bit(h->cmd_pool_bits, NR_CMDS);
> > > if (i == NR_CMDS)
> > > return NULL;
> > > } while(test_and_set_bit(i%32, h->cmd_pool_bits+(i/32)) != 0)
> > >
> > > It's fast, works well, and has needed very little maintenance over the
> > > years. Without knowing what you have in mind specifically, I don't see a
> > > big need to change this.
> >
> > I see. Seems that some drivers want something similar. I might come
> > back later on with a patch to replace this with library
> > functions.
>
> There was some other discussion about pushing this sort of thing to
> upper layers, using a tag generated in the scsi layer as a means
> of allocating driver command buffers, since, presumably there's a
> one to one mapping. (I didn't completely grok it all though.)
Oops, I meant that I might come back with a patch to convert hpsa to
use the the block layer tagging, which you and Mike Christie are
talking about (yeah, my first suggestion to use lists was wrong. using
the block layer tagging looks much better).
By the way, have you guys started to work on the review comments for
the next submission? The driver has some minor style issues that have
not been mentioned yet. For example, the comment style in the driver
is not preferred:
/* If this device a non-zero lun of a multi-lun device */
/* byte 4 of the 8-byte LUN addr will contain the logical */
/* unit no, zero otherise. */
The preferred style is:
/*
* If this device a non-zero lun of a multi-lun device
* byte 4 of the 8-byte LUN addr will contain the logical
* unit no, zero otherise.
*/
Another example, I think that the SCSI-ml preferred style is (not
documented in CodingStyle though):
'if (!ptr)' rather than 'if (ptr == NULL)'
'if (!value)' rather than 'if (value == 0)'
'if (ptr)' rather than 'if (ptr != NULL)'
'if (value)' rather than 'if (value != 0)'
If you are already addressing the review comments, I just wait for the
next submission, then I'll send such minor patches. If you are not,
I'll send patches to address the review comments (including such minor
patches).
On Thu, Mar 05, 2009 at 02:48:09PM +0900, FUJITA Tomonori wrote:
> On Tue, 3 Mar 2009 10:28:21 -0600
> [email protected] wrote:
>
> > On Tue, Mar 03, 2009 at 03:35:26PM +0900, FUJITA Tomonori wrote:
> > > On Mon, 2 Mar 2009 08:56:50 -0600
> > > [email protected] wrote:
> > >
> > > > [...]
> > > > > > + .this_id = -1,
> > > > > > + .sg_tablesize = MAXSGENTRIES,
> > > > >
> > > > > MAXSGENTRIES (32) is the limitation of hardware? If not, it might be
> > > > > better to enlarge this for better performance?
> > > >
> > > > Yes, definitely, though this value varies from controller to controller,
> > > > so this is just a default value that needs to be overridden, probably
> > > > in hpsa_scsi_detect().
> > >
> > > I see. If we override this in hpsa_scsi_detect(), we need a trick for
> > > SG in CommandList_struct, I guess.
> >
> > Yes. There are some limits to what can be put into CommandList_struct
> > directly, but there is also scatter gather chaining, in which we use
> > the last element in the CommandList_struct to point to another buffer
> > of SG entries.
> >
> > If you have a system with a lot of controllers, having a large number of
> > scatter gathers can be a bit of a memory hog, and since this memory is all
> > via pci_alloc_consistent, that can be a concern. It would be nice if
> > there was a way for the user to specify differing amounts of scatter
> > gathers for different controller instances so for instance the controller
> > which he's running his big oracle database, or webserver or whatever on
> > gets lots, while the controller he's booted from that's mostly idle
> > gets not so many. I don't know what a good way for a user to identify
> > what controller he's talking about in a module parameter would be
> > though. Maybe by pci domain/bus/device/function? Maybe something along
> > the lines of:
> >
> > modprobe hpsa dev1=0:0e:00.0 sg1=1000 dev2=0:0b:00.0 sg2=31
> >
> > to say that one controller gets 1000 scatter gather elements, but
> > another gets only 31. But PCI busses can change if hardware
> > configuration changes, and this isn't exactly obvious, so seems less
> > than ideal. Any bright ideas on that front?
>
> We have /sys/class/scsi_host/host*/sg_tablesize:
>
> How about modifying this value on the fly?
>
> fujita@clover:/sys/class/scsi_host/host3$ echo 1000 > sg_tablesize
>
We pci_alloc_consistent that space, so... I think that would mean
we'd have to do things considerably differently. I think we'd have
to quit allocating commands in big chunks, and instead of indexing
into that chunk we'd probably have to have an array of pointers or
something. If we wanted sg_tablesize adjustable down to single
command counts, we'd probably have to allocate each command separately
and have an array of pointers to those...
e.g. if you did
echo 1000 > sg_tablesize
echo 999 > sg_tablesize
you probably wouldn't want to keep the 1000 commands around,
and then allocate 999 additional, then let all the outstanding
commands using the first 1000 block complete, then finally free
the first block of 1000, leaving just the 999. You'd probably want
instead to free one of the 1000 to get to 999.
Likewise with this:
echo 999 > sg_tablesize
echo 1000 > sg_tablesize
These are somewhat pathological cases, granted.
I'm not sure dynamically modifying the number of SGs a controller
can do is something that comes up enough to be worth implementing
something so complicated.
If it's settable at init time, that would probably be enough for
the vast majority of uses (and more flexible than what we have now)
and a lot easier to implement.
>
> Well, this needs more changes (to both the block layer and the scsi
> mid layer) but is it nice to change this value dynamically?
>
> Anyway, I think that it's better to address this fancy feature later
> on (after the mainline inclusion). Let's put hpsa driver into mainline
> first.
Agreed, we can think about all that stuff later.
Another fancy feature to think about later which would be nice:
On Smart arrays you can expand logical drives on the fly by
adding physical disks, or portions of physical disks into them.
Would be nice if there was a non-i/o-interrupting way to notify
the scsi layer of this new space (maybe there already is?) so
that if there's, say, a filesystem which can also dynamically
grow on the fly on that embiggened logical drive, it can take
advantage of that extra space.
Right now, the driver will do scsi_remove_device() and then
scsi_add_device() if a logical drive changes size, which isn't
very nice.
>
>
> > > > Hmm, this doesn't seem all that complicated to me, and this code snippet
> > > > has been pretty stable for about 10 years. it's nearly identical to what's in
> > > > cpqarray in the 2.2.13 kernel from 1999:
> > > >
> > > > do {
> > > > i = find_first_zero_bit(h->cmd_pool_bits, NR_CMDS);
> > > > if (i == NR_CMDS)
> > > > return NULL;
> > > > } while(test_and_set_bit(i%32, h->cmd_pool_bits+(i/32)) != 0)
> > > >
> > > > It's fast, works well, and has needed very little maintenance over the
> > > > years. Without knowing what you have in mind specifically, I don't see a
> > > > big need to change this.
> > >
> > > I see. Seems that some drivers want something similar. I might come
> > > back later on with a patch to replace this with library
> > > functions.
> >
> > There was some other discussion about pushing this sort of thing to
> > upper layers, using a tag generated in the scsi layer as a means
> > of allocating driver command buffers, since, presumably there's a
> > one to one mapping. (I didn't completely grok it all though.)
>
> Oops, I meant that I might come back with a patch to convert hpsa to
> use the the block layer tagging, which you and Mike Christie are
> talking about (yeah, my first suggestion to use lists was wrong. using
> the block layer tagging looks much better).
>
>
> By the way, have you guys started to work on the review comments for
We haven't really done much. It's obvious that there's a lot to do
based on the comments, and it's also obvious how to do most of it,
and not hard, (e.g. ripping out /proc stuff, etc.), there's just a
lot of other non-kernel related work keeping us busy at the moment.
> the next submission? The driver has some minor style issues that have
> not been mentioned yet. For example, the comment style in the driver
> is not preferred:
>
> /* If this device a non-zero lun of a multi-lun device */
> /* byte 4 of the 8-byte LUN addr will contain the logical */
> /* unit no, zero otherise. */
>
> The preferred style is:
>
> /*
> * If this device a non-zero lun of a multi-lun device
> * byte 4 of the 8-byte LUN addr will contain the logical
> * unit no, zero otherise.
> */
ok.
>
> Another example, I think that the SCSI-ml preferred style is (not
> documented in CodingStyle though):
>
> 'if (!ptr)' rather than 'if (ptr == NULL)'
> 'if (!value)' rather than 'if (value == 0)'
> 'if (ptr)' rather than 'if (ptr != NULL)'
> 'if (value)' rather than 'if (value != 0)'
Ok.
>
>
> If you are already addressing the review comments, I just wait for the
> next submission, then I'll send such minor patches. If you are not,
> I'll send patches to address the review comments (including such minor
> patches).
Ok, thanks.
-- steve
>
> We have /sys/class/scsi_host/host*/sg_tablesize:
>
> How about modifying this value on the fly?
>
> fujita@clover:/sys/class/scsi_host/host3$ echo 1000 > sg_tablesize
>
>
> Well, this needs more changes (to both the block layer and
> the scsi mid layer) but is it nice to change this value dynamically?
>
> Anyway, I think that it's better to address this fancy
> feature later on (after the mainline inclusion). Let's put
> hpsa driver into mainline first.
>
>
> > > > Hmm, this doesn't seem all that complicated to me, and
> this code
> > > > snippet has been pretty stable for about 10 years. it's nearly
> > > > identical to what's in cpqarray in the 2.2.13 kernel from 1999:
> > > >
> > > > do {
> > > > i =
> find_first_zero_bit(h->cmd_pool_bits, NR_CMDS);
> > > > if (i == NR_CMDS)
> > > > return NULL;
> > > > } while(test_and_set_bit(i%32,
> > > > h->cmd_pool_bits+(i/32)) != 0)
> > > >
> > > > It's fast, works well, and has needed very little
> maintenance over
> > > > the years. Without knowing what you have in mind
> specifically, I
> > > > don't see a big need to change this.
> > >
> > > I see. Seems that some drivers want something similar. I
> might come
> > > back later on with a patch to replace this with library functions.
> >
> > There was some other discussion about pushing this sort of thing to
> > upper layers, using a tag generated in the scsi layer as a means of
> > allocating driver command buffers, since, presumably
> there's a one to
> > one mapping. (I didn't completely grok it all though.)
>
> Oops, I meant that I might come back with a patch to convert
> hpsa to use the the block layer tagging, which you and Mike
> Christie are talking about (yeah, my first suggestion to use
> lists was wrong. using the block layer tagging looks much better).
>
>
> By the way, have you guys started to work on the review
> comments for the next submission? The driver has some minor
> style issues that have not been mentioned yet. For example,
> the comment style in the driver is not preferred:
>
> /* If this device a non-zero lun of a multi-lun device */
> /* byte 4 of the 8-byte LUN addr will contain the logical */
> /* unit no, zero otherise. */
>
> The preferred style is:
>
> /*
> * If this device a non-zero lun of a multi-lun device
> * byte 4 of the 8-byte LUN addr will contain the logical
> * unit no, zero otherise.
> */
>
> Another example, I think that the SCSI-ml preferred style is
> (not documented in CodingStyle though):
>
> 'if (!ptr)' rather than 'if (ptr == NULL)'
> 'if (!value)' rather than 'if (value == 0)'
> 'if (ptr)' rather than 'if (ptr != NULL)'
> 'if (value)' rather than 'if (value != 0)'
>
>
> If you are already addressing the review comments, I just
> wait for the next submission, then I'll send such minor
> patches. If you are not, I'll send patches to address the
> review comments (including such minor patches).
>
We're working on the review comments. Right we're trying to get caught up with our "day jobs."
-- mikem-
On Thu, 2009-03-05 at 08:21 -0600, [email protected]
wrote:
> On Thu, Mar 05, 2009 at 02:48:09PM +0900, FUJITA Tomonori wrote:
> > On Tue, 3 Mar 2009 10:28:21 -0600
> > [email protected] wrote:
> >
> > > On Tue, Mar 03, 2009 at 03:35:26PM +0900, FUJITA Tomonori wrote:
> > > > On Mon, 2 Mar 2009 08:56:50 -0600
> > > > [email protected] wrote:
> > > >
> > > > > [...]
> > > > > > > + .this_id = -1,
> > > > > > > + .sg_tablesize = MAXSGENTRIES,
> > > > > >
> > > > > > MAXSGENTRIES (32) is the limitation of hardware? If not, it might be
> > > > > > better to enlarge this for better performance?
> > > > >
> > > > > Yes, definitely, though this value varies from controller to controller,
> > > > > so this is just a default value that needs to be overridden, probably
> > > > > in hpsa_scsi_detect().
> > > >
> > > > I see. If we override this in hpsa_scsi_detect(), we need a trick for
> > > > SG in CommandList_struct, I guess.
> > >
> > > Yes. There are some limits to what can be put into CommandList_struct
> > > directly, but there is also scatter gather chaining, in which we use
> > > the last element in the CommandList_struct to point to another buffer
> > > of SG entries.
> > >
> > > If you have a system with a lot of controllers, having a large number of
> > > scatter gathers can be a bit of a memory hog, and since this memory is all
> > > via pci_alloc_consistent, that can be a concern. It would be nice if
> > > there was a way for the user to specify differing amounts of scatter
> > > gathers for different controller instances so for instance the controller
> > > which he's running his big oracle database, or webserver or whatever on
> > > gets lots, while the controller he's booted from that's mostly idle
> > > gets not so many. I don't know what a good way for a user to identify
> > > what controller he's talking about in a module parameter would be
> > > though. Maybe by pci domain/bus/device/function? Maybe something along
> > > the lines of:
> > >
> > > modprobe hpsa dev1=0:0e:00.0 sg1=1000 dev2=0:0b:00.0 sg2=31
> > >
> > > to say that one controller gets 1000 scatter gather elements, but
> > > another gets only 31. But PCI busses can change if hardware
> > > configuration changes, and this isn't exactly obvious, so seems less
> > > than ideal. Any bright ideas on that front?
> >
> > We have /sys/class/scsi_host/host*/sg_tablesize:
> >
> > How about modifying this value on the fly?
> >
> > fujita@clover:/sys/class/scsi_host/host3$ echo 1000 > sg_tablesize
> >
>
> We pci_alloc_consistent that space, so... I think that would mean
> we'd have to do things considerably differently. I think we'd have
> to quit allocating commands in big chunks, and instead of indexing
> into that chunk we'd probably have to have an array of pointers or
> something. If we wanted sg_tablesize adjustable down to single
> command counts, we'd probably have to allocate each command separately
> and have an array of pointers to those...
>
> e.g. if you did
>
> echo 1000 > sg_tablesize
> echo 999 > sg_tablesize
>
> you probably wouldn't want to keep the 1000 commands around,
> and then allocate 999 additional, then let all the outstanding
> commands using the first 1000 block complete, then finally free
> the first block of 1000, leaving just the 999. You'd probably want
> instead to free one of the 1000 to get to 999.
>
> Likewise with this:
>
> echo 999 > sg_tablesize
> echo 1000 > sg_tablesize
>
> These are somewhat pathological cases, granted.
>
> I'm not sure dynamically modifying the number of SGs a controller
> can do is something that comes up enough to be worth implementing
> something so complicated.
>
> If it's settable at init time, that would probably be enough for
> the vast majority of uses (and more flexible than what we have now)
> and a lot easier to implement.
>
> >
> > Well, this needs more changes (to both the block layer and the scsi
> > mid layer) but is it nice to change this value dynamically?
> >
> > Anyway, I think that it's better to address this fancy feature later
> > on (after the mainline inclusion). Let's put hpsa driver into mainline
> > first.
>
> Agreed, we can think about all that stuff later.
>
> Another fancy feature to think about later which would be nice:
>
> On Smart arrays you can expand logical drives on the fly by
> adding physical disks, or portions of physical disks into them.
> Would be nice if there was a non-i/o-interrupting way to notify
> the scsi layer of this new space (maybe there already is?) so
> that if there's, say, a filesystem which can also dynamically
> grow on the fly on that embiggened logical drive, it can take
> advantage of that extra space.
>
> Right now, the driver will do scsi_remove_device() and then
> scsi_add_device() if a logical drive changes size, which isn't
> very nice.
>
The following might help:
There are several ways to "kick off" a device size change:
1. For SCSI devices do:
# echo 1 > /sys/class/scsi_device/<device>/device/rescan
or
# blockdev --rereadpt <device file>
2. Other devices (not device mapper)
# blockdev --rereadpt <device file>
See http://marc.info/?l=linux-kernel&m=122056065131792&w=2
Andrew
> >
> >
> > > > > Hmm, this doesn't seem all that complicated to me, and this code snippet
> > > > > has been pretty stable for about 10 years. it's nearly identical to what's in
> > > > > cpqarray in the 2.2.13 kernel from 1999:
> > > > >
> > > > > do {
> > > > > i = find_first_zero_bit(h->cmd_pool_bits, NR_CMDS);
> > > > > if (i == NR_CMDS)
> > > > > return NULL;
> > > > > } while(test_and_set_bit(i%32, h->cmd_pool_bits+(i/32)) != 0)
> > > > >
> > > > > It's fast, works well, and has needed very little maintenance over the
> > > > > years. Without knowing what you have in mind specifically, I don't see a
> > > > > big need to change this.
> > > >
> > > > I see. Seems that some drivers want something similar. I might come
> > > > back later on with a patch to replace this with library
> > > > functions.
> > >
> > > There was some other discussion about pushing this sort of thing to
> > > upper layers, using a tag generated in the scsi layer as a means
> > > of allocating driver command buffers, since, presumably there's a
> > > one to one mapping. (I didn't completely grok it all though.)
> >
> > Oops, I meant that I might come back with a patch to convert hpsa to
> > use the the block layer tagging, which you and Mike Christie are
> > talking about (yeah, my first suggestion to use lists was wrong. using
> > the block layer tagging looks much better).
> >
> >
> > By the way, have you guys started to work on the review comments for
>
> We haven't really done much. It's obvious that there's a lot to do
> based on the comments, and it's also obvious how to do most of it,
> and not hard, (e.g. ripping out /proc stuff, etc.), there's just a
> lot of other non-kernel related work keeping us busy at the moment.
>
> > the next submission? The driver has some minor style issues that have
> > not been mentioned yet. For example, the comment style in the driver
> > is not preferred:
> >
> > /* If this device a non-zero lun of a multi-lun device */
> > /* byte 4 of the 8-byte LUN addr will contain the logical */
> > /* unit no, zero otherise. */
> >
> > The preferred style is:
> >
> > /*
> > * If this device a non-zero lun of a multi-lun device
> > * byte 4 of the 8-byte LUN addr will contain the logical
> > * unit no, zero otherise.
> > */
>
> ok.
>
> >
> > Another example, I think that the SCSI-ml preferred style is (not
> > documented in CodingStyle though):
> >
> > 'if (!ptr)' rather than 'if (ptr == NULL)'
> > 'if (!value)' rather than 'if (value == 0)'
> > 'if (ptr)' rather than 'if (ptr != NULL)'
> > 'if (value)' rather than 'if (value != 0)'
>
> Ok.
>
> >
> >
> > If you are already addressing the review comments, I just wait for the
> > next submission, then I'll send such minor patches. If you are not,
> > I'll send patches to address the review comments (including such minor
> > patches).
>
> Ok, thanks.
>
> -- steve
>
> --
> To unsubscribe from this list: send the line "unsubscribe linux-scsi" in
> the body of a message to [email protected]
> More majordomo info at http://vger.kernel.org/majordomo-info.html
>
On Thu, Mar 05 2009, [email protected] wrote:
> On Thu, Mar 05, 2009 at 02:48:09PM +0900, FUJITA Tomonori wrote:
> > On Tue, 3 Mar 2009 10:28:21 -0600
> > [email protected] wrote:
> >
> > > On Tue, Mar 03, 2009 at 03:35:26PM +0900, FUJITA Tomonori wrote:
> > > > On Mon, 2 Mar 2009 08:56:50 -0600
> > > > [email protected] wrote:
> > > >
> > > > > [...]
> > > > > > > + .this_id = -1,
> > > > > > > + .sg_tablesize = MAXSGENTRIES,
> > > > > >
> > > > > > MAXSGENTRIES (32) is the limitation of hardware? If not, it might be
> > > > > > better to enlarge this for better performance?
> > > > >
> > > > > Yes, definitely, though this value varies from controller to controller,
> > > > > so this is just a default value that needs to be overridden, probably
> > > > > in hpsa_scsi_detect().
> > > >
> > > > I see. If we override this in hpsa_scsi_detect(), we need a trick for
> > > > SG in CommandList_struct, I guess.
> > >
> > > Yes. There are some limits to what can be put into CommandList_struct
> > > directly, but there is also scatter gather chaining, in which we use
> > > the last element in the CommandList_struct to point to another buffer
> > > of SG entries.
> > >
> > > If you have a system with a lot of controllers, having a large number of
> > > scatter gathers can be a bit of a memory hog, and since this memory is all
> > > via pci_alloc_consistent, that can be a concern. It would be nice if
> > > there was a way for the user to specify differing amounts of scatter
> > > gathers for different controller instances so for instance the controller
> > > which he's running his big oracle database, or webserver or whatever on
> > > gets lots, while the controller he's booted from that's mostly idle
> > > gets not so many. I don't know what a good way for a user to identify
> > > what controller he's talking about in a module parameter would be
> > > though. Maybe by pci domain/bus/device/function? Maybe something along
> > > the lines of:
> > >
> > > modprobe hpsa dev1=0:0e:00.0 sg1=1000 dev2=0:0b:00.0 sg2=31
> > >
> > > to say that one controller gets 1000 scatter gather elements, but
> > > another gets only 31. But PCI busses can change if hardware
> > > configuration changes, and this isn't exactly obvious, so seems less
> > > than ideal. Any bright ideas on that front?
> >
> > We have /sys/class/scsi_host/host*/sg_tablesize:
> >
> > How about modifying this value on the fly?
> >
> > fujita@clover:/sys/class/scsi_host/host3$ echo 1000 > sg_tablesize
> >
>
> We pci_alloc_consistent that space, so... I think that would mean
> we'd have to do things considerably differently. I think we'd have
> to quit allocating commands in big chunks, and instead of indexing
> into that chunk we'd probably have to have an array of pointers or
> something. If we wanted sg_tablesize adjustable down to single
> command counts, we'd probably have to allocate each command separately
> and have an array of pointers to those...
>
> e.g. if you did
>
> echo 1000 > sg_tablesize
> echo 999 > sg_tablesize
>
> you probably wouldn't want to keep the 1000 commands around,
> and then allocate 999 additional, then let all the outstanding
> commands using the first 1000 block complete, then finally free
> the first block of 1000, leaving just the 999. You'd probably want
> instead to free one of the 1000 to get to 999.
>
> Likewise with this:
>
> echo 999 > sg_tablesize
> echo 1000 > sg_tablesize
>
> These are somewhat pathological cases, granted.
>
> I'm not sure dynamically modifying the number of SGs a controller
> can do is something that comes up enough to be worth implementing
> something so complicated.
>
> If it's settable at init time, that would probably be enough for
> the vast majority of uses (and more flexible than what we have now)
> and a lot easier to implement.
Completely agree, don't waste time implementing something that nobody
will ever touch. The only reason to fiddle with such a setting would be
to increase it, because ios are too small. And even finding out that the
segment limit is the one killing you would take some insight and work
from the user.
Just make it Big Enough to cover most cases. 32 is definitely small, 256
entries would get you 1MB ios which I guess is more appropriate.
--
Jens Axboe
On Fri, 6 Mar 2009 09:55:29 +0100
Jens Axboe <[email protected]> wrote:
> > If it's settable at init time, that would probably be enough for
> > the vast majority of uses (and more flexible than what we have now)
> > and a lot easier to implement.
>
> Completely agree, don't waste time implementing something that nobody
> will ever touch. The only reason to fiddle with such a setting would be
> to increase it, because ios are too small. And even finding out that the
> segment limit is the one killing you would take some insight and work
> from the user.
>
> Just make it Big Enough to cover most cases. 32 is definitely small, 256
> entries would get you 1MB ios which I guess is more appropriate.
I guess that the dynamic scheme is overdoing but seems that vendors
like some way to configure the sg entry size. The new MPT2SAS driver
has SCSI_MPT2SAS_MAX_SGE kernel config option:
http://marc.info/?l=linux-scsi&m=123619290803547&w=2
The kernel module option for this might be appropriate.
On Fri, Mar 06 2009, FUJITA Tomonori wrote:
> On Fri, 6 Mar 2009 09:55:29 +0100
> Jens Axboe <[email protected]> wrote:
>
> > > If it's settable at init time, that would probably be enough for
> > > the vast majority of uses (and more flexible than what we have now)
> > > and a lot easier to implement.
> >
> > Completely agree, don't waste time implementing something that nobody
> > will ever touch. The only reason to fiddle with such a setting would be
> > to increase it, because ios are too small. And even finding out that the
> > segment limit is the one killing you would take some insight and work
> > from the user.
> >
> > Just make it Big Enough to cover most cases. 32 is definitely small, 256
> > entries would get you 1MB ios which I guess is more appropriate.
>
> I guess that the dynamic scheme is overdoing but seems that vendors
> like some way to configure the sg entry size. The new MPT2SAS driver
> has SCSI_MPT2SAS_MAX_SGE kernel config option:
>
> http://marc.info/?l=linux-scsi&m=123619290803547&w=2
>
>
> The kernel module option for this might be appropriate.
Dunno, still seems pretty pointless to me. The config option there
quotes memory consumption as the reason to reduce the number of sg
entries, however I think that's pretty silly. Additionally, a kernel
config entry just means that customers will be stuck with a fixed value
anyway. So I just don't see any merit to doing it that way either.
--
Jens Axboe
On Fri, 6 Mar 2009 10:21:14 +0100
Jens Axboe <[email protected]> wrote:
> On Fri, Mar 06 2009, FUJITA Tomonori wrote:
> > On Fri, 6 Mar 2009 09:55:29 +0100
> > Jens Axboe <[email protected]> wrote:
> >
> > > > If it's settable at init time, that would probably be enough for
> > > > the vast majority of uses (and more flexible than what we have now)
> > > > and a lot easier to implement.
> > >
> > > Completely agree, don't waste time implementing something that nobody
> > > will ever touch. The only reason to fiddle with such a setting would be
> > > to increase it, because ios are too small. And even finding out that the
> > > segment limit is the one killing you would take some insight and work
> > > from the user.
> > >
> > > Just make it Big Enough to cover most cases. 32 is definitely small, 256
> > > entries would get you 1MB ios which I guess is more appropriate.
> >
> > I guess that the dynamic scheme is overdoing but seems that vendors
> > like some way to configure the sg entry size. The new MPT2SAS driver
> > has SCSI_MPT2SAS_MAX_SGE kernel config option:
> >
> > http://marc.info/?l=linux-scsi&m=123619290803547&w=2
> >
> >
> > The kernel module option for this might be appropriate.
>
> Dunno, still seems pretty pointless to me. The config option there
> quotes memory consumption as the reason to reduce the number of sg
> entries, however I think that's pretty silly. Additionally, a kernel
> config entry just means that customers will be stuck with a fixed value
> anyway. So I just don't see any merit to doing it that way either.
Yeah, agreed. the kernel config option is pretty pointless. But I'm
not sure that reducing memory consumption is completely pointless.
On Fri, Mar 06 2009, FUJITA Tomonori wrote:
> On Fri, 6 Mar 2009 10:21:14 +0100
> Jens Axboe <[email protected]> wrote:
>
> > On Fri, Mar 06 2009, FUJITA Tomonori wrote:
> > > On Fri, 6 Mar 2009 09:55:29 +0100
> > > Jens Axboe <[email protected]> wrote:
> > >
> > > > > If it's settable at init time, that would probably be enough for
> > > > > the vast majority of uses (and more flexible than what we have now)
> > > > > and a lot easier to implement.
> > > >
> > > > Completely agree, don't waste time implementing something that nobody
> > > > will ever touch. The only reason to fiddle with such a setting would be
> > > > to increase it, because ios are too small. And even finding out that the
> > > > segment limit is the one killing you would take some insight and work
> > > > from the user.
> > > >
> > > > Just make it Big Enough to cover most cases. 32 is definitely small, 256
> > > > entries would get you 1MB ios which I guess is more appropriate.
> > >
> > > I guess that the dynamic scheme is overdoing but seems that vendors
> > > like some way to configure the sg entry size. The new MPT2SAS driver
> > > has SCSI_MPT2SAS_MAX_SGE kernel config option:
> > >
> > > http://marc.info/?l=linux-scsi&m=123619290803547&w=2
> > >
> > >
> > > The kernel module option for this might be appropriate.
> >
> > Dunno, still seems pretty pointless to me. The config option there
> > quotes memory consumption as the reason to reduce the number of sg
> > entries, however I think that's pretty silly. Additionally, a kernel
> > config entry just means that customers will be stuck with a fixed value
> > anyway. So I just don't see any merit to doing it that way either.
>
> Yeah, agreed. the kernel config option is pretty pointless. But I'm
> not sure that reducing memory consumption is completely pointless.
Agree, depends on how you do it. If you preallocate all the memory
required for 1024 entries times the queue depth, then it may not be that
small. But you can do it a bit more cleverly than that, and then I don't
think it makes a lot of sense to provide any options for shrinking it.
--
Jens Axboe
On Fri, Mar 06, 2009 at 10:35:21AM +0100, Jens Axboe wrote:
> On Fri, Mar 06 2009, FUJITA Tomonori wrote:
> > On Fri, 6 Mar 2009 10:21:14 +0100
> > Jens Axboe <[email protected]> wrote:
> >
> > > On Fri, Mar 06 2009, FUJITA Tomonori wrote:
> > > > On Fri, 6 Mar 2009 09:55:29 +0100
> > > > Jens Axboe <[email protected]> wrote:
> > > >
> > > > > > If it's settable at init time, that would probably be enough for
> > > > > > the vast majority of uses (and more flexible than what we have now)
> > > > > > and a lot easier to implement.
> > > > >
> > > > > Completely agree, don't waste time implementing something that nobody
> > > > > will ever touch. The only reason to fiddle with such a setting would be
> > > > > to increase it, because ios are too small. And even finding out that the
> > > > > segment limit is the one killing you would take some insight and work
> > > > > from the user.
> > > > >
> > > > > Just make it Big Enough to cover most cases. 32 is definitely small, 256
> > > > > entries would get you 1MB ios which I guess is more appropriate.
> > > >
> > > > I guess that the dynamic scheme is overdoing but seems that vendors
> > > > like some way to configure the sg entry size. The new MPT2SAS driver
> > > > has SCSI_MPT2SAS_MAX_SGE kernel config option:
> > > >
> > > > http://marc.info/?l=linux-scsi&m=123619290803547&w=2
> > > >
> > > >
> > > > The kernel module option for this might be appropriate.
> > >
> > > Dunno, still seems pretty pointless to me. The config option there
> > > quotes memory consumption as the reason to reduce the number of sg
> > > entries, however I think that's pretty silly. Additionally, a kernel
> > > config entry just means that customers will be stuck with a fixed value
> > > anyway. So I just don't see any merit to doing it that way either.
> >
> > Yeah, agreed. the kernel config option is pretty pointless. But I'm
> > not sure that reducing memory consumption is completely pointless.
>
> Agree, depends on how you do it. If you preallocate all the memory
> required for 1024 entries times the queue depth, then it may not be that
> small. But you can do it a bit more cleverly than that, and then I don't
> think it makes a lot of sense to provide any options for shrinking it.
The reason I mentioned making the number of SGs configurable is because with
a lot of controllers in the box (say 8, or ridiculous numbers of controllers
are potentially possible on some big ia64 boxes) then the memory available
by way of pci_alloc_consistent can be exhausted, and we have seen that happen.
The command buffers have to be in the first 4GB of memory, as the command
register is only 32 bits, so they are allocated by pci_alloc_consistent.
However, the chained SG lists don't have that limitation, so I think they
can be kmalloc'ed, and so not chew up and unreasonable amount of the
pci_alloc_consistent memory and get a larger number of SGs. ...right?
Maybe that's the better way to do it.
-- steve
On Fri, Mar 06 2009, [email protected] wrote:
> On Fri, Mar 06, 2009 at 10:35:21AM +0100, Jens Axboe wrote:
> > On Fri, Mar 06 2009, FUJITA Tomonori wrote:
> > > On Fri, 6 Mar 2009 10:21:14 +0100
> > > Jens Axboe <[email protected]> wrote:
> > >
> > > > On Fri, Mar 06 2009, FUJITA Tomonori wrote:
> > > > > On Fri, 6 Mar 2009 09:55:29 +0100
> > > > > Jens Axboe <[email protected]> wrote:
> > > > >
> > > > > > > If it's settable at init time, that would probably be enough for
> > > > > > > the vast majority of uses (and more flexible than what we have now)
> > > > > > > and a lot easier to implement.
> > > > > >
> > > > > > Completely agree, don't waste time implementing something that nobody
> > > > > > will ever touch. The only reason to fiddle with such a setting would be
> > > > > > to increase it, because ios are too small. And even finding out that the
> > > > > > segment limit is the one killing you would take some insight and work
> > > > > > from the user.
> > > > > >
> > > > > > Just make it Big Enough to cover most cases. 32 is definitely small, 256
> > > > > > entries would get you 1MB ios which I guess is more appropriate.
> > > > >
> > > > > I guess that the dynamic scheme is overdoing but seems that vendors
> > > > > like some way to configure the sg entry size. The new MPT2SAS driver
> > > > > has SCSI_MPT2SAS_MAX_SGE kernel config option:
> > > > >
> > > > > http://marc.info/?l=linux-scsi&m=123619290803547&w=2
> > > > >
> > > > >
> > > > > The kernel module option for this might be appropriate.
> > > >
> > > > Dunno, still seems pretty pointless to me. The config option there
> > > > quotes memory consumption as the reason to reduce the number of sg
> > > > entries, however I think that's pretty silly. Additionally, a kernel
> > > > config entry just means that customers will be stuck with a fixed value
> > > > anyway. So I just don't see any merit to doing it that way either.
> > >
> > > Yeah, agreed. the kernel config option is pretty pointless. But I'm
> > > not sure that reducing memory consumption is completely pointless.
> >
> > Agree, depends on how you do it. If you preallocate all the memory
> > required for 1024 entries times the queue depth, then it may not be that
> > small. But you can do it a bit more cleverly than that, and then I don't
> > think it makes a lot of sense to provide any options for shrinking it.
>
> The reason I mentioned making the number of SGs configurable is because with
> a lot of controllers in the box (say 8, or ridiculous numbers of controllers
> are potentially possible on some big ia64 boxes) then the memory available
> by way of pci_alloc_consistent can be exhausted, and we have seen that happen.
>
> The command buffers have to be in the first 4GB of memory, as the command
> register is only 32 bits, so they are allocated by pci_alloc_consistent.
> However, the chained SG lists don't have that limitation, so I think they
> can be kmalloc'ed, and so not chew up and unreasonable amount of the
> pci_alloc_consistent memory and get a larger number of SGs. ...right?
> Maybe that's the better way to do it.
You can use GFP_DMA32 for kmalloc() allocations below 4G. But you could
just keep the command allocation with pci_alloc_consistent() and
allocate the sgtables with ordinary kmalloc, as you suggest.
--
Jens Axboe
On Fri, Mar 6, 2009 at 6:38 AM, <[email protected]> wrote:
...
> The command buffers have to be in the first 4GB of memory, as the command
> register is only 32 bits, so they are allocated by pci_alloc_consistent.
Huh?!!
ISTR the mpt2sas driver is indicating it can handle 64-bit DMA masks for
both streaming and control data. I need to double check to be sure of that.
> However, the chained SG lists don't have that limitation, so I think they
> can be kmalloc'ed, and so not chew up and unreasonable amount of the
> pci_alloc_consistent memory and get a larger number of SGs. ...right?
> Maybe that's the better way to do it.
I thought the driver was tracking this and using the appropriate construct
based on which DMA mask is in effect.
hth,
grant
> -- steve
> --
> To unsubscribe from this list: send the line "unsubscribe linux-scsi" in
> the body of a message to [email protected]
> More majordomo info at http://vger.kernel.org/majordomo-info.html
>
On Fri, Mar 06, 2009 at 12:59:48PM -0800, Grant Grundler wrote:
> On Fri, Mar 6, 2009 at 6:38 AM, <[email protected]> wrote:
> ...
> > The command buffers have to be in the first 4GB of memory, as the command
> > register is only 32 bits, so they are allocated by pci_alloc_consistent.
>
> Huh?!!
> ISTR the mpt2sas driver is indicating it can handle 64-bit DMA masks for
> both streaming and control data. I need to double check to be sure of that.
it is something specific to smart array. The command register that we
stuff the bus address of the command into is only 32 bits wide. Everything
else it does is 64 bits.
>
>
> > However, the chained SG lists don't have that limitation, so I think they
> > can be kmalloc'ed, and so not chew up and unreasonable amount of the
> > pci_alloc_consistent memory and get a larger number of SGs. ? ...right?
> > Maybe that's the better way to do it.
>
> I thought the driver was tracking this and using the appropriate construct
> based on which DMA mask is in effect.
The DMA mask is insufficiently expressive to describe the limitations and
capabilities of the Smart array. There's no way to describe with a single
DMA mask that the command register is 32-bits, but everything else is 64
bits.
-- steve
On Fri, Mar 6, 2009 at 1:18 PM, <[email protected]> wrote:
> On Fri, Mar 06, 2009 at 12:59:48PM -0800, Grant Grundler wrote:
>> On Fri, Mar 6, 2009 at 6:38 AM, <[email protected]> wrote:
>> ...
>> > The command buffers have to be in the first 4GB of memory, as the command
>> > register is only 32 bits, so they are allocated by pci_alloc_consistent.
>>
>> Huh?!!
>> ISTR the mpt2sas driver is indicating it can handle 64-bit DMA masks for
>> both streaming and control data. I need to double check to be sure of that.
>
> it is something specific to smart array. The command register that we
> stuff the bus address of the command into is only 32 bits wide. Everything
> else it does is 64 bits.
Sorry...I'm spacing out and confusing my drivers.
thanks,
grant
>>
>> > However, the chained SG lists don't have that limitation, so I think they
>> > can be kmalloc'ed, and so not chew up and unreasonable amount of the
>> > pci_alloc_consistent memory and get a larger number of SGs. ...right?
>> > Maybe that's the better way to do it.
>>
>> I thought the driver was tracking this and using the appropriate construct
>> based on which DMA mask is in effect.
>
> The DMA mask is insufficiently expressive to describe the limitations and
> capabilities of the Smart array. There's no way to describe with a single
> DMA mask that the command register is 32-bits, but everything else is 64
> bits.
>
> -- steve
>
>
On Fri, 2009-03-06 at 15:18 -0600, [email protected]
wrote:
> On Fri, Mar 06, 2009 at 12:59:48PM -0800, Grant Grundler wrote:
> > On Fri, Mar 6, 2009 at 6:38 AM, <[email protected]> wrote:
> > ...
> > > The command buffers have to be in the first 4GB of memory, as the command
> > > register is only 32 bits, so they are allocated by pci_alloc_consistent.
> >
> > Huh?!!
> > ISTR the mpt2sas driver is indicating it can handle 64-bit DMA masks for
> > both streaming and control data. I need to double check to be sure of that.
>
> it is something specific to smart array. The command register that we
> stuff the bus address of the command into is only 32 bits wide. Everything
> else it does is 64 bits.
> >
> >
> > > However, the chained SG lists don't have that limitation, so I think they
> > > can be kmalloc'ed, and so not chew up and unreasonable amount of the
> > > pci_alloc_consistent memory and get a larger number of SGs. ...right?
> > > Maybe that's the better way to do it.
> >
> > I thought the driver was tracking this and using the appropriate construct
> > based on which DMA mask is in effect.
>
> The DMA mask is insufficiently expressive to describe the limitations and
> capabilities of the Smart array. There's no way to describe with a single
> DMA mask that the command register is 32-bits, but everything else is 64
> bits.
Actually, there is ... it's what you're doing: use a coherent mask of 32
bits and a dma mask of 64bits.
The aic79xx has exactly the same problem (its internal sequencer only
has a 32 bit wide programme counter, so it can only execute sequencer
scripts if they're in the first 4GB of memory). I think it's fairly
common amongst intelligent controllers that are old enough to have been
32 bit only but which got extended to work on 64 bits.
To get ordinary memory for this, you just use GFP_DMA32 as has been
previously stated.
James