LinuxLists.cc - [PATCH] scatterlist: Speed up for_each

2019-10-25 21:37:24

Subject: [PATCH] scatterlist: Speed up for_each_sg() loop macro

From: Sultan Alsawaf <[email protected]>

Scatterlists are chained in predictable arrays of up to
SG_MAX_SINGLE_ALLOC sg structs in length. Using this knowledge, speed up
for_each_sg() by using constant operations to determine when to simply
increment the sg pointer by one or get the next sg array in the chain.

Rudimentary measurements with a trivial loop body show that this yields
roughly a 2x performance gain.

The following simple test module proves the correctness of the new loop
definition by testing all the different edge cases of sg chains:
#include <linux/module.h>
#include <linux/scatterlist.h>
#include <linux/slab.h>

static int __init test_for_each_sg(void)
{
static const gfp_t gfp_flags = GFP_KERNEL | __GFP_NOFAIL;
struct scatterlist *sg;
struct sg_table *table;
long old = 0, new = 0;
unsigned int i, nents;

table = kmalloc(sizeof(*table), gfp_flags);
for (nents = 1; nents <= 3 * SG_MAX_SINGLE_ALLOC; nents++) {
BUG_ON(sg_alloc_table(table, nents, gfp_flags));
for (sg = table->sgl; sg; sg = sg_next(sg))
old ^= (long)sg;
for_each_sg(table->sgl, sg, nents, i)
new ^= (long)sg;
sg_free_table(table);
}

BUG_ON(old != new);
kfree(table);
return 0;
}
module_init(test_for_each_sg);

Signed-off-by: Sultan Alsawaf <[email protected]>
---
include/linux/scatterlist.h | 5 ++++-
1 file changed, 4 insertions(+), 1 deletion(-)

diff --git a/include/linux/scatterlist.h b/include/linux/scatterlist.h
index 556ec1ea2574..73f7fd6702d7 100644
--- a/include/linux/scatterlist.h
+++ b/include/linux/scatterlist.h
@@ -146,7 +146,10 @@ static inline void sg_set_buf(struct scatterlist *sg, const void *buf,
* Loop over each sg element, following the pointer to a new list if necessary
*/
#define for_each_sg(sglist, sg, nr, __i) \
- for (__i = 0, sg = (sglist); __i < (nr); __i++, sg = sg_next(sg))
+ for (__i = 0, sg = (sglist); __i < (nr); \
+ likely(++__i % (SG_MAX_SINGLE_ALLOC - 1) || \
+ (__i + 1) >= (nr)) ? sg++ : \
+ (sg = sg_chain_ptr(sg + 1)))

/**
* sg_chain - Chain two sglists together
--
2.23.0

2019-10-28 21:18:42

by Christoph Hellwig

[permalink] [raw]

Subject: Re: [PATCH] scatterlist: Speed up for_each_sg() loop macro

On Mon, Oct 28, 2019 at 01:23:20PM -0300, Jason Gunthorpe wrote:
> This testing is making assumptions about how 'nr' is used and the
> construction of the sgl though
>
> If any chains are partially populated, or for some reason the driver
> starts at a different sgl, it will break. You'll need to somehow
> show none of those possibilities are happening.

And there is nothing forcing a particular layout, there just happens
to be a layout that the generic allocator gives you. I'm not even
sure the original patch handles the SCSI case of small inlines segments
properly.

2019-10-28 21:20:25

by Sultan Alsawaf

[permalink] [raw]

Subject: Re: [PATCH] scatterlist: Speed up for_each_sg() loop macro

On Mon, Oct 28, 2019 at 09:28:16AM -0700, Christoph Hellwig wrote:
> And there is nothing forcing a particular layout, there just happens
> to be a layout that the generic allocator gives you. I'm not even
> sure the original patch handles the SCSI case of small inlines segments
> properly.

I'm doubtful; are there really sg users that craft their own sglists and then
use for_each_sg() on them? But like I mentioned in the email I sent a
few minutes ago, this can be alleviated with a more comprehensive version of
this patch that alters all for_each_sg() users and thus ensures that only
sg_table pointers can be used with the macro. Anyone who would munge their own
sg_table together would simply be insane :)

Or there could be a separate macro for iterating through each sg in an sg_table.
There are lots of ways to go about this.

Sultan

2019-10-29 00:04:24

by kernel test robot

[permalink] [raw]

Subject: Re: [PATCH] scatterlist: Speed up for_each_sg() loop macro

Hi Sultan,

Thank you for the patch! Perhaps something to improve:

[auto build test WARNING on linus/master]
[also build test WARNING on v5.4-rc5 next-20191028]
[if your patch is applied to the wrong git tree, please drop us a note to help
improve the system. BTW, we also suggest to use '--base' option to specify the
base tree in git format-patch, please see https://stackoverflow.com/a/37406982]

url: https://github.com/0day-ci/linux/commits/Sultan-Alsawaf/scatterlist-Speed-up-for_each_sg-loop-macro/20191029-045656
base: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git 8005803a2ca0af49f36f6e9329b5ecda3df27347
config: arm-allmodconfig (attached as .config)
compiler: arm-linux-gnueabi-gcc (GCC) 7.4.0
reproduce:
wget https://raw.githubusercontent.com/intel/lkp-tests/master/sbin/make.cross -O ~/bin/make.cross
chmod +x ~/bin/make.cross
# save the attached .config to linux build tree
GCC_VERSION=7.4.0 make.cross ARCH=arm

If you fix the issue, kindly add following tag
Reported-by: kbuild test robot <[email protected]>

Note: it may well be a FALSE warning. FWIW you are at least aware of it now.
http://gcc.gnu.org/wiki/Better_Uninitialized_Warnings

All warnings (new ones prefixed by >>):

drivers/dma/st_fdma.c: In function 'st_fdma_prep_slave_sg':
>> drivers/dma/st_fdma.c:549:19: warning: 'hw_node' may be used uninitialized in this function [-Wmaybe-uninitialized]
hw_node->control |= FDMA_NODE_CTRL_INT_EON;
^~

vim +/hw_node +549 drivers/dma/st_fdma.c

6b4cd727eaf15e Peter Griffin 2016-10-18 504
6b4cd727eaf15e Peter Griffin 2016-10-18 505 static struct dma_async_tx_descriptor *st_fdma_prep_slave_sg(
6b4cd727eaf15e Peter Griffin 2016-10-18 506 struct dma_chan *chan, struct scatterlist *sgl,
6b4cd727eaf15e Peter Griffin 2016-10-18 507 unsigned int sg_len, enum dma_transfer_direction direction,
6b4cd727eaf15e Peter Griffin 2016-10-18 508 unsigned long flags, void *context)
6b4cd727eaf15e Peter Griffin 2016-10-18 509 {
6b4cd727eaf15e Peter Griffin 2016-10-18 510 struct st_fdma_chan *fchan;
6b4cd727eaf15e Peter Griffin 2016-10-18 511 struct st_fdma_desc *fdesc;
6b4cd727eaf15e Peter Griffin 2016-10-18 512 struct st_fdma_hw_node *hw_node;
6b4cd727eaf15e Peter Griffin 2016-10-18 513 struct scatterlist *sg;
6b4cd727eaf15e Peter Griffin 2016-10-18 514 int i;
6b4cd727eaf15e Peter Griffin 2016-10-18 515
6b4cd727eaf15e Peter Griffin 2016-10-18 516 fchan = st_fdma_prep_common(chan, sg_len, direction);
6b4cd727eaf15e Peter Griffin 2016-10-18 517 if (!fchan)
6b4cd727eaf15e Peter Griffin 2016-10-18 518 return NULL;
6b4cd727eaf15e Peter Griffin 2016-10-18 519
6b4cd727eaf15e Peter Griffin 2016-10-18 520 if (!sgl)
6b4cd727eaf15e Peter Griffin 2016-10-18 521 return NULL;
6b4cd727eaf15e Peter Griffin 2016-10-18 522
6b4cd727eaf15e Peter Griffin 2016-10-18 523 fdesc = st_fdma_alloc_desc(fchan, sg_len);
6b4cd727eaf15e Peter Griffin 2016-10-18 524 if (!fdesc) {
6b4cd727eaf15e Peter Griffin 2016-10-18 525 dev_err(fchan->fdev->dev, "no memory for desc\n");
6b4cd727eaf15e Peter Griffin 2016-10-18 526 return NULL;
6b4cd727eaf15e Peter Griffin 2016-10-18 527 }
6b4cd727eaf15e Peter Griffin 2016-10-18 528
6b4cd727eaf15e Peter Griffin 2016-10-18 529 fdesc->iscyclic = false;
6b4cd727eaf15e Peter Griffin 2016-10-18 530
6b4cd727eaf15e Peter Griffin 2016-10-18 531 for_each_sg(sgl, sg, sg_len, i) {
6b4cd727eaf15e Peter Griffin 2016-10-18 532 hw_node = fdesc->node[i].desc;
6b4cd727eaf15e Peter Griffin 2016-10-18 533
6b4cd727eaf15e Peter Griffin 2016-10-18 534 hw_node->next = fdesc->node[(i + 1) % sg_len].pdesc;
6b4cd727eaf15e Peter Griffin 2016-10-18 535 hw_node->control = FDMA_NODE_CTRL_REQ_MAP_DREQ(fchan->dreq_line);
6b4cd727eaf15e Peter Griffin 2016-10-18 536
6b4cd727eaf15e Peter Griffin 2016-10-18 537 fill_hw_node(hw_node, fchan, direction);
6b4cd727eaf15e Peter Griffin 2016-10-18 538
6b4cd727eaf15e Peter Griffin 2016-10-18 539 if (direction == DMA_MEM_TO_DEV)
6b4cd727eaf15e Peter Griffin 2016-10-18 540 hw_node->saddr = sg_dma_address(sg);
6b4cd727eaf15e Peter Griffin 2016-10-18 541 else
6b4cd727eaf15e Peter Griffin 2016-10-18 542 hw_node->daddr = sg_dma_address(sg);
6b4cd727eaf15e Peter Griffin 2016-10-18 543
6b4cd727eaf15e Peter Griffin 2016-10-18 544 hw_node->nbytes = sg_dma_len(sg);
6b4cd727eaf15e Peter Griffin 2016-10-18 545 hw_node->generic.length = sg_dma_len(sg);
6b4cd727eaf15e Peter Griffin 2016-10-18 546 }
6b4cd727eaf15e Peter Griffin 2016-10-18 547
6b4cd727eaf15e Peter Griffin 2016-10-18 548 /* interrupt at end of last node */
6b4cd727eaf15e Peter Griffin 2016-10-18 @549 hw_node->control |= FDMA_NODE_CTRL_INT_EON;
6b4cd727eaf15e Peter Griffin 2016-10-18 550
6b4cd727eaf15e Peter Griffin 2016-10-18 551 return vchan_tx_prep(&fchan->vchan, &fdesc->vdesc, flags);
6b4cd727eaf15e Peter Griffin 2016-10-18 552 }
6b4cd727eaf15e Peter Griffin 2016-10-18 553

:::::: The code at line 549 was first introduced by commit
:::::: 6b4cd727eaf15ed225b9a3a96ec1d64761ee728a dmaengine: st_fdma: Add STMicroelectronics FDMA engine driver support

:::::: TO: Peter Griffin <[email protected]>
:::::: CC: Vinod Koul <[email protected]>

---
0-DAY kernel test infrastructure Open Source Technology Center
https://lists.01.org/pipermail/kbuild-all Intel Corporation

Attachments:

(No filename) (5.46 kB)
.config.gz (70.29 kB)
Download all attachments

2019-10-29 02:40:37

by Jason Gunthorpe

[permalink] [raw]

Subject: Re: [PATCH] scatterlist: Speed up for_each_sg() loop macro

On Fri, Oct 25, 2019 at 02:33:58PM -0700, Sultan Alsawaf wrote:
>
> Signed-off-by: Sultan Alsawaf <[email protected]>
> include/linux/scatterlist.h | 5 ++++-
> 1 file changed, 4 insertions(+), 1 deletion(-)
>
> diff --git a/include/linux/scatterlist.h b/include/linux/scatterlist.h
> index 556ec1ea2574..73f7fd6702d7 100644
> +++ b/include/linux/scatterlist.h
> @@ -146,7 +146,10 @@ static inline void sg_set_buf(struct scatterlist *sg, const void *buf,
> * Loop over each sg element, following the pointer to a new list if necessary
> */
> #define for_each_sg(sglist, sg, nr, __i) \
> - for (__i = 0, sg = (sglist); __i < (nr); __i++, sg = sg_next(sg))
> + for (__i = 0, sg = (sglist); __i < (nr); \
> + likely(++__i % (SG_MAX_SINGLE_ALLOC - 1) || \
> + (__i + 1) >= (nr)) ? sg++ : \
> + (sg = sg_chain_ptr(sg + 1)))

This is a big change in the algorithm, why are you sure it is OK?

Did you compare with just inlining sg_net?

Jason

2019-10-29 06:00:30

by Sultan Alsawaf

[permalink] [raw]

Subject: Re: [PATCH] scatterlist: Speed up for_each_sg() loop macro

On Mon, Oct 28, 2019 at 11:17:34AM -0300, Jason Gunthorpe wrote:
> This is a big change in the algorithm, why are you sure it is OK?

I'm sure it's OK because the test module I provided in the commit message
encapsulates all the possible edge cases of sg chaining:
-An sglist with >=1 && <=(SG_MAX_SINGLE_ALLOC-1) nents (no chaining, the last
element in the array is unused)
-An sglist with SG_MAX_SINGLE_ALLOC nents (no chaining, the last element in the
array isn't an sg chain link)
-An sglist with >SG_MAX_SINGLE_ALLOC && <=2*(SG_MAX_SINGLE_ALLOC-1) nents (there
is one chain to another array, and the other array's last element is unused)
-An sglist with (2*SG_MAX_SINGLE_ALLOC)-1 nents (there is one chain to another
array, and the other array's last element isn't an sg chain link)
-An sglist with 2*SG_MAX_SINGLE_ALLOC nents (there are two chains to other
arrays, and the 3rd array contains 2 sgs & its last element is unused)
-An sglist with >2*SG_MAX_SINGLE_ALLOC && <(3*SG_MAX_SINGLE_ALLOC)-1 nents
(there are two chains to other arrays, and the 3rd array's last element isn't
an sg chain)

I just made my module test nents >=1 && <=3*SG_MAX_SINGLE_ALLOC for simplicity.
My proposed for_each_sg() also handles nents==0 the same as before by doing
nothing.

> Did you compare with just inlining sg_net?

Yes. Forcefully inlining sg_next() had no impact on performance.

Sultan

2019-10-29 06:01:43

by Jason Gunthorpe

[permalink] [raw]

Subject: Re: [PATCH] scatterlist: Speed up for_each_sg() loop macro

On Mon, Oct 28, 2019 at 09:18:48AM -0700, Sultan Alsawaf wrote:
> On Mon, Oct 28, 2019 at 11:17:34AM -0300, Jason Gunthorpe wrote:
> > This is a big change in the algorithm, why are you sure it is OK?
>
> I'm sure it's OK because the test module I provided in the commit message
> encapsulates all the possible edge cases of sg chaining:
> -An sglist with >=1 && <=(SG_MAX_SINGLE_ALLOC-1) nents (no chaining, the last
> element in the array is unused)
> -An sglist with SG_MAX_SINGLE_ALLOC nents (no chaining, the last element in the
> array isn't an sg chain link)
> -An sglist with >SG_MAX_SINGLE_ALLOC && <=2*(SG_MAX_SINGLE_ALLOC-1) nents (there
> is one chain to another array, and the other array's last element is unused)
> -An sglist with (2*SG_MAX_SINGLE_ALLOC)-1 nents (there is one chain to another
> array, and the other array's last element isn't an sg chain link)
> -An sglist with 2*SG_MAX_SINGLE_ALLOC nents (there are two chains to other
> arrays, and the 3rd array contains 2 sgs & its last element is unused)
> -An sglist with >2*SG_MAX_SINGLE_ALLOC && <(3*SG_MAX_SINGLE_ALLOC)-1 nents
> (there are two chains to other arrays, and the 3rd array's last element isn't
> an sg chain)

This testing is making assumptions about how 'nr' is used and the
construction of the sgl though

If any chains are partially populated, or for some reason the driver
starts at a different sgl, it will break. You'll need to somehow
show none of those possibilities are happening.

Jason

2019-10-29 06:06:51

by Sultan Alsawaf

[permalink] [raw]

Subject: Re: [PATCH] scatterlist: Speed up for_each_sg() loop macro

On Mon, Oct 28, 2019 at 01:23:20PM -0300, Jason Gunthorpe wrote:
> On Mon, Oct 28, 2019 at 09:18:48AM -0700, Sultan Alsawaf wrote:
> > On Mon, Oct 28, 2019 at 11:17:34AM -0300, Jason Gunthorpe wrote:
> > > This is a big change in the algorithm, why are you sure it is OK?
> >
> > I'm sure it's OK because the test module I provided in the commit message
> > encapsulates all the possible edge cases of sg chaining:
> > -An sglist with >=1 && <=(SG_MAX_SINGLE_ALLOC-1) nents (no chaining, the last
> > element in the array is unused)
> > -An sglist with SG_MAX_SINGLE_ALLOC nents (no chaining, the last element in the
> > array isn't an sg chain link)
> > -An sglist with >SG_MAX_SINGLE_ALLOC && <=2*(SG_MAX_SINGLE_ALLOC-1) nents (there
> > is one chain to another array, and the other array's last element is unused)
> > -An sglist with (2*SG_MAX_SINGLE_ALLOC)-1 nents (there is one chain to another
> > array, and the other array's last element isn't an sg chain link)
> > -An sglist with 2*SG_MAX_SINGLE_ALLOC nents (there are two chains to other
> > arrays, and the 3rd array contains 2 sgs & its last element is unused)
> > -An sglist with >2*SG_MAX_SINGLE_ALLOC && <(3*SG_MAX_SINGLE_ALLOC)-1 nents
> > (there are two chains to other arrays, and the 3rd array's last element isn't
> > an sg chain)
>
> This testing is making assumptions about how 'nr' is used and the
> construction of the sgl though
>
> If any chains are partially populated, or for some reason the driver
> starts at a different sgl, it will break. You'll need to somehow
> show none of those possibilities are happening.

I haven't personally witnessed any for_each_sg() use that starts at a different
sgl, but this is something I considered as well. The optimal solution would be
to alter for_each_sg() to replace the `sg` and `nr` arguments with a single
sg_table pointer argument instead. Users that genuinely need to start at a
different sgl can just craft their own for-loop with sg_next.

This would require changes to a lot of files though, and I wanted to see how the
mailing list felt about this change before embarking on that.

Sultan

2019-10-29 07:25:46

by Ming Lei

[permalink] [raw]

Subject: Re: [PATCH] scatterlist: Speed up for_each_sg() loop macro

On Fri, Oct 25, 2019 at 02:33:58PM -0700, Sultan Alsawaf wrote:
> From: Sultan Alsawaf <[email protected]>
>
> Scatterlists are chained in predictable arrays of up to
> SG_MAX_SINGLE_ALLOC sg structs in length. Using this knowledge, speed up
> for_each_sg() by using constant operations to determine when to simply
> increment the sg pointer by one or get the next sg array in the chain.
>
> Rudimentary measurements with a trivial loop body show that this yields
> roughly a 2x performance gain.
>
> The following simple test module proves the correctness of the new loop
> definition by testing all the different edge cases of sg chains:
> #include <linux/module.h>
> #include <linux/scatterlist.h>
> #include <linux/slab.h>
>
> static int __init test_for_each_sg(void)
> {
> static const gfp_t gfp_flags = GFP_KERNEL | __GFP_NOFAIL;
> struct scatterlist *sg;
> struct sg_table *table;
> long old = 0, new = 0;
> unsigned int i, nents;
>
> table = kmalloc(sizeof(*table), gfp_flags);
> for (nents = 1; nents <= 3 * SG_MAX_SINGLE_ALLOC; nents++) {
> BUG_ON(sg_alloc_table(table, nents, gfp_flags));
> for (sg = table->sgl; sg; sg = sg_next(sg))
> old ^= (long)sg;
> for_each_sg(table->sgl, sg, nents, i)
> new ^= (long)sg;
> sg_free_table(table);
> }
>
> BUG_ON(old != new);
> kfree(table);
> return 0;
> }
> module_init(test_for_each_sg);
>
> Signed-off-by: Sultan Alsawaf <[email protected]>
> ---
> include/linux/scatterlist.h | 5 ++++-
> 1 file changed, 4 insertions(+), 1 deletion(-)
>
> diff --git a/include/linux/scatterlist.h b/include/linux/scatterlist.h
> index 556ec1ea2574..73f7fd6702d7 100644
> --- a/include/linux/scatterlist.h
> +++ b/include/linux/scatterlist.h
> @@ -146,7 +146,10 @@ static inline void sg_set_buf(struct scatterlist *sg, const void *buf,
> * Loop over each sg element, following the pointer to a new list if necessary
> */
> #define for_each_sg(sglist, sg, nr, __i) \
> - for (__i = 0, sg = (sglist); __i < (nr); __i++, sg = sg_next(sg))
> + for (__i = 0, sg = (sglist); __i < (nr); \
> + likely(++__i % (SG_MAX_SINGLE_ALLOC - 1) || \
> + (__i + 1) >= (nr)) ? sg++ : \
> + (sg = sg_chain_ptr(sg + 1)))
>

sg_alloc_table_chained() may put a small sglist as the first chunk, then
chained with big one, and your patch breaks such usage.

Thanks,
Ming

2019-11-01 03:27:47

by kernel test robot

[permalink] [raw]

Subject: [scatterlist] 8f39742f03: suspend_stress.fail

FYI, we noticed the following commit (built with gcc-7):

commit: 8f39742f0383b3228de59f3d0e69715b8ec7eb61 ("[PATCH] scatterlist: Speed up for_each_sg() loop macro")
url: https://github.com/0day-ci/linux/commits/Sultan-Alsawaf/scatterlist-Speed-up-for_each_sg-loop-macro/20191029-045656

in testcase: suspend_stress
with following parameters:

mode: freeze
iterations: 10

on test machine: 4 threads BroadWell with 8G memory

caused below changes (please refer to attached dmesg/kmsg for entire log/backtrace):

If you fix the issue, kindly add following tag
Reported-by: kernel test robot <[email protected]>

test started
(then system reboots upon the first freeze.
with commit 8005803a2ca0af49f36f6e9329b5ecda3df27347, we can
always do 10 iterations of freeze.)

To reproduce:

git clone https://github.com/intel/lkp-tests.git
cd lkp-tests
bin/lkp install job.yaml # job file is attached in this email
bin/lkp run job.yaml

Thanks,
Oliver Sang

Attachments:

(No filename) (1.02 kB)
config-5.4.0-rc5-00020-g8f39742f0383b (203.86 kB)
job-script (4.94 kB)
suspend_stress (14.00 B)
job.yaml (3.82 kB)
Download all attachments