LinuxLists.cc - [PATCH 1/2] lib: more scalable list

[permalink] [raw]

Subject: [PATCH 2/2] lib: revise list_sort() comment

Clarify and correct header comment of list_sort().

Signed-off-by: Don Mullis <[email protected]>
Cc: Dave Airlie <[email protected]>
Cc: Andi Kleen <[email protected]>
Cc: Dave Chinner <[email protected]>
Cc: Artem Bityutskiy <[email protected]>
---
lib/list_sort.c | 16 +++++++++-------
1 file changed, 9 insertions(+), 7 deletions(-)

Index: linux-2.6/lib/list_sort.c
===================================================================
--- linux-2.6.orig/lib/list_sort.c 2010-01-19 22:26:03.000000000 -0800
+++ linux-2.6/lib/list_sort.c 2010-01-19 22:28:19.000000000 -0800
@@ -38,17 +38,19 @@ static void restore_back_links(struct li
}

/**
- * list_sort - sort a list.
- * @priv: private data, passed to @cmp
+ * list_sort - sort a list
+ * @priv: private data, opaque to list_sort(), passed to @cmp
* @head: the list to sort
* @cmp: the elements comparison function
*
- * This function implements "merge sort" which has O(nlog(n)) complexity.
- * The list is sorted in ascending order.
+ * This function implements "merge sort", which has O(nlog(n))
+ * complexity.
*
- * The comparison function @cmp is supposed to return a negative value if @a is
- * less than @b, and a positive value if @a is greater than @b. If @a and @b
- * are equivalent, then it does not matter what this function returns.
+ * The comparison function @cmp must return a negative value if @a
+ * should sort before @b, and a positive value if @a should sort after
+ * @b. If @a and @b are equivalent, and their original relative
+ * ordering is to be preserved, @cmp should return 0; otherwise, the
+ * return value does not matter.
*/
void list_sort(void *priv, struct list_head *head,
int (*cmp)(void *priv, struct list_head *a,

2010-01-21 09:24:50

[permalink] [raw]

Subject: Re: [PATCH 1/2] lib: more scalable list_sort()

On Wed, 2010-01-20 at 20:51 -0800, Don Mullis wrote:
> The use of list_sort() by UBIFS looks like it could generate long
> lists; this alternative implementation scales better, reaching ~3x
> performance gain as list length approaches the L2 cache size.
>
> Stand-alone program timings were run on a Core 2 duo L1=32KB L2=4MB,
> gcc-4.4, with flags extracted from an Ubuntu kernel build. Object
> size is 552 bytes versus 405 for Mark J. Roberts' code.
>
> Worst case for either implementation is a list length just over a POT,
> and to roughly the same degree, so here are results for a range of
> 2^N+1 lengths. List elements were 16 bytes each including malloc
> overhead; random initial order.
>

Could you please add a debugging function which would be compiled-out
normally, and which would check that on the output 'list_sort()' gives
really sorted list, and number of elements in the list stays the same.
You'd call this function before returning from list_sort(). Something
like:

#ifdef DEBUG_LIST_SORT
static int list_check(void *priv, struct list_head *head,
int (*cmp)(void *priv, struct list_head *a,
struct list_head *b))
{
/* Checking */
}
#else
#define list_check(priv, head, cmp) 0
#endif

This will provide more confidence in the algorithm correctness for
everyone who modifies 'list_sort()'.

--
Best Regards,
Artem Bityutskiy (Артём Битюцкий)

2010-01-21 09:55:16

[permalink] [raw]

Subject: Re: [PATCH 1/2] lib: more scalable list_sort()

On Thu, Jan 21, 2010 at 11:22:55AM +0200, Artem Bityutskiy wrote:
> On Wed, 2010-01-20 at 20:51 -0800, Don Mullis wrote:
> > The use of list_sort() by UBIFS looks like it could generate long
> > lists; this alternative implementation scales better, reaching ~3x
> > performance gain as list length approaches the L2 cache size.
> >
> > Stand-alone program timings were run on a Core 2 duo L1=32KB L2=4MB,
> > gcc-4.4, with flags extracted from an Ubuntu kernel build. Object
> > size is 552 bytes versus 405 for Mark J. Roberts' code.
> >
> > Worst case for either implementation is a list length just over a POT,
> > and to roughly the same degree, so here are results for a range of
> > 2^N+1 lengths. List elements were 16 bytes each including malloc
> > overhead; random initial order.
> >
>
> Could you please add a debugging function which would be compiled-out
> normally, and which would check that on the output 'list_sort()' gives
> really sorted list, and number of elements in the list stays the same.
> You'd call this function before returning from list_sort(). Something
> like:
>
> #ifdef DEBUG_LIST_SORT
> static int list_check(void *priv, struct list_head *head,
> int (*cmp)(void *priv, struct list_head *a,
> struct list_head *b))
> {
> /* Checking */
> }
> #else
> #define list_check(priv, head, cmp) 0
> #endif
>
> This will provide more confidence in the algorithm correctness for
> everyone who modifies 'list_sort()'.

I'd suggest the same method as employed in lib/sort.c - a
simple userspace program that verifies correct operation is included
in lib/sort.c....

Cheers,

Dave.
--
Dave Chinner
[email protected]

2010-01-21 11:47:21

[permalink] [raw]

Subject: Re: [PATCH 1/2] lib: more scalable list_sort()

On Thu, 2010-01-21 at 20:54 +1100, Dave Chinner wrote:
> On Thu, Jan 21, 2010 at 11:22:55AM +0200, Artem Bityutskiy wrote:
> > On Wed, 2010-01-20 at 20:51 -0800, Don Mullis wrote:
> > > The use of list_sort() by UBIFS looks like it could generate long
> > > lists; this alternative implementation scales better, reaching ~3x
> > > performance gain as list length approaches the L2 cache size.
> > >
> > > Stand-alone program timings were run on a Core 2 duo L1=32KB L2=4MB,
> > > gcc-4.4, with flags extracted from an Ubuntu kernel build. Object
> > > size is 552 bytes versus 405 for Mark J. Roberts' code.
> > >
> > > Worst case for either implementation is a list length just over a POT,
> > > and to roughly the same degree, so here are results for a range of
> > > 2^N+1 lengths. List elements were 16 bytes each including malloc
> > > overhead; random initial order.
> > >
> >
> > Could you please add a debugging function which would be compiled-out
> > normally, and which would check that on the output 'list_sort()' gives
> > really sorted list, and number of elements in the list stays the same.
> > You'd call this function before returning from list_sort(). Something
> > like:
> >
> > #ifdef DEBUG_LIST_SORT
> > static int list_check(void *priv, struct list_head *head,
> > int (*cmp)(void *priv, struct list_head *a,
> > struct list_head *b))
> > {
> > /* Checking */
> > }
> > #else
> > #define list_check(priv, head, cmp) 0
> > #endif
> >
> > This will provide more confidence in the algorithm correctness for
> > everyone who modifies 'list_sort()'.
>
> I'd suggest the same method as employed in lib/sort.c - a
> simple userspace program that verifies correct operation is included
> in lib/sort.c....

Yeah, that's also an option.

--
Best Regards,
Artem Bityutskiy (Артём Битюцкий)

2010-01-21 16:34:49

[permalink] [raw]

Subject: Re: [PATCH 1/2] lib: more scalable list_sort()

Artem Bityutskiy <[email protected]> writes:

> On Thu, 2010-01-21 at 20:54 +1100, Dave Chinner wrote:
>> On Thu, Jan 21, 2010 at 11:22:55AM +0200, Artem Bityutskiy wrote:
>> >
>> > Could you please add a debugging function which would be compiled-out
>> > normally, and which would check that on the output 'list_sort()' gives
>> > really sorted list, and number of elements in the list stays the same.
>> > You'd call this function before returning from list_sort(). Something
>> > like:
>> >
>> > #ifdef DEBUG_LIST_SORT
>> > static int list_check(void *priv, struct list_head *head,
>> > int (*cmp)(void *priv, struct list_head *a,
>> > struct list_head *b))
>> > {
>> > /* Checking */
>> > }
>> > #else
>> > #define list_check(priv, head, cmp) 0
>> > #endif
>> >
>> > This will provide more confidence in the algorithm correctness for
>> > everyone who modifies 'list_sort()'.
>>
>> I'd suggest the same method as employed in lib/sort.c - a
>> simple userspace program that verifies correct operation is included
>> in lib/sort.c....
>
> Yeah, that's also an option.

Okay. The regression test in lib/sort.c is kernel-space, run once at
boot. I'd like to do something similar for lib/list_sort.c, conditioned
on DEBUG_LIST_SORT. I would extend the testing to verify stability as
well as sort order and number of elements.

2010-01-21 17:59:20

[permalink] [raw]

Subject: Re: [PATCH 1/2] lib: more scalable list_sort()

On Wed, Jan 20, 2010 at 08:51:26PM -0800, Don Mullis wrote:
> The use of list_sort() by UBIFS looks like it could generate long
> lists; this alternative implementation scales better, reaching ~3x
> performance gain as list length approaches the L2 cache size.

If this can really be called with long lists
the function likely needs (optional) need_resched()s
Otherwise it could ruin scheduling latencies.

-Andi

2010-01-21 19:33:32

by Olaf Titz

[permalink] [raw]

Subject: Re: [PATCH 2/2] lib: revise list_sort() comment

> + * The comparison function @cmp must return a negative value if @a
> + * should sort before @b, and a positive value if @a should sort after
> + * @b. If @a and @b are equivalent, and their original relative
> + * ordering is to be preserved, @cmp should return 0; otherwise, the
> + * return value does not matter.

This "otherwise... does not matter" sounds funny and confusing. Either
read this as "the return value does not matter if it is neither <0, >0
or ==0" or "the return value does not matter if the function wants it
to be ignored". :-)

Just omitting the "otherwise" clause would be clearer.

Olaf

2010-01-22 03:17:22

[permalink] [raw]

Subject: Re: [PATCH 1/2] lib: more scalable list_sort()

Andi Kleen <[email protected]> writes:

> On Wed, Jan 20, 2010 at 08:51:26PM -0800, Don Mullis wrote:
>> The use of list_sort() by UBIFS looks like it could generate long
>> lists; this alternative implementation scales better, reaching ~3x
>> performance gain as list length approaches the L2 cache size.
>
> If this can really be called with long lists
> the function likely needs (optional) need_resched()s
> Otherwise it could ruin scheduling latencies.
>
> -Andi

Being just a dumb library routine, list_sort() has no idea what context
it's been called in, how long a list a particular client could pass in,
nor how expensive the client's cmp() callback might be.

The cmp() callback already passes back a client-private pointer.
Hanging off of this could be a count of calls, or timing information,
maintained by the client. Whenever some threshold is reached, the
client's cmp() could do whatever good CPU-sharing citizenship required.

This doesn't address the final O(n) pass over the list to restore the
back links. So the cost of that pass would dictate the upper limit on
list length for a client already using the cmp() call-counting/timing
trick to break up the earlier compare-and-merge passes.

If that's not good enough, a more complicated solution would be
required. But I'm hoping we don't need to go there yet.

2010-01-22 04:54:47

[permalink] [raw]

Subject: Re: [PATCH 2/2] lib: revise list_sort() comment

Olaf Titz <[email protected]> writes:

>> + * The comparison function @cmp must return a negative value if @a
>> + * should sort before @b, and a positive value if @a should sort after
>> + * @b. If @a and @b are equivalent, and their original relative
>> + * ordering is to be preserved, @cmp should return 0; otherwise, the
>> + * return value does not matter.
>
> This "otherwise... does not matter" sounds funny and confusing. Either
> read this as "the return value does not matter if it is neither <0, >0
> or ==0" or "the return value does not matter if the function wants it
> to be ignored". :-)
>
> Just omitting the "otherwise" clause would be clearer.
>
> Olaf

Okay, I will simplify the wording.

2010-01-22 10:43:25

[permalink] [raw]

Subject: Re: [PATCH 1/2] lib: more scalable list_sort()

Don Mullis <[email protected]> writes:
>
> Being just a dumb library routine, list_sort() has no idea what context
> it's been called in, how long a list a particular client could pass in,
> nor how expensive the client's cmp() callback might be.
>
> The cmp() callback already passes back a client-private pointer.
> Hanging off of this could be a count of calls, or timing information,
> maintained by the client. Whenever some threshold is reached, the
> client's cmp() could do whatever good CPU-sharing citizenship required.

need_resched() does all the timing/thresholding (it checks the
reschedule flag set by the timer interrupt). You just have to call it.
But preferable not in the inner loop, but in a outer one. It's
not hyper-expensive, but it's not free either.

The drawback is that if it's called the context always has to
allow sleeping, so it might need to be optional.

Anyways a better fix might be simply to ensure in the caller
that lists never get as long that they become a scheduling
hazard. But you indicated that ubifs would pass very long lists?
Perhaps ubifs (and other calls who might have that problem) simply
needs to be fixed.

-Andi

--
[email protected] -- Speaking for myself only.

2010-01-22 12:29:51

[permalink] [raw]

Subject: Re: [PATCH 1/2] lib: more scalable list_sort()

On Fri, 2010-01-22 at 11:43 +0100, Andi Kleen wrote:
> Don Mullis <[email protected]> writes:
> >
> > Being just a dumb library routine, list_sort() has no idea what context
> > it's been called in, how long a list a particular client could pass in,
> > nor how expensive the client's cmp() callback might be.
> >
> > The cmp() callback already passes back a client-private pointer.
> > Hanging off of this could be a count of calls, or timing information,
> > maintained by the client. Whenever some threshold is reached, the
> > client's cmp() could do whatever good CPU-sharing citizenship required.
>
> need_resched() does all the timing/thresholding (it checks the
> reschedule flag set by the timer interrupt). You just have to call it.
> But preferable not in the inner loop, but in a outer one. It's
> not hyper-expensive, but it's not free either.
>
> The drawback is that if it's called the context always has to
> allow sleeping, so it might need to be optional.
>
> Anyways a better fix might be simply to ensure in the caller
> that lists never get as long that they become a scheduling
> hazard. But you indicated that ubifs would pass very long lists?
> Perhaps ubifs (and other calls who might have that problem) simply
> needs to be fixed.

No, they are not very long. A hundred or so I guess, rarely. But we need
to check what is really the worst case, but it should not be too many.

--
Best Regards,
Artem Bityutskiy (Артём Битюцкий)

2010-01-22 17:55:48

[permalink] [raw]

Subject: Re: [PATCH 1/2] lib: more scalable list_sort()

Artem Bityutskiy <[email protected]> writes:

> On Fri, 2010-01-22 at 11:43 +0100, Andi Kleen wrote:
>> Don Mullis <[email protected]> writes:
>> >
>> > Being just a dumb library routine, list_sort() has no idea what context
>> > it's been called in, how long a list a particular client could pass in,
>> > nor how expensive the client's cmp() callback might be.
>> >
>> > The cmp() callback already passes back a client-private pointer.
>> > Hanging off of this could be a count of calls, or timing information,
>> > maintained by the client. Whenever some threshold is reached, the
>> > client's cmp() could do whatever good CPU-sharing citizenship required.
>>
>> need_resched() does all the timing/thresholding (it checks the
>> reschedule flag set by the timer interrupt). You just have to call it.
>> But preferable not in the inner loop, but in a outer one. It's
>> not hyper-expensive, but it's not free either.
>>
>> The drawback is that if it's called the context always has to
>> allow sleeping, so it might need to be optional.
>>
>> Anyways a better fix might be simply to ensure in the caller
>> that lists never get as long that they become a scheduling
>> hazard. But you indicated that ubifs would pass very long lists?
>> Perhaps ubifs (and other calls who might have that problem) simply
>> needs to be fixed.
>
> No, they are not very long. A hundred or so I guess, rarely. But we need
> to check what is really the worst case, but it should not be too many.

I suggest for now we leave scheduling issues as the caller's
responsibility, and keep list_sort() simple. Wouldn't want to be
getting any email like this:

http://lwn.net/Articles/366768/

2010-01-23 08:29:21

[permalink] [raw]

Subject: Re: [PATCH 1/2] lib: more scalable list_sort()

On Fri, Jan 22, 2010 at 11:43:21AM +0100, Andi Kleen wrote:
> Don Mullis <[email protected]> writes:
> >
> > Being just a dumb library routine, list_sort() has no idea what context
> > it's been called in, how long a list a particular client could pass in,
> > nor how expensive the client's cmp() callback might be.
> >
> > The cmp() callback already passes back a client-private pointer.
> > Hanging off of this could be a count of calls, or timing information,
> > maintained by the client. Whenever some threshold is reached, the
> > client's cmp() could do whatever good CPU-sharing citizenship required.
>
> need_resched() does all the timing/thresholding (it checks the
> reschedule flag set by the timer interrupt). You just have to call it.
> But preferable not in the inner loop, but in a outer one. It's
> not hyper-expensive, but it's not free either.
>
> The drawback is that if it's called the context always has to
> allow sleeping, so it might need to be optional.
>
> Anyways a better fix might be simply to ensure in the caller
> that lists never get as long that they become a scheduling
> hazard. But you indicated that ubifs would pass very long lists?
> Perhaps ubifs (and other calls who might have that problem) simply
> needs to be fixed.

Burning CPU time to save on IO is a very valid tradeoff in
filesystem design - burning a few hundred millieseconds of CPU
time can result in savcwinge tens of seconds of IO time. Hence
passing big long lists to be sorted is not an indication of broken
design, it's an indication of understanding CPU time vs IO time
tradeoffs during design...

For example, the use I have for list sort (sorting delayed write
buffers in ascending block order before Io submission) will result
in XFS asking for lists in the order of tens to hundreds of
thousands of items to be sorted. The code I already is showing wall
clock time reductions of 30-40% for stuff like kernel tarball
extraction or creation of millions of small files. This is because
the number of buffers being issued for IO is far larger than we
should sanely hold in the elevator request queues, but pre-sorting
them results in the elevator getting a near 100% merge efficiency of
delayed write buffers instead of near zero. Hence we get much, much
better IO patterns out of the filesystem and things go faster.

Hence I think scalability and minimising scheduling latency for
list_sort() is definitely important. I was kind of planning to address
this when the problem was a performance limiting factor, but Don
has made a pre-emptive strike. ;)

Cheers,

Dave.
--
Dave Chinner
[email protected]

2010-01-23 11:35:53

[permalink] [raw]

Subject: Re: [PATCH 1/2] lib: more scalable list_sort()

> Burning CPU time to save on IO is a very valid tradeoff in
> filesystem design - burning a few hundred millieseconds of CPU
> time can result in savcwinge tens of seconds of IO time. Hence
> passing big long lists to be sorted is not an indication of broken
> design, it's an indication of understanding CPU time vs IO time
> tradeoffs during design...

Burning long CPU time in kernel code without latency breaker code is always
a sign of broken design. When you burn you have to check for
reschedules. It's that simple.

-Andi

--
[email protected] -- Speaking for myself only.

2010-01-23 16:06:13

[permalink] [raw]

Subject: Re: [PATCH 1/2] lib: more scalable list_sort()

On Sat, Jan 23, 2010 at 12:35:51PM +0100, Andi Kleen wrote:
> > Burning CPU time to save on IO is a very valid tradeoff in
> > filesystem design - burning a few hundred millieseconds of CPU
> > time can result in savcwinge tens of seconds of IO time. Hence
> > passing big long lists to be sorted is not an indication of broken
> > design, it's an indication of understanding CPU time vs IO time
> > tradeoffs during design...
>
> Burning long CPU time in kernel code without latency breaker code is always
> a sign of broken design.

It's a characteristic of a sub-optimal implementation, not bad
design. Plenty of code has been fixed over the years simply by
adding cond_resched() to loops that have triggered latency
warnings.

Similarly, adding cond_resched() to list_sort means you can stop
worrying about the scheduling latency impact of sorting long lists.
I fail to see why you're making such a big deal out of this.....

Cheers

Dave.
--
Dave Chinner
[email protected]

2010-01-24 20:59:16

[permalink] [raw]

Subject: Re: [PATCH 1/2] lib: more scalable list_sort()

> Similarly, adding cond_resched() to list_sort means you can stop
> worrying about the scheduling latency impact of sorting long lists.
> I fail to see why you're making such a big deal out of this.....

It's not easy to add it to a low level library function like list_sort()
because you would need to ensure that all callers are allowed to sleep.

Or even if all callers currently allow it future ones might not.

So it would need a new flag or so if needed, completely changing
the interface.

-Andi

--
[email protected] -- Speaking for myself only.

2010-01-24 21:10:41

[permalink] [raw]

Subject: Re: [PATCH 1/2] lib: more scalable list_sort()

On Sun, 2010-01-24 at 21:59 +0100, Andi Kleen wrote:
> > Similarly, adding cond_resched() to list_sort means you can stop
> > worrying about the scheduling latency impact of sorting long lists.
> > I fail to see why you're making such a big deal out of this.....
>
> It's not easy to add it to a low level library function like list_sort()
> because you would need to ensure that all callers are allowed to sleep.
>
> Or even if all callers currently allow it future ones might not.
>
> So it would need a new flag or so if needed, completely changing
> the interface.

Don made IMO a good proposal for the caller to add cond_reshed() in its
'cmp()' callback, if needed. Shouldn't that work fine?

--
Best Regards,
Artem Bityutskiy (Артём Битюцкий)

2010-01-24 22:38:46

[permalink] [raw]

Subject: Re: [PATCH 1/2] lib: more scalable list_sort()

Artem Bityutskiy <[email protected]> writes:

> Don made IMO a good proposal for the caller to add cond_reshed() in its
> 'cmp()' callback, if needed. Shouldn't that work fine?

In the list_sort() implementation I posted, there was an O(n) loop that
contained no calls back to cmp(). The V2 implementation I'm testing
right now fixes that -- all inner loops will call back to cmp(). It's
also a bit faster, though ~100 bytes bigger.

If there's any other objection to pushing responsibility for calling
cond_resched() back to the client, via the cmp() routine, please, let's
hear it.

2010-01-25 03:42:33

[permalink] [raw]

Subject: Re: [PATCH 1/2] lib: more scalable list_sort()

On Sun, Jan 24, 2010 at 09:59:06PM +0100, Andi Kleen wrote:
> > Similarly, adding cond_resched() to list_sort means you can stop
> > worrying about the scheduling latency impact of sorting long lists.
> > I fail to see why you're making such a big deal out of this.....
>
> It's not easy to add it to a low level library function like list_sort()
> because you would need to ensure that all callers are allowed to sleep.

might_sleep() annotations are typically used in this case....

> Or even if all callers currently allow it future ones might not.

You're trying to make this way more complex than the current
requirements need it to be. Just declaring a simple rule -
"list_sort() can sleep" - and all these problems go away until
someone really needs list sorting in atomic context.

Cheers,

Dave.
--
Dave Chinner
[email protected]

2010-08-04 14:06:59

[permalink] [raw]

Subject: Re: [PATCH 1/2] lib: more scalable list_sort()

On Wed, 2010-01-20 at 20:51 -0800, Don Mullis wrote:
> The use of list_sort() by UBIFS looks like it could generate long
> lists; this alternative implementation scales better, reaching ~3x
> performance gain as list length approaches the L2 cache size.
>
> Stand-alone program timings were run on a Core 2 duo L1=32KB L2=4MB,
> gcc-4.4, with flags extracted from an Ubuntu kernel build. Object
> size is 552 bytes versus 405 for Mark J. Roberts' code.
>
> Worst case for either implementation is a list length just over a POT,
> and to roughly the same degree, so here are results for a range of
> 2^N+1 lengths. List elements were 16 bytes each including malloc
> overhead; random initial order.

This patch breaks UBIFS. I did not have time to dig deeper, but the
symptoms is that list_sort() calls the 'cmp()' function with bogus
'struct list_head *a' parameter, which did not exist in the original
list.

I see this on 2.6.35-rc1. Did not try the release yet. But when I revert
your patch - everything works fine.

--
Best Regards,
Artem Bityutskiy (Артём Битюцкий)

2010-08-07 07:52:57