2010-01-21 04:51:33

by Don Mullis

[permalink] [raw]
Subject: [PATCH 1/2] lib: more scalable list_sort()

The use of list_sort() by UBIFS looks like it could generate long
lists; this alternative implementation scales better, reaching ~3x
performance gain as list length approaches the L2 cache size.

Stand-alone program timings were run on a Core 2 duo L1=32KB L2=4MB,
gcc-4.4, with flags extracted from an Ubuntu kernel build. Object
size is 552 bytes versus 405 for Mark J. Roberts' code.

Worst case for either implementation is a list length just over a POT,
and to roughly the same degree, so here are results for a range of
2^N+1 lengths. List elements were 16 bytes each including malloc
overhead; random initial order.

time (msec)
Roberts
| Mullis
loop_count length | | ratio
2000000 3 188 211 1.122
1000000 5 193 158 0.818
500000 9 232 160 0.689
250000 17 235 161 0.685
125000 33 254 178 0.700
62500 65 284 200 0.704
31250 129 301 213 0.707
15625 257 313 224 0.715
7812 513 332 237 0.713
3906 1025 361 261 0.722
1953 2049 390 276 0.707 ~ L1 size
976 4097 553 323 0.584
488 8193 678 362 0.533
244 16385 771 396 0.513
122 32769 848 430 0.507
61 65537 918 458 0.498
30 131073 1216 550 0.452
15 262145 2557 961 0.375 ~ L2 size
7 524289 5650 1764 0.312
3 1048577 6287 2141 0.340
1 2097153 3650 1588 0.435


Mark's code does not actually implement the usual or generic
mergesort, but rather a variant from Simon Tatham described here:

http://www.chiark.greenend.org.uk/~sgtatham/algorithms/listsort.html

Simon's algorithm performs O(log N) passes over the entire input list,
doing merges of sublists that double in size on each pass. The
generic algorithm instead merges pairs of equal length lists as early
as possible, in recursive order. For either algorithm, the elements
that extend the list beyond power-of-two length are a special case
handled as nearly as possible as a "rounding-up" to a full POT.

Some intuition for the locality of reference implications of merge
order may be gotten by watching this animation:

http://www.sorting-algorithms.com/merge-sort

Simon's algorithm requires only O(1) extra space rather than the
generic algorithm's O(log N), but in my non-recursive implementation
the actual O(log N) data is merely a vector of ~20 pointers, which
I've put on the stack.

Stability of the sort: distinct elements that compare equal emerge
from the sort in the same order as with Mark's code, for simple test
cases.

A kernel that uses drm.ko appears to run normally with this change; I
have no suitable hardware to similarly test the use by UBIFS.

Signed-off-by: Don Mullis <[email protected]>
Cc: Dave Airlie <[email protected]>
Cc: Andi Kleen <[email protected]>
Cc: Dave Chinner <[email protected]>
Cc: Artem Bityutskiy <[email protected]>
---
lib/list_sort.c | 135 +++++++++++++++++++++++++++++---------------------------
1 file changed, 70 insertions(+), 65 deletions(-)

Index: linux-2.6/lib/list_sort.c
===================================================================
--- linux-2.6.orig/lib/list_sort.c 2010-01-19 20:10:31.000000000 -0800
+++ linux-2.6/lib/list_sort.c 2010-01-19 20:19:26.000000000 -0800
@@ -4,94 +4,99 @@
#include <linux/slab.h>
#include <linux/list.h>

+#define MAX_LIST_SIZE_BITS 20
+
+static struct list_head *merge(void *priv,
+ int (*cmp)(void *priv, struct list_head *a,
+ struct list_head *b),
+ struct list_head *a, struct list_head *b)
+{
+ struct list_head head, *tail = &head;
+
+ while (a && b) {
+ /* if equal, take 'a' -- important for sort stability */
+ if ((*cmp)(priv, a, b) <= 0) {
+ tail->next = a;
+ a = a->next;
+ } else {
+ tail->next = b;
+ b = b->next;
+ }
+ tail = tail->next;
+ }
+ tail->next = a?:b;
+ return head.next;
+}
+
+static void restore_back_links(struct list_head *head)
+{
+ struct list_head *p;
+
+ for (p = head; p->next; p = p->next)
+ p->next->prev = p;
+ head->prev = p;
+}
+
/**
* list_sort - sort a list.
* @priv: private data, passed to @cmp
* @head: the list to sort
* @cmp: the elements comparison function
*
- * This function has been implemented by Mark J Roberts <[email protected]>. It
- * implements "merge sort" which has O(nlog(n)) complexity. The list is sorted
- * in ascending order.
+ * This function implements "merge sort" which has O(nlog(n)) complexity.
+ * The list is sorted in ascending order.
*
* The comparison function @cmp is supposed to return a negative value if @a is
* less than @b, and a positive value if @a is greater than @b. If @a and @b
* are equivalent, then it does not matter what this function returns.
*/
void list_sort(void *priv, struct list_head *head,
- int (*cmp)(void *priv, struct list_head *a,
- struct list_head *b))
+ int (*cmp)(void *priv, struct list_head *a,
+ struct list_head *b))
{
- struct list_head *p, *q, *e, *list, *tail, *oldhead;
- int insize, nmerges, psize, qsize, i;
+ struct list_head *part[MAX_LIST_SIZE_BITS+1]; /* sorted partial lists --
+ last slot a sentinel */
+ int lev; /* index into part[] */
+ int max_lev = -1;
+ struct list_head *list;

if (list_empty(head))
return;

+ memset(part, 0, sizeof(part));
+
+ /*
+ * Rather than take time to maintain the back links through the merges,
+ * we'll recreate them afterward. Null-terminate the input list too.
+ */
list = head->next;
- list_del(head);
- insize = 1;
- for (;;) {
- p = oldhead = list;
- list = tail = NULL;
- nmerges = 0;
-
- while (p) {
- nmerges++;
- q = p;
- psize = 0;
- for (i = 0; i < insize; i++) {
- psize++;
- q = q->next == oldhead ? NULL : q->next;
- if (!q)
- break;
- }
+ head->prev->next = NULL;

- qsize = insize;
- while (psize > 0 || (qsize > 0 && q)) {
- if (!psize) {
- e = q;
- q = q->next;
- qsize--;
- if (q == oldhead)
- q = NULL;
- } else if (!qsize || !q) {
- e = p;
- p = p->next;
- psize--;
- if (p == oldhead)
- p = NULL;
- } else if (cmp(priv, p, q) <= 0) {
- e = p;
- p = p->next;
- psize--;
- if (p == oldhead)
- p = NULL;
- } else {
- e = q;
- q = q->next;
- qsize--;
- if (q == oldhead)
- q = NULL;
- }
- if (tail)
- tail->next = e;
- else
- list = e;
- e->prev = tail;
- tail = e;
+ while (list) {
+ struct list_head *cur = list;
+ list = list->next;
+ cur->next = NULL;
+
+ for (lev = 0; part[lev]; lev++) {
+ cur = merge(priv, cmp, part[lev], cur);
+ part[lev] = NULL;
+ }
+ if (lev > max_lev) {
+ if (unlikely(lev >= ARRAY_SIZE(part)-1)) {
+ printk_once(KERN_DEBUG "list passed to"
+ " list_sort() too long for"
+ " efficiency\n");
+ lev--;
}
- p = q;
+ max_lev = lev;
}
+ part[lev] = cur;
+ }

- tail->next = list;
- list->prev = tail;
-
- if (nmerges <= 1)
- break;
+ for (lev = 0; lev <= max_lev; lev++)
+ list = merge(priv, cmp, list, part[lev]);

- insize *= 2;
- }
+ restore_back_links(list);

head->next = list;
head->prev = list->prev;


2010-01-21 05:17:33

by Don Mullis

[permalink] [raw]
Subject: [PATCH 2/2] lib: revise list_sort() comment

Clarify and correct header comment of list_sort().

Signed-off-by: Don Mullis <[email protected]>
Cc: Dave Airlie <[email protected]>
Cc: Andi Kleen <[email protected]>
Cc: Dave Chinner <[email protected]>
Cc: Artem Bityutskiy <[email protected]>
---
lib/list_sort.c | 16 +++++++++-------
1 file changed, 9 insertions(+), 7 deletions(-)

Index: linux-2.6/lib/list_sort.c
===================================================================
--- linux-2.6.orig/lib/list_sort.c 2010-01-19 22:26:03.000000000 -0800
+++ linux-2.6/lib/list_sort.c 2010-01-19 22:28:19.000000000 -0800
@@ -38,17 +38,19 @@ static void restore_back_links(struct li
}

/**
- * list_sort - sort a list.
- * @priv: private data, passed to @cmp
+ * list_sort - sort a list
+ * @priv: private data, opaque to list_sort(), passed to @cmp
* @head: the list to sort
* @cmp: the elements comparison function
*
- * This function implements "merge sort" which has O(nlog(n)) complexity.
- * The list is sorted in ascending order.
+ * This function implements "merge sort", which has O(nlog(n))
+ * complexity.
*
- * The comparison function @cmp is supposed to return a negative value if @a is
- * less than @b, and a positive value if @a is greater than @b. If @a and @b
- * are equivalent, then it does not matter what this function returns.
+ * The comparison function @cmp must return a negative value if @a
+ * should sort before @b, and a positive value if @a should sort after
+ * @b. If @a and @b are equivalent, and their original relative
+ * ordering is to be preserved, @cmp should return 0; otherwise, the
+ * return value does not matter.
*/
void list_sort(void *priv, struct list_head *head,
int (*cmp)(void *priv, struct list_head *a,

2010-01-21 09:24:50

by Artem Bityutskiy

[permalink] [raw]
Subject: Re: [PATCH 1/2] lib: more scalable list_sort()

On Wed, 2010-01-20 at 20:51 -0800, Don Mullis wrote:
> The use of list_sort() by UBIFS looks like it could generate long
> lists; this alternative implementation scales better, reaching ~3x
> performance gain as list length approaches the L2 cache size.
>
> Stand-alone program timings were run on a Core 2 duo L1=32KB L2=4MB,
> gcc-4.4, with flags extracted from an Ubuntu kernel build. Object
> size is 552 bytes versus 405 for Mark J. Roberts' code.
>
> Worst case for either implementation is a list length just over a POT,
> and to roughly the same degree, so here are results for a range of
> 2^N+1 lengths. List elements were 16 bytes each including malloc
> overhead; random initial order.
>

Could you please add a debugging function which would be compiled-out
normally, and which would check that on the output 'list_sort()' gives
really sorted list, and number of elements in the list stays the same.
You'd call this function before returning from list_sort(). Something
like:

#ifdef DEBUG_LIST_SORT
static int list_check(void *priv, struct list_head *head,
int (*cmp)(void *priv, struct list_head *a,
struct list_head *b))
{
/* Checking */
}
#else
#define list_check(priv, head, cmp) 0
#endif

This will provide more confidence in the algorithm correctness for
everyone who modifies 'list_sort()'.

--
Best Regards,
Artem Bityutskiy (Артём Битюцкий)

2010-01-21 09:55:16

by Dave Chinner

[permalink] [raw]
Subject: Re: [PATCH 1/2] lib: more scalable list_sort()

On Thu, Jan 21, 2010 at 11:22:55AM +0200, Artem Bityutskiy wrote:
> On Wed, 2010-01-20 at 20:51 -0800, Don Mullis wrote:
> > The use of list_sort() by UBIFS looks like it could generate long
> > lists; this alternative implementation scales better, reaching ~3x
> > performance gain as list length approaches the L2 cache size.
> >
> > Stand-alone program timings were run on a Core 2 duo L1=32KB L2=4MB,
> > gcc-4.4, with flags extracted from an Ubuntu kernel build. Object
> > size is 552 bytes versus 405 for Mark J. Roberts' code.
> >
> > Worst case for either implementation is a list length just over a POT,
> > and to roughly the same degree, so here are results for a range of
> > 2^N+1 lengths. List elements were 16 bytes each including malloc
> > overhead; random initial order.
> >
>
> Could you please add a debugging function which would be compiled-out
> normally, and which would check that on the output 'list_sort()' gives
> really sorted list, and number of elements in the list stays the same.
> You'd call this function before returning from list_sort(). Something
> like:
>
> #ifdef DEBUG_LIST_SORT
> static int list_check(void *priv, struct list_head *head,
> int (*cmp)(void *priv, struct list_head *a,
> struct list_head *b))
> {
> /* Checking */
> }
> #else
> #define list_check(priv, head, cmp) 0
> #endif
>
> This will provide more confidence in the algorithm correctness for
> everyone who modifies 'list_sort()'.

I'd suggest the same method as employed in lib/sort.c - a
simple userspace program that verifies correct operation is included
in lib/sort.c....

Cheers,

Dave.
--
Dave Chinner
[email protected]

2010-01-21 11:47:21

by Artem Bityutskiy

[permalink] [raw]
Subject: Re: [PATCH 1/2] lib: more scalable list_sort()

On Thu, 2010-01-21 at 20:54 +1100, Dave Chinner wrote:
> On Thu, Jan 21, 2010 at 11:22:55AM +0200, Artem Bityutskiy wrote:
> > On Wed, 2010-01-20 at 20:51 -0800, Don Mullis wrote:
> > > The use of list_sort() by UBIFS looks like it could generate long
> > > lists; this alternative implementation scales better, reaching ~3x
> > > performance gain as list length approaches the L2 cache size.
> > >
> > > Stand-alone program timings were run on a Core 2 duo L1=32KB L2=4MB,
> > > gcc-4.4, with flags extracted from an Ubuntu kernel build. Object
> > > size is 552 bytes versus 405 for Mark J. Roberts' code.
> > >
> > > Worst case for either implementation is a list length just over a POT,
> > > and to roughly the same degree, so here are results for a range of
> > > 2^N+1 lengths. List elements were 16 bytes each including malloc
> > > overhead; random initial order.
> > >
> >
> > Could you please add a debugging function which would be compiled-out
> > normally, and which would check that on the output 'list_sort()' gives
> > really sorted list, and number of elements in the list stays the same.
> > You'd call this function before returning from list_sort(). Something
> > like:
> >
> > #ifdef DEBUG_LIST_SORT
> > static int list_check(void *priv, struct list_head *head,
> > int (*cmp)(void *priv, struct list_head *a,
> > struct list_head *b))
> > {
> > /* Checking */
> > }
> > #else
> > #define list_check(priv, head, cmp) 0
> > #endif
> >
> > This will provide more confidence in the algorithm correctness for
> > everyone who modifies 'list_sort()'.
>
> I'd suggest the same method as employed in lib/sort.c - a
> simple userspace program that verifies correct operation is included
> in lib/sort.c....

Yeah, that's also an option.

--
Best Regards,
Artem Bityutskiy (Артём Битюцкий)

2010-01-21 16:34:49

by Don Mullis

[permalink] [raw]
Subject: Re: [PATCH 1/2] lib: more scalable list_sort()

Artem Bityutskiy <[email protected]> writes:

> On Thu, 2010-01-21 at 20:54 +1100, Dave Chinner wrote:
>> On Thu, Jan 21, 2010 at 11:22:55AM +0200, Artem Bityutskiy wrote:
>> >
>> > Could you please add a debugging function which would be compiled-out
>> > normally, and which would check that on the output 'list_sort()' gives
>> > really sorted list, and number of elements in the list stays the same.
>> > You'd call this function before returning from list_sort(). Something
>> > like:
>> >
>> > #ifdef DEBUG_LIST_SORT
>> > static int list_check(void *priv, struct list_head *head,
>> > int (*cmp)(void *priv, struct list_head *a,
>> > struct list_head *b))
>> > {
>> > /* Checking */
>> > }
>> > #else
>> > #define list_check(priv, head, cmp) 0
>> > #endif
>> >
>> > This will provide more confidence in the algorithm correctness for
>> > everyone who modifies 'list_sort()'.
>>
>> I'd suggest the same method as employed in lib/sort.c - a
>> simple userspace program that verifies correct operation is included
>> in lib/sort.c....
>
> Yeah, that's also an option.

Okay. The regression test in lib/sort.c is kernel-space, run once at
boot. I'd like to do something similar for lib/list_sort.c, conditioned
on DEBUG_LIST_SORT. I would extend the testing to verify stability as
well as sort order and number of elements.

2010-01-21 17:59:20

by Andi Kleen

[permalink] [raw]
Subject: Re: [PATCH 1/2] lib: more scalable list_sort()

On Wed, Jan 20, 2010 at 08:51:26PM -0800, Don Mullis wrote:
> The use of list_sort() by UBIFS looks like it could generate long
> lists; this alternative implementation scales better, reaching ~3x
> performance gain as list length approaches the L2 cache size.

If this can really be called with long lists
the function likely needs (optional) need_resched()s
Otherwise it could ruin scheduling latencies.

-Andi

2010-01-21 19:33:32

by Olaf Titz

[permalink] [raw]
Subject: Re: [PATCH 2/2] lib: revise list_sort() comment

> + * The comparison function @cmp must return a negative value if @a
> + * should sort before @b, and a positive value if @a should sort after
> + * @b. If @a and @b are equivalent, and their original relative
> + * ordering is to be preserved, @cmp should return 0; otherwise, the
> + * return value does not matter.

This "otherwise... does not matter" sounds funny and confusing. Either
read this as "the return value does not matter if it is neither <0, >0
or ==0" or "the return value does not matter if the function wants it
to be ignored". :-)

Just omitting the "otherwise" clause would be clearer.

Olaf

2010-01-22 03:17:22

by Don Mullis

[permalink] [raw]
Subject: Re: [PATCH 1/2] lib: more scalable list_sort()

Andi Kleen <[email protected]> writes:

> On Wed, Jan 20, 2010 at 08:51:26PM -0800, Don Mullis wrote:
>> The use of list_sort() by UBIFS looks like it could generate long
>> lists; this alternative implementation scales better, reaching ~3x
>> performance gain as list length approaches the L2 cache size.
>
> If this can really be called with long lists
> the function likely needs (optional) need_resched()s
> Otherwise it could ruin scheduling latencies.
>
> -Andi

Being just a dumb library routine, list_sort() has no idea what context
it's been called in, how long a list a particular client could pass in,
nor how expensive the client's cmp() callback might be.

The cmp() callback already passes back a client-private pointer.
Hanging off of this could be a count of calls, or timing information,
maintained by the client. Whenever some threshold is reached, the
client's cmp() could do whatever good CPU-sharing citizenship required.

This doesn't address the final O(n) pass over the list to restore the
back links. So the cost of that pass would dictate the upper limit on
list length for a client already using the cmp() call-counting/timing
trick to break up the earlier compare-and-merge passes.

If that's not good enough, a more complicated solution would be
required. But I'm hoping we don't need to go there yet.

2010-01-22 04:54:47

by Don Mullis

[permalink] [raw]
Subject: Re: [PATCH 2/2] lib: revise list_sort() comment

Olaf Titz <[email protected]> writes:

>> + * The comparison function @cmp must return a negative value if @a
>> + * should sort before @b, and a positive value if @a should sort after
>> + * @b. If @a and @b are equivalent, and their original relative
>> + * ordering is to be preserved, @cmp should return 0; otherwise, the
>> + * return value does not matter.
>
> This "otherwise... does not matter" sounds funny and confusing. Either
> read this as "the return value does not matter if it is neither <0, >0
> or ==0" or "the return value does not matter if the function wants it
> to be ignored". :-)
>
> Just omitting the "otherwise" clause would be clearer.
>
> Olaf

Okay, I will simplify the wording.

2010-01-22 10:43:25

by Andi Kleen

[permalink] [raw]
Subject: Re: [PATCH 1/2] lib: more scalable list_sort()

Don Mullis <[email protected]> writes:
>
> Being just a dumb library routine, list_sort() has no idea what context
> it's been called in, how long a list a particular client could pass in,
> nor how expensive the client's cmp() callback might be.
>
> The cmp() callback already passes back a client-private pointer.
> Hanging off of this could be a count of calls, or timing information,
> maintained by the client. Whenever some threshold is reached, the
> client's cmp() could do whatever good CPU-sharing citizenship required.

need_resched() does all the timing/thresholding (it checks the
reschedule flag set by the timer interrupt). You just have to call it.
But preferable not in the inner loop, but in a outer one. It's
not hyper-expensive, but it's not free either.

The drawback is that if it's called the context always has to
allow sleeping, so it might need to be optional.

Anyways a better fix might be simply to ensure in the caller
that lists never get as long that they become a scheduling
hazard. But you indicated that ubifs would pass very long lists?
Perhaps ubifs (and other calls who might have that problem) simply
needs to be fixed.

-Andi

--
[email protected] -- Speaking for myself only.

2010-01-22 12:29:51

by Artem Bityutskiy

[permalink] [raw]
Subject: Re: [PATCH 1/2] lib: more scalable list_sort()

On Fri, 2010-01-22 at 11:43 +0100, Andi Kleen wrote:
> Don Mullis <[email protected]> writes:
> >
> > Being just a dumb library routine, list_sort() has no idea what context
> > it's been called in, how long a list a particular client could pass in,
> > nor how expensive the client's cmp() callback might be.
> >
> > The cmp() callback already passes back a client-private pointer.
> > Hanging off of this could be a count of calls, or timing information,
> > maintained by the client. Whenever some threshold is reached, the
> > client's cmp() could do whatever good CPU-sharing citizenship required.
>
> need_resched() does all the timing/thresholding (it checks the
> reschedule flag set by the timer interrupt). You just have to call it.
> But preferable not in the inner loop, but in a outer one. It's
> not hyper-expensive, but it's not free either.
>
> The drawback is that if it's called the context always has to
> allow sleeping, so it might need to be optional.
>
> Anyways a better fix might be simply to ensure in the caller
> that lists never get as long that they become a scheduling
> hazard. But you indicated that ubifs would pass very long lists?
> Perhaps ubifs (and other calls who might have that problem) simply
> needs to be fixed.

No, they are not very long. A hundred or so I guess, rarely. But we need
to check what is really the worst case, but it should not be too many.

--
Best Regards,
Artem Bityutskiy (Артём Битюцкий)

2010-01-22 17:55:48

by Don Mullis

[permalink] [raw]
Subject: Re: [PATCH 1/2] lib: more scalable list_sort()

Artem Bityutskiy <[email protected]> writes:

> On Fri, 2010-01-22 at 11:43 +0100, Andi Kleen wrote:
>> Don Mullis <[email protected]> writes:
>> >
>> > Being just a dumb library routine, list_sort() has no idea what context
>> > it's been called in, how long a list a particular client could pass in,
>> > nor how expensive the client's cmp() callback might be.
>> >
>> > The cmp() callback already passes back a client-private pointer.
>> > Hanging off of this could be a count of calls, or timing information,
>> > maintained by the client. Whenever some threshold is reached, the
>> > client's cmp() could do whatever good CPU-sharing citizenship required.
>>
>> need_resched() does all the timing/thresholding (it checks the
>> reschedule flag set by the timer interrupt). You just have to call it.
>> But preferable not in the inner loop, but in a outer one. It's
>> not hyper-expensive, but it's not free either.
>>
>> The drawback is that if it's called the context always has to
>> allow sleeping, so it might need to be optional.
>>
>> Anyways a better fix might be simply to ensure in the caller
>> that lists never get as long that they become a scheduling
>> hazard. But you indicated that ubifs would pass very long lists?
>> Perhaps ubifs (and other calls who might have that problem) simply
>> needs to be fixed.
>
> No, they are not very long. A hundred or so I guess, rarely. But we need
> to check what is really the worst case, but it should not be too many.

I suggest for now we leave scheduling issues as the caller's
responsibility, and keep list_sort() simple. Wouldn't want to be
getting any email like this:

http://lwn.net/Articles/366768/

2010-01-23 08:29:21

by Dave Chinner

[permalink] [raw]
Subject: Re: [PATCH 1/2] lib: more scalable list_sort()

On Fri, Jan 22, 2010 at 11:43:21AM +0100, Andi Kleen wrote:
> Don Mullis <[email protected]> writes:
> >
> > Being just a dumb library routine, list_sort() has no idea what context
> > it's been called in, how long a list a particular client could pass in,
> > nor how expensive the client's cmp() callback might be.
> >
> > The cmp() callback already passes back a client-private pointer.
> > Hanging off of this could be a count of calls, or timing information,
> > maintained by the client. Whenever some threshold is reached, the
> > client's cmp() could do whatever good CPU-sharing citizenship required.
>
> need_resched() does all the timing/thresholding (it checks the
> reschedule flag set by the timer interrupt). You just have to call it.
> But preferable not in the inner loop, but in a outer one. It's
> not hyper-expensive, but it's not free either.
>
> The drawback is that if it's called the context always has to
> allow sleeping, so it might need to be optional.
>
> Anyways a better fix might be simply to ensure in the caller
> that lists never get as long that they become a scheduling
> hazard. But you indicated that ubifs would pass very long lists?
> Perhaps ubifs (and other calls who might have that problem) simply
> needs to be fixed.

Burning CPU time to save on IO is a very valid tradeoff in
filesystem design - burning a few hundred millieseconds of CPU
time can result in savcwinge tens of seconds of IO time. Hence
passing big long lists to be sorted is not an indication of broken
design, it's an indication of understanding CPU time vs IO time
tradeoffs during design...

For example, the use I have for list sort (sorting delayed write
buffers in ascending block order before Io submission) will result
in XFS asking for lists in the order of tens to hundreds of
thousands of items to be sorted. The code I already is showing wall
clock time reductions of 30-40% for stuff like kernel tarball
extraction or creation of millions of small files. This is because
the number of buffers being issued for IO is far larger than we
should sanely hold in the elevator request queues, but pre-sorting
them results in the elevator getting a near 100% merge efficiency of
delayed write buffers instead of near zero. Hence we get much, much
better IO patterns out of the filesystem and things go faster.

Hence I think scalability and minimising scheduling latency for
list_sort() is definitely important. I was kind of planning to address
this when the problem was a performance limiting factor, but Don
has made a pre-emptive strike. ;)

Cheers,

Dave.
--
Dave Chinner
[email protected]

2010-01-23 11:35:53

by Andi Kleen

[permalink] [raw]
Subject: Re: [PATCH 1/2] lib: more scalable list_sort()

> Burning CPU time to save on IO is a very valid tradeoff in
> filesystem design - burning a few hundred millieseconds of CPU
> time can result in savcwinge tens of seconds of IO time. Hence
> passing big long lists to be sorted is not an indication of broken
> design, it's an indication of understanding CPU time vs IO time
> tradeoffs during design...

Burning long CPU time in kernel code without latency breaker code is always
a sign of broken design. When you burn you have to check for
reschedules. It's that simple.

-Andi

--
[email protected] -- Speaking for myself only.

2010-01-23 16:06:13

by Dave Chinner

[permalink] [raw]
Subject: Re: [PATCH 1/2] lib: more scalable list_sort()

On Sat, Jan 23, 2010 at 12:35:51PM +0100, Andi Kleen wrote:
> > Burning CPU time to save on IO is a very valid tradeoff in
> > filesystem design - burning a few hundred millieseconds of CPU
> > time can result in savcwinge tens of seconds of IO time. Hence
> > passing big long lists to be sorted is not an indication of broken
> > design, it's an indication of understanding CPU time vs IO time
> > tradeoffs during design...
>
> Burning long CPU time in kernel code without latency breaker code is always
> a sign of broken design.

It's a characteristic of a sub-optimal implementation, not bad
design. Plenty of code has been fixed over the years simply by
adding cond_resched() to loops that have triggered latency
warnings.

Similarly, adding cond_resched() to list_sort means you can stop
worrying about the scheduling latency impact of sorting long lists.
I fail to see why you're making such a big deal out of this.....

Cheers

Dave.
--
Dave Chinner
[email protected]

2010-01-24 20:59:16

by Andi Kleen

[permalink] [raw]
Subject: Re: [PATCH 1/2] lib: more scalable list_sort()

> Similarly, adding cond_resched() to list_sort means you can stop
> worrying about the scheduling latency impact of sorting long lists.
> I fail to see why you're making such a big deal out of this.....

It's not easy to add it to a low level library function like list_sort()
because you would need to ensure that all callers are allowed to sleep.

Or even if all callers currently allow it future ones might not.

So it would need a new flag or so if needed, completely changing
the interface.

-Andi

--
[email protected] -- Speaking for myself only.

2010-01-24 21:10:41

by Artem Bityutskiy

[permalink] [raw]
Subject: Re: [PATCH 1/2] lib: more scalable list_sort()

On Sun, 2010-01-24 at 21:59 +0100, Andi Kleen wrote:
> > Similarly, adding cond_resched() to list_sort means you can stop
> > worrying about the scheduling latency impact of sorting long lists.
> > I fail to see why you're making such a big deal out of this.....
>
> It's not easy to add it to a low level library function like list_sort()
> because you would need to ensure that all callers are allowed to sleep.
>
> Or even if all callers currently allow it future ones might not.
>
> So it would need a new flag or so if needed, completely changing
> the interface.

Don made IMO a good proposal for the caller to add cond_reshed() in its
'cmp()' callback, if needed. Shouldn't that work fine?

--
Best Regards,
Artem Bityutskiy (Артём Битюцкий)

2010-01-24 22:38:46

by Don Mullis

[permalink] [raw]
Subject: Re: [PATCH 1/2] lib: more scalable list_sort()

Artem Bityutskiy <[email protected]> writes:

> Don made IMO a good proposal for the caller to add cond_reshed() in its
> 'cmp()' callback, if needed. Shouldn't that work fine?

In the list_sort() implementation I posted, there was an O(n) loop that
contained no calls back to cmp(). The V2 implementation I'm testing
right now fixes that -- all inner loops will call back to cmp(). It's
also a bit faster, though ~100 bytes bigger.

If there's any other objection to pushing responsibility for calling
cond_resched() back to the client, via the cmp() routine, please, let's
hear it.

2010-01-25 03:42:33

by Dave Chinner

[permalink] [raw]
Subject: Re: [PATCH 1/2] lib: more scalable list_sort()

On Sun, Jan 24, 2010 at 09:59:06PM +0100, Andi Kleen wrote:
> > Similarly, adding cond_resched() to list_sort means you can stop
> > worrying about the scheduling latency impact of sorting long lists.
> > I fail to see why you're making such a big deal out of this.....
>
> It's not easy to add it to a low level library function like list_sort()
> because you would need to ensure that all callers are allowed to sleep.

might_sleep() annotations are typically used in this case....

> Or even if all callers currently allow it future ones might not.

You're trying to make this way more complex than the current
requirements need it to be. Just declaring a simple rule -
"list_sort() can sleep" - and all these problems go away until
someone really needs list sorting in atomic context.

Cheers,

Dave.
--
Dave Chinner
[email protected]

2010-08-04 14:06:59

by Artem Bityutskiy

[permalink] [raw]
Subject: Re: [PATCH 1/2] lib: more scalable list_sort()

On Wed, 2010-01-20 at 20:51 -0800, Don Mullis wrote:
> The use of list_sort() by UBIFS looks like it could generate long
> lists; this alternative implementation scales better, reaching ~3x
> performance gain as list length approaches the L2 cache size.
>
> Stand-alone program timings were run on a Core 2 duo L1=32KB L2=4MB,
> gcc-4.4, with flags extracted from an Ubuntu kernel build. Object
> size is 552 bytes versus 405 for Mark J. Roberts' code.
>
> Worst case for either implementation is a list length just over a POT,
> and to roughly the same degree, so here are results for a range of
> 2^N+1 lengths. List elements were 16 bytes each including malloc
> overhead; random initial order.

This patch breaks UBIFS. I did not have time to dig deeper, but the
symptoms is that list_sort() calls the 'cmp()' function with bogus
'struct list_head *a' parameter, which did not exist in the original
list.

I see this on 2.6.35-rc1. Did not try the release yet. But when I revert
your patch - everything works fine.

--
Best Regards,
Artem Bityutskiy (Артём Битюцкий)

2010-08-07 07:52:57

by Artem Bityutskiy

[permalink] [raw]
Subject: Re: [PATCH 1/2] lib: more scalable list_sort()

On Wed, 2010-08-04 at 17:04 +0300, Artem Bityutskiy wrote:
> On Wed, 2010-01-20 at 20:51 -0800, Don Mullis wrote:
> > The use of list_sort() by UBIFS looks like it could generate long
> > lists; this alternative implementation scales better, reaching ~3x
> > performance gain as list length approaches the L2 cache size.
> >
> > Stand-alone program timings were run on a Core 2 duo L1=32KB L2=4MB,
> > gcc-4.4, with flags extracted from an Ubuntu kernel build. Object
> > size is 552 bytes versus 405 for Mark J. Roberts' code.
> >
> > Worst case for either implementation is a list length just over a POT,
> > and to roughly the same degree, so here are results for a range of
> > 2^N+1 lengths. List elements were 16 bytes each including malloc
> > overhead; random initial order.
>
> This patch breaks UBIFS. I did not have time to dig deeper, but the
> symptoms is that list_sort() calls the 'cmp()' function with bogus
> 'struct list_head *a' parameter, which did not exist in the original
> list.

Sorry, it appeared to be that UBIFS 'cmp()' function is broken, so your
patches revealed the issue. Sorry for noise and thanks for revealing
UBIFS problem!

--
Best Regards,
Artem Bityutskiy (Артём Битюцкий)