2004-04-02 08:21:37

by Antony Suter

Subject: 2.6.5-rc3-as1 patchset, cks5.2, cfq, aa1, and some wli


Here is an update to my tidy set of patches. I was a fan of WLI's
patchset until he discontinued it in the 2.6.0-test era. It was a
small set of patches with performance improvements for laptops and NUMA
machines, amongst others... (wish I had a laptop NUMA machine *cough*).
Some of those patches are now in the kernel proper. Others have been
updated and are found elsewhere; for example, the objrmap series can be
found in Andrea Arcangeli's -aa series. I'm starting to add some of the
other patches from WLI's last release, depending on my ability to
resolve rejections. The patch numbers below correspond directly to those
in linux-2.6.0-test11-wli-1.tar.bz2

Patches were applied in the following order:
- Con Kolivas' new staircase cpu scheduler patch 5.2
- Jens Axboe's cfq io scheduler
- Andrea Arcangeli's 2.6.5-rc3-aa2.bz2
< from linux-2.6.0-test11-wli-1 >
- #17 convert copy_strings() to use kmap_atomic() instead of kmap()
- #19 node-local i386 per_cpu areas
- #22 increase static vfs hashtable and VM array sizes
- #24 /proc/ BKL gunk plus page wait hashtable sizing adjustment
- #25 invalidate_inodes() speedup

Links:
http://www.users.on.net/sutera/2.6.5-rc3-as1.patch.gz
http://www.users.on.net/sutera/2.6.5-rc3-as1.patch.gz.sign

Note that to use the cfq scheduler you need to add "elevator=cfq" to your
kernel command line. This is usually done in your lilo or grub (or
equivalent) config.
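For instance (the kernel image names and root device below are illustrative,
not taken from the patchset), a GRUB legacy menu.lst entry and the lilo.conf
equivalent might look like:

```
# /boot/grub/menu.lst (GRUB legacy): append elevator=cfq to the kernel line
title   Linux 2.6.5-rc3-as1
        root (hd0,0)
        kernel /boot/vmlinuz-2.6.5-rc3-as1 root=/dev/hda1 elevator=cfq

# /etc/lilo.conf equivalent; remember to re-run /sbin/lilo afterwards
image=/boot/vmlinuz-2.6.5-rc3-as1
        label=as1
        append="elevator=cfq"
```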

--
- Antony Suter (sutera internode on net) "Bonta"
- "...through shadows falling, out of memory and time..."


2004-04-02 10:14:29

by William Lee Irwin III

Subject: Re: 2.6.5-rc3-as1 patchset, cks5.2, cfq, aa1, and some wli

On Fri, Apr 02, 2004 at 06:21:25PM +1000, Antony Suter wrote:
> Here is an update to my tidy set of patches. I was a fan of WLI's
> patchset until he discontinued it in the 2.6.0-test era. It was a
> small set of patches with performance improvements for laptops and NUMA
> machines, amongst others... (wish I had a laptop NUMA machine *cough*).
> Some of those patches are now in the kernel proper. Others have been
> updated and are found elsewhere; for example, the objrmap series can be
> found in Andrea Arcangeli's -aa series. I'm starting to add some of the
> other patches from WLI's last release, depending on my ability to
> resolve rejections. The patch numbers below correspond directly to those
> in linux-2.6.0-test11-wli-1.tar.bz2

Ouch! Please, use either Hugh's or Andrea's up-to-date patches.
anobjrmap (and actually the vast majority of this material) was not
original. You're also unlikely to find highpmd and a number of others
useful without highmem and/or ia32 NUMA.

I'm honestly not sure how you managed to merge anything with all the
highpmd/O(1) proc_pid_statm()/anobjrmap bits in there. Heck, I lost
the stamina to keep it going myself.


On Fri, Apr 02, 2004 at 06:21:25PM +1000, Antony Suter wrote:
> Patches were applied in the following order:
> - Con Kolivas' new staircase cpu scheduler patch 5.2
> - Jens Axboe's cfq io scheduler
> - Andrea Arcangeli's 2.6.5-rc3-aa2.bz2

Phew, aa's bits should be maintained/updated/bugfixed.


On Fri, Apr 02, 2004 at 06:21:25PM +1000, Antony Suter wrote:
> < from linux-2.6.0-test11-wli-1 >
> - #17 convert copy_strings() to use kmap_atomic() instead of kmap()
> - #19 node-local i386 per_cpu areas

These two are completely useless unless you're running on bigfathighmem.


On Fri, Apr 02, 2004 at 06:21:25PM +1000, Antony Suter wrote:
> - #22 increase static vfs hashtable and VM array sizes
> - #24 /proc/ BKL gunk plus page wait hashtable sizing adjustment
> - #25 invalidate_inodes() speedup

#25 is in -mm and may have bugfixes/updates relative to whatever I had.
#22 and #24 don't have very definite impacts; I only saw any difference
with truly massive amounts of IO in-flight (e.g. 25GB/48GB), and while
it wasn't a clear win, it moved around the blocking points surrounding
IO submission to places where I'd rather it go. This is a rather foggy
issue you probably shouldn't be concerned with. If you do see it make a
difference, I'll be surprised, but happy to see the benchmark numbers
demonstrating it.

To understand what was really going on with all this, it helps to know
why this thing was put together. That was to optimize a benchmark I can't
name without having problems with respect to publication rules and so on.
This was actually done in somewhat of a hurry, as it was meant more as a
demonstration of benchmarking methodology than as a source of important
results in its own right, but then the results were maintained for
far longer than the actual optimization effort lasted, as they appeared
to be valuable. I believed the results of this to be potentially useful
to end users because the benchmark was a simulation of interactive
workloads where people interacted with shells and spawned short-running
jobs meant to be typical for shellservers in university-like
environments, though the age of the benchmark hurt its relevance quite
a bit. Active development was pursued on other things while the -wli
patches were maintained. As it aged, most of the highly experimental
parts were backed out instead of debugged to address stability issues,
and eventually the entire patch cocktail imploded as the series of 10-15
patches in a row that stomped over every line modifying a user pte
developed poor interactions I didn't have the bandwidth to address in
addition to my regular duties.

While the optimization effort was ongoing, my general approach was to
hunt for patches to forward port or "combine" as opposed to producing
anything original. Many of these things were motivated by a combination
of a priori reasoning about what was going on and various profile-based
hints. For instance, on ia32 NUMA, all lowmem is on node 0. I observed
that pagetable _teardown_ was expensive, specifically in the loops over
pmd's that remove pagetable pages. My attack was to hoist the pmd's into
node-local memory by shoving them into highmem, where Andrea's pte-highmem
and Arjan and Ingo's highpte were both strong precedents. I combined the
two approaches, as Andrea placed pmd's in highmem, while Arjan and Ingo
used kmap_atomic() and the like to avoid kmap_lock overhead and so on.
All that was, in fact, after some abortive attempts to punt pagetable
teardown to keventd, which while mechanically successful (i.e. the code
worked) was not effective as a performance improvement.

Many of the other patches were even more direct equivalents of some
predecessor. For instance, this unnameable benchmark spawns ps(1) very
frequently to simulate users monitoring their own workloads. This then
very heavily stresses the /proc/ VM reporting code and the /proc/ vfs
code. To address this, I ported whatever vfs RCU code I could that
maneesh and dipankar had written, and _also_ ported bcrl's O(1)
proc_pid_statm() from RH's 2.4.9, which resolved semaphore contention
issues and more general algorithmic efficiency issues in /proc/
reporting. With that and various BKL-related /proc/ adjustments, the
/proc/-stressing components were sped up greatly. This differs a
lot from other attacks on this benchmark, where the benchmark is
altered so the parts stressing /proc/ are removed. I also had in the
back of my mind the notion that /proc/ performance improvements would
be appreciated by end users with limited cpu power to devote to the
monitoring of their workloads and machines' performance, which is part
of what motivated me to do it "the hard way" instead of modifying the
benchmark or replacing the userspace procps utilities with /dev/kmem
-diving utilities.

The general points this is all meant to illustrate are that some of the
cherrypicking going on doesn't really make sense, and to give the
background on where all this stuff came from so you can understand
which parts are going to be useful to you if you do choose to cherrypick
them. I very much regret not arranging relative benchmark results to
post, as they are very impressive for not having exploited the extreme
NUMA characteristics of the test machines. In the very strict sense of
the slope of the curve as the number of processors increases, the
original patch set was measured to literally double the kernel's
scalability in this benchmark, which is something I'm rather proud of.
There were other approaches which exploited the NUMA hardware aspects
to achieve more drastic results with less code, but had more limited
applicability as non-NUMA machines didn't benefit from them at all.

Also, there will be new -wli's. They will be vastly different in nature
from the prior -wli's. I don't like repeating myself. I already
acknowledged the precedents available to me in the 2.5.74 era. The new
-wli's won't be as heavily influenced by precedents and will be of a
substantially different character from the prior releases. I'm taking my
time to do this and for a good reason. I don't want to do it half-assed.

It may not be VM. It may not be any one thing. What it _will_ be (unlike
some of the prior -wli code) is up to my own personal coding standards,
which you may rest assured are rather high.

And finally, even with all this longwinded harangue, congratulations on
your tree. There are very definite feelings of importance, and of
satisfaction at having done service, that come from producing releases
others rely upon. And these are real, as real users do benefit from what you've
assembled. I'm more than happy to help if you have bugreports in any
code I maintained or other need to call on me. And whatever precedent I
may have provided, you do own this, and this is your own original work.


-- wli

2004-04-02 16:43:57

by Antony Suter

Subject: Re: 2.6.5-rc3-as1 patchset, cks5.2, cfq, aa1, and some wli

On Fri, 2004-04-02 at 20:14, William Lee Irwin III wrote:
> On Fri, Apr 02, 2004 at 06:21:25PM +1000, Antony Suter wrote:
> > Some of those patches are now in the kernel proper. Others have been
> > updated and are found elsewhere; for example, the objrmap series can be
> > found in Andrea Arcangeli's -aa series. I'm starting to add some of the
> > other patches from WLI's last release, depending on my ability to
> > resolve rejections. The numbers relate directly to those from
> > linux-2.6.0-test11-wli-1.tar.bz2
>
> Ouch! Please, use either Hugh's or Andrea's up-to-date patches.
> anobjrmap (and actually the vast majority of this material) was not
> original. You're also unlikely to find highpmd and a number of others
> useful without highmem and/or ia32 NUMA.

I certainly want to use the most up to date versions if others have
continued that work. Pointers appreciated.

> On Fri, Apr 02, 2004 at 06:21:25PM +1000, Antony Suter wrote:
> > < from linux-2.6.0-test11-wli-1 >
> > - #17 convert copy_strings() to use kmap_atomic() instead of kmap()
> > - #19 node-local i386 per_cpu areas
>
> These two are completely useless unless you're running on bigfathighmem.
>
> On Fri, Apr 02, 2004 at 06:21:25PM +1000, Antony Suter wrote:
> > - #22 increase static vfs hashtable and VM array sizes
> > - #24 /proc/ BKL gunk plus page wait hashtable sizing adjustment
> > - #25 invalidate_inodes() speedup
>
> #25 is in -mm and may have bugfixes/updates relative to whatever I had.
> #22 and #24 don't have very definite impacts; I only saw any difference
> with truly massive amounts of IO in-flight (e.g. 25GB/48GB), and while
> it wasn't a clear win, it moved around the blocking points surrounding
> IO submission to places where I'd rather it go. This is a rather foggy
> issue you probably shouldn't be concerned with. If you do see it make a
> difference, I'll be surprised, but happy to see the benchmark numbers
> demonstrating it.

Could you add some outlines of your patches #02, #03, #18 and #28
please?

> [...]
> altered so the parts stressing /proc/ are removed. I also had in the
> back of my mind the notion that /proc/ performance improvements would
> be appreciated by end users with limited cpu power to devote to the
> monitoring of their workloads and machines' performance, which is part
> of what motivated me to do it "the hard way" instead of modifying the
> benchmark or replacing the userspace procps utilities with /dev/kmem
> -diving utilities.

How important would improvements to /proc be now that we have /sys?

> The general points this is all meant to illustrate are that some of the
> cherrypicking going on doesn't really make sense, and to give the
> background on where all this stuff came from so you can understand
> which parts are going to be useful to you if you do choose to cherrypick
> them.

I certainly want to grok the purpose of each and every patch. This
release might have been larger and out sooner if not for a reverse
patch cascade meltdown.

> I very much regret not arranging relative benchmark results to
> post, as they are very impressive for not having exploited the extreme
> NUMA characteristics of the test machines. In the very strict sense of
> the slope of the curve as the number of processors increases, the
> original patch set was measured to literally double the kernel's
> scalability in this benchmark, which is something I'm rather proud of.
> There were other approaches which exploited the NUMA hardware aspects
> to achieve more drastic results with less code, but had more limited
> applicability as non-NUMA machines didn't benefit from them at all.

For as long as this series continues, I would want to include patches
that improve performance in any area, so long as the overall effect is
positive. And no improvement too great or small. I must have more power.

> Also, there will be new -wli's. They will be vastly different in nature
> from the prior -wli's. I don't like repeating myself. I already

I look forward to it ;)

> And finally, even with all this longwinded harangue, congratulations on
> your tree. There are very definite feelings of importance and
> satisfaction of having done service from producing releases others rely
> upon. And these are real, as real users do benefit from what you've
> assembled. I'm more than happy to help if you have bugreports in any
> code I maintained or other need to call on me. And whatever precedent I
> may have provided, you do own this, and this is your own original work.

Thanks for your kind words, and detailed notes! Again, any pointers to
similar sorts of work would be greatly appreciated. Can you recommend
any good tools for patch set management?

Cheers.

--
- Antony Suter (suterant users sourceforge net) "Bonta"
- "...through shadows falling, out of memory and time..."



2004-04-02 23:45:19

by William Lee Irwin III

Subject: Re: 2.6.5-rc3-as1 patchset, cks5.2, cfq, aa1, and some wli

On Sat, Apr 03, 2004 at 02:43:28AM +1000, Antony Suter wrote:
> Could you add some outlines of your patches #02, #03, #18 and #28
> please?

#2 rewrote the page allocator to do deferred coalescing, which had the
additional advantage of making transfers of groups of pages to and from
the lists protected by zone->lock into expected O(1) operations. It also
exported the new functionality of O(1) batched page freeing to callers,
which was utilized by #3.

#3 implemented caching of preconstructed leaf pagetable nodes in a manner
compatible with highpte. This was supposed to conserve cache, and may
improve performance on loads that repetitively fork() and exit().

#18 just micro-optimized some page allocator logic and enlarged the
batches so as to take advantage of the operations newly made O(1) by #2
which would otherwise have been expensive with large batches. It's not
actually useful without #2 in place.

#28 put all scheduling primitives into their own ELF sections,
delimited by new marker symbols, and used those to improve
/proc/$PID/wchan reporting, so that various scheduling primitives that
aren't currently skipped get skipped over, and scheduling functions no
longer need to be contiguous in the kernel's text segment. I've
resubmitted this a few more times, and it seems to be destined for
mainline.


At some point in the past, I wrote:
>> altered so the parts stressing /proc/ are removed. I also had in the
>> back of my mind the notion that /proc/ performance improvements would
>> be appreciated by end users with limited cpu power to devote to the
>> monitoring of their workloads and machines' performance, which is part
>> of what motivated me to do it "the hard way" instead of modifying the
>> benchmark or replacing the userspace procps utilities with /dev/kmem
>> -diving utilities.

On Sat, Apr 03, 2004 at 02:43:28AM +1000, Antony Suter wrote:
> How important would improvements to /proc be now we have /sys ?

These were largely performance-oriented, not functionality. A rather
unfortunate aspect of that benchmark was that it effectively benchmarked
dozens or hundreds of processes doing ps(1) in parallel. End users may
find that the overhead of running top(1) is reduced by the patches meant
to speed up /proc/ for the benchmark, as many of them were single-
threaded speedups, and not just locking improvements.

The most important of the /proc/ performance patches was actually #5,
the forward port of bcrl's O(1) proc_pid_statm(). #4, the rbtree-based
get_tgid_list()/get_tid_list(), may also prove useful, and could use
some benchmarking done on it as a standalone patch.


At some point in the past, I wrote:
>> And finally, even with all this longwinded harangue, congratulations on
>> your tree. There are very definite feelings of importance and
>> satisfaction of having done service from producing releases others rely
>> upon. And these are real, as real users do benefit from what you've
>> assembled. I'm more than happy to help if you have bugreports in any
>> code I maintained or other need to call on me. And whatever precedent I
>> may have provided, you do own this, and this is your own original work.

On Sat, Apr 03, 2004 at 02:43:28AM +1000, Antony Suter wrote:
> Thanks for your kind words, and detailed notes! Again, any pointers to
> similar sorts of work would be greatly appreciated. Can you recommend
> any good tools for patch set management?

I've discovered quilt (based on akpm's patch scripts) is excellent and
am replacing my old scripts, which confused everyone but me, with it.
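For the curious, a typical quilt session looks roughly like this (the patch
and file names below are invented for illustration):

```
# start a new patch on top of the current series
quilt new fix-foo.patch

# tell quilt which files the patch will touch, then edit them
quilt add kernel/sched.c
vi kernel/sched.c

# capture the edits into patches/fix-foo.patch
quilt refresh

# pop and push patches to move around the stack
quilt pop -a     # unapply everything
quilt push -a    # reapply the whole series
```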


-- wli