This is version 2 of my Task Watchers patches.
The task watchers mechanism calls registered functions whenever a task forks,
execs, changes its [re][ug]id, or exits.
Task watchers is primarily useful to existing kernel code as a means of making
the fork and exit paths more readable. Kernel code hooks into these paths by
marking a function as a task watcher, much as modules mark their init
functions with module_init(). This improves the readability of copy_process().
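For illustration only, marking a function this way might look like the sketch
below. The callback and macro names are hypothetical stand-ins (the macro
itself is sketched after the next paragraph), not necessarily the exact API in
these patches:

	static int my_feature_task_init(struct task_struct *task,
					unsigned long clone_flags)
	{
		/* per-task setup for this feature, run during fork */
		return 0;	/* non-zero would make the fork fail */
	}

	/* one line of code registers the watcher, module_init()-style */
	task_watcher(init, my_feature_task_init);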
The first patch adds the basic infrastructure of task watchers: notification
function calls in the various paths and a table of function pointers to be
called. It uses an ELF section because parts of the table must be gathered
from all over the kernel code, and using the linker is easier than resolving
and maintaining complex header interdependencies. An ELF table is also ideal
because its "readonly" nature means that neither locking nor list traversal
is required.
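As a rough sketch of how such a linker-gathered table works -- all names here
are hypothetical; the real section, symbol, and struct names come from the
patches and their .lds changes:

	#include <linux/sched.h>

	/* One table entry: a function to call when the event fires. */
	struct task_watcher_entry {
		int (*fn)(struct task_struct *task, unsigned long val);
	};

	/* Registration drops an entry into a dedicated ELF section; the
	 * linker gathers the entries from every object file in the kernel.
	 */
	#define task_watcher(event, func)				\
		static const struct task_watcher_entry __tw_##func	\
		__attribute__((__used__,				\
			       __section__(".task_watchers." #event))) = \
		{ .fn = (func) }

	/* Table bounds are provided by the architecture linker scripts. */
	extern const struct task_watcher_entry __start_task_watchers_init[];
	extern const struct task_watcher_entry __end_task_watchers_init[];

	static int run_init_task_watchers(struct task_struct *task,
					  unsigned long val)
	{
		const struct task_watcher_entry *tw;
		int err;

		/* The table is fixed at link time, so walking it requires
		 * neither locking nor linked-list traversal.
		 */
		for (tw = __start_task_watchers_init;
		     tw < __end_task_watchers_init; tw++) {
			err = tw->fn(task, val);
			if (err)
				return err;
		}
		return 0;
	}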
Subsequent patches adapt existing parts of the kernel to use a task watcher
-- typically in the fork, clone, and exit paths:
FEATURE (notes)                                 RELEVANT CONFIG VARIABLE
------------------------------------------------------------------------
audit                                           [ CONFIG_AUDIT ... ]
semundo                                         [ CONFIG_SYSVIPC ]
cpusets                                         [ CONFIG_CPUSETS ]
mempolicy                                       [ CONFIG_NUMA ]
trace irqflags                                  [ CONFIG_TRACE_IRQFLAGS ]
lockdep                                         [ CONFIG_LOCKDEP ]
keys (for processes -- not for thread groups)   [ CONFIG_KEYS ]
process events connector                        [ CONFIG_PROC_EVENTS ]
TODO:
    Mark the task watcher table ELF section read-only. I've tried to
    "fix" the .lds files to do this with no success. I'd really
    appreciate help from folks familiar with writing linker scripts.

    I'm working on three more patches that add support for creating a
    task watcher from within a module using an ELF section. They haven't
    received as much attention since I've been focusing on measuring the
    performance impact of these patches.
Changes:

since v2 RFC:
    Updated to 2.6.19-rc2-mm2
    Compiled, booted, tested, and benchmarked

    Testing:
        Booted with audit=1 profile=2
        Enabled profiling tools
        Enabled auditing
        Ran random syscall test
        IRQ trace and lockdep CONFIG=y not tested

    Benchmarks:
        A clone benchmark (try to clone as fast as possible)
            Unrealistic. Shows incremental cost of one task watcher
        A fork benchmark (try to fork as fast as possible)
            Unrealistic. Shows incremental cost of one task watcher
        Kernbench
            Closer to realistic.
        Result summaries follow changelog
        See patches for details
        Fork and clone samples available on request (too large for email)
        Fork and clone benchmark sources will be posted as replies to 00
v2:
    Dropped use of notifier chains
    Dropped per-task watchers
        Can be implemented on top of this
        Still requires notifier chains
    Dropped taskstats conversion
        Parts of taskstats had to move away from the regions of
        copy_process() and do_exit() where task_watchers are notified
    Used linker script mechanism suggested by Al Viro
    Created one "list" of watchers per event as requested by Andrew Morton
        No need to multiplex a single function call
        Easier to statically register/unregister watchers: 1 line of code
    val param now used for:
        WATCH_TASK_INIT:  clone_flags
        WATCH_TASK_CLONE: clone_flags
        WATCH_TASK_EXIT:  exit code
        WATCH_TASK_*:     <unused>
        (see the call-site sketch after this changelog)
    Renamed notify_watchers() to notify_task_watchers()
    Replaced: if (err != 0) --> if (err)
    Added patches converting more "features" to use task watchers
    Added return code handling to WATCH_TASK_INIT
        Return code handling elsewhere didn't seem appropriate
        since there was generally no response necessary
    Fixed process keys free to handle failure in fork as originally
    coded in copy_process
    Added process keys code to watch for [er][ug]id changes

v1:
    Added ability to cause fork to fail with NOTIFY_STOP_MASK
    Added WARN_ON() when watchers cause WATCH_TASK_FREE to stop early
    Moved fork invocation
    Moved exec invocation
    Added current as argument to exec invocation
    Moved exit code assignment
    Added id change invocations
    (70 insertions)

v0:
    Based on Jes Sorensen's Task Notifiers patches (posted to LSE-Tech)
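To illustrate the val parameter and the WATCH_TASK_INIT return-code handling
noted in the v2 changes above, here is a hypothetical call-site sketch. Only
notify_task_watchers(), the WATCH_TASK_* names, and the val conventions come
from this changelog; the argument order, placement, and labels are
illustrative guesses:

	/* Sketch: inside copy_process(), once the new task p exists,
	 * val carries clone_flags for WATCH_TASK_INIT.
	 */
	retval = notify_task_watchers(WATCH_TASK_INIT, clone_flags, p);
	if (retval)
		goto bad_fork_cleanup;	/* a watcher can veto the fork */

	/* Sketch: in do_exit(), val carries the exit code; the return
	 * value is ignored since no response is necessary there.
	 */
	notify_task_watchers(WATCH_TASK_EXIT, code, tsk);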
Benchmark result summaries (sorry, this part is 86 columns):

System: 4 x 1.7GHz ppc64 (Power 4+) processors, 30968600kB RAM,
2.6.19-rc2-mm2 kernel
Clone - Incremental worst-case costs measured in tasks/second and as a
percentage of expected rate

                                        Patch
                     1       2       3       4       5       6       7       8       9
--------------------------------------------------------------------------------------
Incremental
Cost (tasks/s)  -38.12    12.5     -84    25.2  -187.5 -0.5834  -11.36  -125.2  -64.05
Cost Err         122.3   17.84   67.11   61.03    41.8   34.64   45.53   58.28   53.18
Cost (%)          -0.2    0.07    -0.5     0.1      -1  -0.004   -0.06    -0.7    -0.4
Cost Err (%)       0.7     0.1     0.4     0.3     0.2     0.2     0.2     0.3     0.3
Fork - Incremental worst-case costs measured in tasks/second and as a
percentage of expected rate

                                        Patch
                     1       2       3       4       5       6       7       8       9
--------------------------------------------------------------------------------------
Incremental
Cost (tasks/s)  -64.58  -35.74  -33.29   -25.8  -139.5  -7.311    -9.2  -131.4  -50.47
Cost Err         54.09   27.58   41.76   42.47   49.87   60.94   29.72    39.7   40.89
Cost (%)          -0.3    -0.2    -0.2    -0.1    -0.8   -0.04   -0.05    -0.7    -0.3
Cost Err (%)       0.3     0.2     0.2     0.2     0.3     0.3     0.2     0.2     0.2
Kernbench Measurements

Patch   Elapsed(s)     User(s)   System(s)      CPU(%)
-          124.406     439.947      46.615     390.700   <-- baseline 2.6.19-rc2-mm2
1          124.353     439.935      46.334     390.400
2          124.234     439.700      46.503     390.800
3          124.248     439.830      46.258     390.700
4          124.357     439.753      46.582     390.600
5          124.333     439.787      46.491     390.700
6          124.532     439.732      46.497     389.900
7          124.359     439.756      46.457     390.300
8          124.272     439.643      46.320     390.500
9          124.400     439.787      46.485     390.300
Mean:      124.349     439.787      46.454     390.490
Stddev:   0.087641    0.095917    0.115309    0.272641
Kernbench - Incremental costs

Patch   Elapsed(s)     User(s)   System(s)      CPU(%)
1           -0.053      -0.012      -0.281        -0.3
2           -0.119      -0.235       0.169         0.4
3            0.014       0.130      -0.245        -0.1
4            0.109      -0.077       0.324        -0.1
5           -0.024       0.034      -0.091         0.1
6            0.199      -0.055       0.006        -0.8
7           -0.173       0.024      -0.040         0.4
8           -0.087      -0.113      -0.137         0.2
9            0.128       0.144       0.165        -0.2
Mean:     0.005875     -0.0185    0.018875     -0.0125
Stddev:    0.13094     0.12738      0.1877     0.39074
Andrew, please consider these patches for 2.6.20's -mm tree.
Cheers,
-Matt Helsley
--
Matt wrote:
> Task watchers is primarily useful to existing kernel code as a means of making
> the code in fork and exit more readable.
I don't get it. The benchmark data isn't explained in plain English
anywhere that I could find, so I am just guessing. But looking
at the last (17500) column of the fork results, after applying patch
1/9, I see a number of 18565, and looking at that same column in patch
9/9, I see a number of 18142.

I guess that means a drop of ((18565 - 18142) / 18565) == 2% in the fork
rate, to make the code "more readable".
And I'm not even sure it makes it more readable. Looks to me like another
layer of apparatus, which is one more thing to figure out before a reader
understands what is going on.
I'd gladly put in a few long days to improve the fork rate 2%, and I am
grateful to those who have already done so - whoever they are.
Somewhere I must have missed the memo explaining why this patch is a
good idea - sorry.
--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <[email protected]> 1.925.600.0401
On Fri, 2006-11-03 at 00:57 -0800, Paul Jackson wrote:
> Matt wrote:
> > Task watchers is primarily useful to existing kernel code as a means of making
> > the code in fork and exit more readable.
>
> I don't get it. The benchmark data isn't explained in plain English
Sorry, there were no units on the per-patch fork and clone data. The
units there are tasks created per second. The kernbench units are in
place and should be fairly self-explanatory, I think.
Here's what I did:
Measure the time it takes to fork N times. Retry 100 times. Try
different N. Try clone instead of fork to see how different the results
can be.
Then run kernbench.
Do the above after applying each patch. Then compare to the previous
patch (or unpatched source).
Run statistics on the numbers.
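For concreteness, the fork variant boils down to something like this sketch
(the actual benchmark sources are the ones posted as replies to patch 00;
this just illustrates the loop being timed):

	#include <stdio.h>
	#include <stdlib.h>
	#include <sys/time.h>
	#include <sys/types.h>
	#include <sys/wait.h>
	#include <unistd.h>

	int main(int argc, char **argv)
	{
		int n = argc > 1 ? atoi(argv[1]) : 10000; /* forks per sample */
		int samples = argc > 2 ? atoi(argv[2]) : 100;
		struct timeval start, end;
		double secs;
		int i, s;

		for (s = 0; s < samples; s++) {
			gettimeofday(&start, NULL);
			for (i = 0; i < n; i++) {
				pid_t pid = fork();
				if (pid == 0)
					_exit(0);	/* child does no work */
				else if (pid > 0)
					waitpid(pid, NULL, 0);
			}
			gettimeofday(&end, NULL);
			secs = (end.tv_sec - start.tv_sec) +
			       (end.tv_usec - start.tv_usec) / 1e6;
			printf("%.1f tasks/sec\n", n / secs); /* one sample */
		}
		return 0;
	}

The clone variant substitutes clone() with the flags of interest; the mean
and standard deviation are then computed over the printed samples.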
> anywhere that I could find, so I am just guessing. But looking
> at the last (17500) column of the fork results, after applying patch
> 1/9, I see a number of 18565, and looking at that same column in patch
> 9/9, I see a number of 18142.
>
> I guess that means a drop of ((18565 - 18142) / 18565) == 2% in the fork
> rate, to make the code "more readable".
Well, it's a worst-case scenario. Without the patches I've seen the
fork rate intermittently (once every 300 samples) drop to 16k forks/sec
-- a much bigger drop than 2%. I also ran the tests on Andrew's hotfix
patches for rc2-mm2 and got similar differences even though the patches
don't change the fork path. And finally, don't forget to compare that to
the error -- about +/-1.6%. So on an absolute worst-case workload we
could have a drop anywhere from 0.4 to 3.6%.
To get a better idea of the normal impact of these patches I think you
have to look at benchmarks more like kernbench, since it is not composed
entirely of fork calls. There the measurements are easily within the
error margins with or without the patches.
Unfortunately the differences I get always seem to be right around the
size of the error. I can't seem to get a benchmark to have an error of
1% or less. I'm open to suggestions of different benchmarks or how to
obtain tighter bounds on the measurements (e.g. /proc knobs to fiddle
with).
> And I'm not even sure it makes it more readable. Looks to me like another
> layer of apparatus, which is one more thing to figure out before a reader
> understands what is going on.
It's nice to see a module's init function with the rest of the module
and not cluttering up the kernel's module loading code. The use,
benefits, disadvantages, and even the implementation of task watchers
are similar. I could rename it (task_init(), task_exit(), etc.) to make
the similarity more apparent.
> I'd gladly put in a few long days to improve the fork rate 2%, and I am
> grateful to those who have already done so - whoever they are.
I'm open to suggestions on how to improve the performance. :)
> Somewhere I must have missed the memo explaining why this patch is a
> good idea - sorry.
Well, it should make things look cleaner. It's also intended to be
useful in new code like containers and resource management -- pieces
that many people don't want cluttering up those paths.
Cheers,
-Matt Helsley