2011-05-23 01:56:33

by Lucian Adrian Grijincu

[permalink] [raw]
Subject: [v3 00/39] faster tree-based sysctl implementation

Hi

This is version 3 of a patch series that introduces a faster/leaner
sysctl internal implementation.

Due to high number of patches and low general interest I'll just point
you to the tree/branch:

git://github.com/luciang/linux-2.6-new-sysctl.git v3-new-sysctl-alg


Patches are on top of v2.6.39. I did not pick a more recent (random)
point in Linus' tree to rebase these onto to not mess up testing.



Changes since v2
(http://thread.gmane.org/gmane.linux.kernel/1137032/focus=194748):
- added a compatibility layer to support old registering complex
sysctl trees. This layer will be deleted once all users of the old
are changed.
- subdirectories and netns correspondent dirs are now held in rbtrees
- split of from the patches that make changes in the rest of the tree
- rebased on top of 2.6.39




Changes since v1 (http://thread.gmane.org/gmane.linux.kernel/1133667):
- rebased on top of 2.6.39-rc6
- split the patch that adds the new algorithm and data structures.
- fixed a few bugs lingering in the old code
- shrinked a reference counter
- added a new reference counter to maintain ownership information
- added method to register an empty sysctl dir and converted some users
- added checks enforcing the rule that a non-netns specific directory may
not be registered after a netns specific one has already been registered.
- added cookie support: register a piece of data with the header to be
used to make simple conversions on the ctl_table. This saves memory where
we need to register sysctl tables with the same content affecting
different pieces of data.
- enforced sysctl checks



$ time modprobe dummy numdummies=N


NOTE: these stats are from v2. v3 should be a bit slower due to:
- the compatibility layer
- the old stats used cookies to prevent kmemdups() on ctl_table arrays
- the old patches had an optimisation for directories with many
subdirs that was replaced (in v3) with rbtrees

Without this patch series :(
- ipv4 only
- N=1000 time= 0m 06s
- N=2000 time= 0m 30s
- N=4000 time= 2m 35s
- ipv4 and ipv6
- N=1000 time= 0m 24s
- N=2000 time= 2m 14s
- N=4000 time=10m 16s
- N=5000 time=16m 3s

With this patch series :)
- ipv4 only
- N=1000 time= 0m 0.33s
- N=2000 time= 0m 1.25s
- N=4000 time= 0m 5.31s
- ipv4 and ipv6
- N=1000 time= 0m 0.41s
- N=2000 time= 0m 1.62s
- N=4000 time= 0m 7.64s
- N=5000 time= 0m 12.35s
- N=8000 time= 0m 36.95s




Patches marked with RFC: are patches where reviewers should pay more
attention as I may have missed something.


Part 1: introduce compatibility layer:
sysctl: introduce temporary sysctl wrappers
sysctl: register only tables of sysctl files


Part 2: minimal changes to sysctl users:
sysctl: call sysctl_init before the first sysctl registration
sysctl: no-child: manually register kernel/random
sysctl: no-child: manually register kernel/keys
sysctl: no-child: manually register fs/inotify
sysctl: no-child: manually register fs/epoll
sysctl: no-child: manually register root tables


Part 3: cleanups simplifying the new algorithm (created when asked to
break up the new algorithm patch:
sysctl: faster reimplementation of sysctl_check_table
sysctl: remove useless ctl_table->parent field
sysctl: simplify find_in_table
sysctl: sysctl_head_grab defaults to root header on NULL
sysctl: delete useless grab_header function
sysctl: rename ->used to ->ctl_use_refs
sysctl: rename sysctl_head_grab/finish to sysctl_use_header/unuse
sysctl: rename sysctl_head_next to sysctl_use_next_header
sysctl: split ->count into ctl_procfs_refs and ctl_header_refs
sysctl: rename sysctl_head_get/put to sysctl_proc_inode_get/put
sysctl: rename (un)use_table to __sysctl_(un)use_header
sysctl: simplify ->permissions hook
sysctl: group root-specific operations
sysctl: introduce ctl_table_group
sysctl: move removal from list out of start_unregistering


Part 4: new algorithm/data structures:
sysctl: faster tree-based sysctl implementation


Part 5: checks/warns requested during review:
sysctl: add duplicate entry and sanity ctl_table checks
sysctl: alloc ctl_table_header with kmem_cache
RFC: sysctl: change type of ctl_procfs_refs to u8
sysctl: check netns-specific registration order respected
sysctl: warn if registration/unregistration order is not respected
RFC: sysctl: always perform sysctl checks


Part 6: Eric requested rbtrees for subdirs. I also used rbtrees for
netns correspondent dirs. Hope that adding rbtrees after using the old
list-based implementation is good enough. The rbtree code complicates
things a bit and would uglify the patch adding the new algorithm.
sysctl: reorder members of ctl_table_header (cleanup)
sysctl: add ctl_type member
RFC: sysctl: replace subdirectory list with rbtree
RFC: sysctl: replace netns corresp list with rbtree
sysctl: union-ize some ctl_table_header fields


Part 7: Eric requested ability to register an empty dir:
sysctl: add register_sysctl_dir: register an empty sysctl directory


Part 8: unrequested feature I'd like to piggy back :)
sysctl: add ctl_cookie and ctl_cookie_handler
sysctl: add cookie to __register_sysctl_paths
sysctl: add register_net_sysctl_table_net_cookie


drivers/char/random.c | 27 +-
fs/eventpoll.c | 22 +-
fs/notify/inotify/inotify_user.c | 22 +-
fs/proc/inode.c | 2 +-
fs/proc/proc_sysctl.c | 233 +++++---
include/linux/inotify.h | 2 -
include/linux/key.h | 3 -
include/linux/poll.h | 2 -
include/linux/sysctl.h | 221 +++++---
include/net/net_namespace.h | 4 +-
init/main.c | 1 +
kernel/Makefile | 5 +-
kernel/sysctl.c | 1161 ++++++++++++++++++++++++++++----------
kernel/sysctl_check.c | 325 +++++++----
lib/Kconfig.debug | 8 -
net/sysctl_net.c | 86 ++--
security/keys/key.c | 7 +
security/keys/sysctl.c | 18 +-
18 files changed, 1495 insertions(+), 654 deletions(-)


--
 .
..: Lucian


2011-05-23 04:27:24

by Eric W. Biederman

[permalink] [raw]
Subject: Re: [v3 00/39] faster tree-based sysctl implementation

Lucian Adrian Grijincu <[email protected]> writes:

> Hi
>
> This is version 3 of a patch series that introduces a faster/leaner
> sysctl internal implementation.
>
> Due to high number of patches and low general interest I'll just point
> you to the tree/branch:
>
> git://github.com/luciang/linux-2.6-new-sysctl.git v3-new-sysctl-alg
>
>
> Patches are on top of v2.6.39. I did not pick a more recent (random)
> point in Linus' tree to rebase these onto to not mess up testing.


Thanks for keeping going on this.

This patchset looks like it is deserving of some close scrutiny, and
not just the high level design overview I have given the previous
patches. This is going to be a busy week for me so I probably won't
get through all of the patches for a while.

I will mention a couple of nits I noticed while I was skimming through
your patches.
- There can be multiple proc superblocks and thus multiple inodes
referring to the same /proc/sys file, if there are multiple pid
namespaces.

- I have a hope to move /proc/sys into /proc/<pid>/sys so we don't have
to look at current to determine the namespace we want to display.
That would allow the deeply magic sysctl_is_seen check to be removed
from proc_sys_compare. That is not your problem, but of an
explanation why the namespaces are passed through.

Eric

2011-05-23 06:00:04

by Lucian Adrian Grijincu

[permalink] [raw]
Subject: Re: [v3 00/39] faster tree-based sysctl implementation

On Mon, May 23, 2011 at 7:27 AM, Eric W. Biederman
<[email protected]> wrote:
> I will mention a couple of nits I noticed while I was skimming through
> your patches.
> - There can be multiple proc superblocks and thus multiple inodes
>  referring to the same /proc/sys file, if there are multiple pid
>  namespaces.


OK. I'll revert the patch that make converts an int counter into an u8
and post an update. I guess this is what you were referring to.
https://github.com/luciang/linux-2.6-new-sysctl/commit/b7a547b8ce7484ae22eea34a860f674fcb19c11b


> - I have a hope to move /proc/sys into /proc/<pid>/sys so we don't have
>  to look at current to determine the namespace we want to display.
>  That would allow the deeply magic sysctl_is_seen check to be removed
>  from proc_sys_compare.  That is not your problem, but of an
>  explanation why the namespaces are passed through.


OK, I'll go an change things to pass the current namespace as it were
before my changes.


Thank you.
--
 .
..: Lucian

2011-05-23 06:37:58

by Lucian Adrian Grijincu

[permalink] [raw]
Subject: Re: [v3 00/39] faster tree-based sysctl implementation

On Mon, May 23, 2011 at 7:27 AM, Eric W. Biederman
<[email protected]> wrote:
> This patchset looks like it is deserving of some close scrutiny, and
> not just the high level design overview I have given the previous
> patches.  This is going to be a busy week for me so I probably won't
> get through all of the patches for a while.


I have one more question. The current implementation uses a single
sysctl_lock to synchronize all changes to the data structures.

In my algorithm I change a few places to use a per-header read-write
lock. Even though the code is organized to handle a per-header rwlock,
the implementation uses a single global rwlock. In v2 I got rid of the
rwlock and replaced the subdirs/files regular lists with rcu-protected
lists and that's why I did not bother giving each header a rwlock.


I have no idea how to use rcu with rbtree. Should I now give each
header it's own lock to reduce contention?


I'm asking this because I don't know why the only is a global sysctl
spin lock, when multiple locks could have been used, each to protect
it's own domain of values.

If you'd like to keep locking as simple as possible (to reduce all the
potential problems brought on by too many locks), or if in general
contention is low enough, then global lock is better. If not, then
I'll change the code to support per-header rwlocks (increasing the
ctl_table_header structure size).

--
 .
..: Lucian

2011-05-23 09:32:21

by Eric W. Biederman

[permalink] [raw]
Subject: Re: [v3 00/39] faster tree-based sysctl implementation

Lucian Adrian Grijincu <[email protected]> writes:

> On Mon, May 23, 2011 at 7:27 AM, Eric W. Biederman
> <[email protected]> wrote:
>> This patchset looks like it is deserving of some close scrutiny, and
>> not just the high level design overview I have given the previous
>> patches.  This is going to be a busy week for me so I probably won't
>> get through all of the patches for a while.
>
>
> I have one more question. The current implementation uses a single
> sysctl_lock to synchronize all changes to the data structures.
>
> In my algorithm I change a few places to use a per-header read-write
> lock. Even though the code is organized to handle a per-header rwlock,
> the implementation uses a single global rwlock. In v2 I got rid of the
> rwlock and replaced the subdirs/files regular lists with rcu-protected
> lists and that's why I did not bother giving each header a rwlock.
>
>
> I have no idea how to use rcu with rbtree. Should I now give each
> header it's own lock to reduce contention?

I would only walk down that path if we can find some profile data
showing that the lock is where we are hot.

> I'm asking this because I don't know why the only is a global sysctl
> spin lock, when multiple locks could have been used, each to protect
> it's own domain of values.

Mostly it is simplicity. There is also the fact that the spin lock is
used in the implementation of something that is essentially a
reader/writer lock already.

With the help of the reference counts we block when we are unregistering
until there are no more users.

In that context I'm not certain I am comfortable with separating proc
inode usage from other proc usage. But I haven't read through that
section of your code well enough yet to tell if you are making sense.

One of the things that would be very nice to do is add lockdep
annotations like I have to sysfs_activate and sysfs_deactivate, so we
can catch the all too common case of someone unregistering a sysctl
table when there are problems.

Personally I'm not happy with the state of the locking abstractions in
sysctl today. It is all much too obscure, and there are too few
warnings. However for your set of changes I think the thing to focus
on is getting sysctl to better data structures so that it can scale.

Once the data structures are simple enough any remaining issues should
be fixable with small straight forward patches.

Eric

2011-05-23 13:27:09

by Lucian Adrian Grijincu

[permalink] [raw]
Subject: Re: [v3 00/39] faster tree-based sysctl implementation

On Mon, May 23, 2011 at 12:32 PM, Eric W. Biederman
<[email protected]> wrote:
> Mostly it is simplicity.  There is also the fact that the spin lock is
> used in the implementation of something that is essentially a
> reader/writer lock already.


The amount of time in which the spin lock is held in the current
implementation can be quite large: in __register_sysctl_paths:

https://github.com/mirrors/linux-2.6/blob/v2.6.39/kernel/sysctl.c#L1887

spin_lock(&sysctl_lock);
for (set = header->set; set; set = set->parent)
list_for_each_entry(p, &set->list, ctl_entry)
try_attach(p, header);
spin_unlock(&sysctl_lock);


For N=10^5 headers and try_attach=O(N) it's not a very good locking mechanism.

That's why I opted for a rwlock for each dir's subdirs/tables.


> In that context I'm not certain I am comfortable with separating proc
> inode usage from other proc usage. But I haven't read through that
> section of your code well enough yet to tell if you are making sense.


Proc inode usage (->count) was already separate from other proc usage (->use).
It was not separate from other header references (shared in ->count).

I separated the two because when I call unregister on a header I need
to decide whether to really unregister it (->unregistering=true and no
one can see this header and anything under it any more) or just
decrement a reference.

In the current implementation a header is only created by a
__register_sysctl_paths call and it's clear that at unregister we have
to set ->unregistering.
In my implementation headers are created dynamically to create new
directory elements. I need to know when to unregister such a header
regardless of any possible procfs inode references.

https://github.com/luciang/linux-2.6-new-sysctl/blob/v4-new-sysctl-alg/kernel/sysctl.c#L2390



I pushed a new version:
git://github.com/luciang/linux-2.6-new-sysctl.git v4-new-sysctl-alg

I undid int->u8 for ctl_procfs_refs.

I left the ->permissions hook get it's namespace form current->
because rewriting history for that change trips on too many patches
and a new parameter can be very easily added later when needed. Hope
this is ok with you.


I'd like to send patches for review to archs/drivers/etc. that
register only tables of files, not whole sysctl trees.
The patches don't depend on anything from this series.

Examples:
* http://thread.gmane.org/gmane.linux.kernel/1137032/focus=1137089
* http://thread.gmane.org/gmane.linux.kernel/1137032/focus=1137087


I'd like an OK-GO from you.

--
 .
..: Lucian