MIME-Version: 1.0
In-Reply-To: <20150526114356.609107918@infradead.org>
References: <20150526114356.609107918@infradead.org>
Date: Tue, 26 May 2015 11:12:04 -0700
Message-ID: <CA+55aFwxjAWTm9XZg0ZE6kB70S8iKV6b8RQL8Bi6fapaPbwpRQ@mail.gmail.com>
Subject: Re: [RFC][PATCH 0/5] Optimize percpu-rwsem
From: Linus Torvalds <torvalds@linux-foundation.org>
To: Peter Zijlstra <peterz@infradead.org>
Cc: Oleg Nesterov <oleg@redhat.com>,
        Paul McKenney <paulmck@linux.vnet.ibm.com>, Tejun Heo <tj@kernel.org>,
        Ingo Molnar <mingo@redhat.com>,
        Linux Kernel Mailing List <linux-kernel@vger.kernel.org>,
        der.herr@hofr.at, Davidlohr Bueso <dave@stgolabs.net>
Content-Type: text/plain; charset=UTF-8
Sender: linux-kernel-owner@vger.kernel.org
Content-Length: 2907
Lines: 79

On Tue, May 26, 2015 at 4:43 AM, Peter Zijlstra <peterz@infradead.org> wrote:
>
> This is a derived work of the cpu hotplug lock rework I did in 2013 which never
> really went anywhere because Linus didn't like it.
>
> This applies those same optimizations to the percpu-rwsem. Seeing how we did
> all the work it seemed a waste to not use it at all.

So I *still* don't like it.

We literally have one single percpu-rwsem IN THE WHOLE KERNEL TREE.

One.

Not "a couple". Not "hundreds". ONE.

And that single use that you are optimizing for isn't even explained.
It's not really even clear that that thing is needed: fork() already
takes the mmap_sem for writing, and I'm not at all convinced that the
uprobes code couldn't just use mmap_sem for every mm it traverses,
rather than use this global percpu lock that we use absolutely nowhere
else.

So I'm not convinced that the right thing to do is to try to optimize
that lock at ALL. I'd rather get rid of it entirely.

Is there some new use that I don't know about? Have people really
looked at that uprobes code deeply? OF COURSE global locks will have
problems, I'm not at all convinced that "let's make that global lock
really complicated and clever" is the proper solution.

For example, the write lock is only ever taken when registering an
uprobe. Not exactly likely to be timing-critical. Is there some reason
why we couldn't get rid of the magical per-cpu rwsem entirely, and
replace it with something truly simple like a hashed rmsem, where the
readers (fork), will just read-lock one hashed rwsem, and the writer
will just write-lock every hashed rwsem.

The hash might be "pick the currently executing CPU at the time of
fork(), modulo 16". That one doesn't even need to disable preemption,
since it doesn't actually matter that the CPU stays constant, it's
just a trivial heurisitic to avoid cacheline pingpongs under very
heavy fork() load.

So we'd have

    #define NR_UPROBE_RWSEM 16

    // One cacheline per rwsem.
    struct cache_line_aligned_rw_semaphore uprobe_rwsem[NR_UPROBE_RWSEM];

and fork() would basically do

    int idx = raw_smp_processor_id() & (NR_UPROBE_RWSEM-1);
    struct rw_sem *usem = &uprobe_rwsem[idx].rwsem;
     ..
    down_read(usem);
    ... do the dup_mmap() here ..
    up_read(usem);

and the uprobe register thing could just do

    for (i = 0; i < NR_UPROBE_RWSEM; i++)
        down_write(&uprobe_rwsem[i].rwsem);
    ...
    for (i = 0; ... unlock them all ..)

which doesn't look any slower than what we do now (I suspect the
fork() part would be faster), and is much simpler, and would get rid
of the percpu-rwsem entirely.

Hmm?

                  Linus
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/