Date: Thu, 18 Mar 2004 18:58:55 +0100
From: Andrea Arcangeli
To: Andrew Morton
Cc: mjy@geizhals.at, linux-kernel@vger.kernel.org
Subject: Re: CONFIG_PREEMPT and server workloads
Message-ID: <20040318175855.GB2536@dualathlon.random>
References: <40591EC1.1060204@geizhals.at> <20040318060358.GC29530@dualathlon.random>
 <20040318015004.227fddfb.akpm@osdl.org> <20040318145129.GA2246@dualathlon.random>
 <20040318093902.3513903e.akpm@osdl.org>
In-Reply-To: <20040318093902.3513903e.akpm@osdl.org>

On Thu, Mar 18, 2004 at 09:39:02AM -0800, Andrew Morton wrote:
> Andrea Arcangeli wrote:
> >
> > > > Worst of all we're now taking spinlocks earlier than needed,
> > >
> > > Where? CPU scheduler?
> >
> > Everywhere, see the kmaps, we spinlock before instead of spinlock after,
> > the scheduler, lots of places. I mean, people don't call
> >
> > 	preempt_disable()
> > 	kmap_atomic
> > 	spin_lock
> >
> > they do:
> >
> > 	spin_lock
> > 	kmap_atomic
>
> They do? kmap_atomic() disables preemption anyway.

dunno why but see:

	spin_lock(&mm->page_table_lock);
	page_table = pte_offset_map(pmd, address);
	pte_unmap(page_table);
	spin_unlock(&mm->page_table_lock);

etc...

maybe at some point in the past kmap_atomic didn't disable preempt? Just
guessing.

> > so they're effectively optimizing for PREEMPT=y and I don't think this
> > is optimal for the long term. One can argue the microscalability
> > slowdown isn't something to worry about, I certainly don't worry about
> > it too much either, it's more a bad coding habit to spinlock earlier
> > than needed to avoid preempt_disable.
> >
> > > > and the preempt_count stuff isn't optimized away by PREEMPT=n,
> > >
> > > It should be. If you see somewhere where it isn't, please tell us.
> >
> > the counter is definitely not optimized away, see:
> >
> > #define inc_preempt_count() \
> > do { \
> > 	preempt_count()++; \
> > } while (0)
> >
> > #define dec_preempt_count() \
> > do { \
> > 	preempt_count()--; \
> > } while (0)
> >
> > #define preempt_count()	(current_thread_info()->preempt_count)
> >
> > those are running regardless of PREEMPT=n.
>
> The macros are needed for kmap_atomic().
>
> The task->preempt field is also used for tracking the hardirq and softirq
> depths (in_interrupt(), in_irq(), in_softirq()) so it cannot be removed
> with CONFIG_PREEMPT=n.
>
> > > We unconditionally bump the preempt_count in kmap_atomic() so that we can
> > > use atomic kmaps in read() and write(). This is why four concurrent
> > > write(fd, 1, buf) processes on 4-way is 8x faster than on 2.4 kernels.
> >
> > sorry, why should the atomic kmaps read the preempt_count?
> > Are those ++ / -- useful for anything more than debugging PREEMPT=y on a
> > kernel compiled with PREEMPT=n? I thought it was just debugging code with
> > PREEMPT=n.
> >
> > I know why the atomic kmaps speed up write but I don't see how
> > preempt_count can help there when PREEMPT=n, the atomic kmaps are purely
> > per-cpu and one can't schedule anyways while taking those kmaps (no
> > matter if inc_preempt_count or not).
>
> We run inc_preempt_count() in kmap_atomic() so that we can perform
> copy_*_user() while holding an atomic kmap in read() and write().
> do_page_fault() will detect the elevated preempt count and will return a
> short copy.
>
> This could be implemented with a separate task_struct field but given that
> preempt_count is always there and the infrastructure exists anyway, using
> preempt_count() is tidier.

thanks for the detailed explanation, so I was definitely wrong about
claiming there was more stuff to optimize away with PREEMPT=n; clearly
your atomic copy-user being interrupted by the page fault is a great
feature.

In fact I wonder if we should try one more time and go atomic again.
What you're doing right now is:

	kmap_atomic()
	left = copy_user
	kunmap_atomic
	if (left) {
		kmap()	<- persistent unscalable
		copy_user
		kunmap
	}

I would suggest we should be even more aggressive, like this (see the
sketch at the end of this mail):

	kmap_atomic()
	left = copy_user
	kunmap_atomic
	if (left) {
		get_user()
		kmap_atomic()
		left = copy_user
		kunmap_atomic
		if (left) {
			kmap()	<- persistent unscalable
			copy_user
			kunmap
		}
	}

It's not going to trigger often anyways, but it's only a few bytecodes
more and it sounds more scalable.
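
To make that concrete, here is a rough sketch of the idea, modeled on how I
remember the 2.6 write() path doing the atomic copy (something like
filemap_copy_from_user() in mm/filemap.c -- the helper name, the KM_USER0
slot and the exact prototype are from memory, so treat this as an
illustration, not a patch). The only change over the current code is the
"fault the user page in, then retry the atomic copy" step before falling
back to the persistent kmap:

static inline int filemap_copy_from_user(struct page *page,
					 unsigned long offset,
					 const char __user *buf,
					 unsigned bytes)
{
	char *kaddr;
	char dummy;
	int left;

	/* fast path: scalable per-cpu atomic kmap */
	kaddr = kmap_atomic(page, KM_USER0);
	left = __copy_from_user(kaddr + offset, buf, bytes);
	kunmap_atomic(kaddr, KM_USER0);

	if (left) {
		/*
		 * The user page wasn't mapped: touch it while we are
		 * allowed to sleep, then retry the atomic copy once.
		 */
		if (!get_user(dummy, buf)) {
			kaddr = kmap_atomic(page, KM_USER0);
			left = __copy_from_user(kaddr + offset, buf, bytes);
			kunmap_atomic(kaddr, KM_USER0);
		}
		if (left) {
			/* last resort: persistent, unscalable kmap */
			kaddr = kmap(page);
			left = __copy_from_user(kaddr + offset, buf, bytes);
			kunmap(page);
		}
	}
	return left;
}

If the copy crosses into a second user page that is also unmapped, the
retry can still come up short, which is what the final persistent-kmap
fallback remains there for.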
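
For completeness, the "short copy" behaviour relies on the fault handler
refusing to sleep while the preempt count is raised; roughly, paraphrased
from memory of the i386 do_page_fault() (not a verbatim quote):

	/*
	 * If the fault happened inside an atomic region (preempt_count
	 * raised, e.g. by kmap_atomic) or there is no mm to fault the
	 * page in from, we cannot sleep, so go straight to the
	 * exception-table fixup.  For a faulting copy_*_user() that
	 * fixup makes it return the number of bytes left uncopied --
	 * the short copy Andrew describes above.
	 */
	if (in_atomic() || !mm)
		goto bad_area_nosemaphore;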