LinuxLists.cc - [PATCH] only irq-safe atomic ops

[permalink] [raw]

Subject: Re: [PATCH] only irq-safe atomic ops

Robert Love wrote:
>
> The following patch implements i386 versions of atomic_inc and
> atomic_dec that are LOCK-less but provide IRQ-atomicity and act as a
> memory-barrier.
>
> An applicable use would be data that needs to be IRQ-safe but not
> SMP-safe (or, more likely, is already SMP-safe for some other reason).
>
> Additionally, these variants could prevent having to use
> preempt_disable/enable or "full" atomic ops around per-CPU data with a
> preemptible kernel.
>

Thanks, Robert.

Some background here - for the delayed allocation code which I'm
cooking up I need to globally count the number of dirty pages in the
machine. Currently that's done with atomic_inc(&nr_dirty_pages)
in SetPageDirty().

But this counter (which is used for when-to-start-writeback decisions)
is unavoidably approximate. It would be nice to make it a per-CPU
array. So on the rare occasions when the dirty-page count is needed,
I can just whizz across the per-cpu counters adding them all up.

But how to increment or decrement a per-cpu counter? The options
are:

- per_cpu_integer++;

This is *probably* OK on all architectures. But there are no
guarantees that the compiler won't play games, and that this
operation is preempt-safe.

- preempt_disable(); per_cpu_counter++; preempt_enable();

A bit heavyweight for add-one-to-i.

- atomic_inc

A buslocked operation where it is not needed - we only need
a preempt-locked operation here. But it's per-cpu data, and
the buslocked rmw won't be too costly.

I can't believe how piddling this issue is :)

But if there's a general need for such a micro-optimisation
then we need to:

1: Create <linux/atomic.h> (for heavens sake!)

2: In <linux/atomic.h>,

#ifndef ARCH_HAS_ATOMIC_INQ_THINGIES
#define atomic_inc_irq atomic_inc
...
#endif

But for now, I suggest we not bother. I'll just use atomic_inc().

-

2002-02-23 07:30:02

[permalink] [raw]

Subject: Re: [PATCH] only irq-safe atomic ops

On Sat, 2002-02-23 at 01:51, Andrew Morton wrote:

> Thanks, Robert.

You are welcome.

> Some background here - for the delayed allocation code which I'm
> cooking up I need to globally count the number of dirty pages in the
> machine. Currently that's done with atomic_inc(&nr_dirty_pages)
> in SetPageDirty().
>
> But this counter (which is used for when-to-start-writeback decisions)
> is unavoidably approximate. It would be nice to make it a per-CPU
> array. So on the rare occasions when the dirty-page count is needed,
> I can just whizz across the per-cpu counters adding them all up.

Question: if (from below) you are going to use atomic operations, why
make it per-CPU at all? Just have one counter and atomic_inc and
atomic_read it. You won't need a spin lock.

> But how to increment or decrement a per-cpu counter? The options
> are:
>
> - per_cpu_integer++;
>
> This is *probably* OK on all architectures. But there are no
> guarantees that the compiler won't play games, and that this
> operation is preempt-safe.

This would be atomic and thus preempt-safe on any sane arch I know, as
long as we are dealing with a normal type int. Admittedly, however, we
can't be sure what the compiler would do.

Thinking about it, you are probably going to be doing this:

++counter[smp_processor_id()];

and that is not preempt-safe since the whole operation certainly is not
atomic. The current CPU could change between calculating it and
referencing the array. But, that wouldn't matter as long as you only
cared about the sum of the counters.

> - preempt_disable(); per_cpu_counter++; preempt_enable();
>
> A bit heavyweight for add-one-to-i.

Agreed, although I bet its not noticeable.

> - atomic_inc
>
> A buslocked operation where it is not needed - we only need
> a preempt-locked operation here. But it's per-cpu data, and
> the buslocked rmw won't be too costly.
>
> I can't believe how piddling this issue is :)
>
> But if there's a general need for such a micro-optimisation
> then we need to:
>
> 1: Create <linux/atomic.h> (for heavens sake!)
>
> 2: In <linux/atomic.h>,
>
> #ifndef ARCH_HAS_ATOMIC_INQ_THINGIES
> #define atomic_inc_irq atomic_inc
> ...
> #endif

I can think up a few more uses of the irq/memory-safe atomic ops, so I
bet this isn't that bad of an idea. But no point doing it without a
corresponding use.

> But for now, I suggest we not bother. I'll just use atomic_inc().

Robert Love

2002-02-23 07:56:31

[permalink] [raw]

Subject: Re: [PATCH] only irq-safe atomic ops

Robert Love wrote:
>
> ...
>
> Question: if (from below) you are going to use atomic operations, why
> make it per-CPU at all? Just have one counter and atomic_inc and
> atomic_read it. You won't need a spin lock.

Oh that works fine. But then it's a global counter, so each time
a CPU marks a page dirty, the counter needs to be pulled out of
another CPU's cache. Which is not a thing I *need* to do.

As I said, it's a micro-issue. But it's a requirement which
may pop up elsewhere.

> This would be atomic and thus preempt-safe on any sane arch I know, as
> long as we are dealing with a normal type int. Admittedly, however, we
> can't be sure what the compiler would do.
>
> Thinking about it, you are probably going to be doing this:
>
> ++counter[smp_processor_id()];
>
> and that is not preempt-safe since the whole operation certainly is not
> atomic. The current CPU could change between calculating it and
> referencing the array.

yup. It'd probably work - the compiler should calculate the address and
do a non-buslocked but IRQ-atomic increment on it. But no way can we
rely on that happening.

> But, that wouldn't matter as long as you only
> cared about the sum of the counters.

If the compiler produced code equivalent to

counter[smp_processor_id()] = counter[smp_processor_id()] + 1;

then the counter would get trashed - a context switch could cause CPUB
to write CPUA's counter (+1) onto CPUB's counter. It's quite possibly
illegal for the compiler to evaluate the array subscript twice in this
manner. Dunno.

If the compiler produced code equivalent to:

const int cpu = smp_processor_id();
counter[cpu] = counter[cpu] + 1;

(which is much more likely) then a context switch would result
in CPUB writing CPUA's updated counter onto CPUA's counter. Which
will work a lot more often, until CPUA happens to be updating its
counter at the same time.

> ...
> > 2: In <linux/atomic.h>,
> >
> > #ifndef ARCH_HAS_ATOMIC_INQ_THINGIES
> > #define atomic_inc_irq atomic_inc
> > ...
> > #endif
>
> I can think up a few more uses of the irq/memory-safe atomic ops, so I
> bet this isn't that bad of an idea. But no point doing it without a
> corresponding use.

Sure.

-

2002-02-23 09:38:40

by Russell King

[permalink] [raw]

Subject: Re: [PATCH] only irq-safe atomic ops

On Sat, Feb 23, 2002 at 02:29:48AM -0500, Robert Love wrote:
> This would be atomic and thus preempt-safe on any sane arch I know, as
> long as we are dealing with a normal type int. Admittedly, however, we
> can't be sure what the compiler would do.

Any sane RISC architecture probably doesn't have a single "increment this
memory location" instruction.

--
Russell King ([email protected]) The developer of ARM Linux
http://www.arm.linux.org.uk/personal/aboutme.html

2002-02-23 11:39:48

[permalink] [raw]

Subject: Re: [PATCH] only irq-safe atomic ops

On Fri, Feb 22, 2002 at 11:54:48PM -0800, Andrew Morton wrote:
> Robert Love wrote:
> > Thinking about it, you are probably going to be doing this:
> >
> > ++counter[smp_processor_id()];
> >
> > and that is not preempt-safe since the whole operation certainly is not
> > atomic. The current CPU could change between calculating it and
> > referencing the array.
>
> yup. It'd probably work - the compiler should calculate the address and
> do a non-buslocked but IRQ-atomic increment on it. But no way can we
> rely on that happening.
>
> > But, that wouldn't matter as long as you only
> > cared about the sum of the counters.
>
> If the compiler produced code equivalent to
>
> counter[smp_processor_id()] = counter[smp_processor_id()] + 1;
>
> then the counter would get trashed - a context switch could cause CPUB
> to write CPUA's counter (+1) onto CPUB's counter. It's quite possibly
> illegal for the compiler to evaluate the array subscript twice in this
> manner. Dunno.
>
> If the compiler produced code equivalent to:
>
> const int cpu = smp_processor_id();
> counter[cpu] = counter[cpu] + 1;
>
> (which is much more likely) then a context switch would result
> in CPUB writing CPUA's updated counter onto CPUA's counter. Which
> will work a lot more often, until CPUA happens to be updating its
> counter at the same time.

So without preemption in the kernel
maybe 4 instructions: calculate cpuid, inc; all local no cache ping
code is easy to read and understand.

with preemption in the kernel
a design problem. a slippery synchronization issue that
involves the characteristic preemption error - code that works
most of the time.

--
---------------------------------------------------------
Victor Yodaiken
Finite State Machine Labs: The RTLinux Company.
http://www.fsmlabs.com http://www.rtlinux.com

2002-02-23 18:20:24

[permalink] [raw]

Subject: Re: [PATCH] only irq-safe atomic ops

On Sat, 2002-02-23 at 06:38, [email protected] wrote:

> So without preemption in the kernel
> maybe 4 instructions: calculate cpuid, inc; all local no cache ping
> code is easy to read and understand.
>
> with preemption in the kernel
> a design problem. a slippery synchronization issue that
> involves the characteristic preemption error - code that works
> most of the time.

Or not. The topic of this thread was a micro-optimization. If we treat
the variable as anything normal requiring synchronization under SMP, any
of the standard solutions (atomic ops, etc.) work. If we want to get
fancy, we can disable preemption, use my atomic_irq ops, or just not
care.

Robert Love

2002-02-23 19:08:40

[permalink] [raw]

Subject: Re: [PATCH] only irq-safe atomic ops

On Sat, Feb 23, 2002 at 01:20:06PM -0500, Robert Love wrote:
> On Sat, 2002-02-23 at 06:38, [email protected] wrote:
>
> > So without preemption in the kernel
> > maybe 4 instructions: calculate cpuid, inc; all local no cache ping
> > code is easy to read and understand.
> >
> > with preemption in the kernel
> > a design problem. a slippery synchronization issue that
> > involves the characteristic preemption error - code that works
> > most of the time.
>
> Or not. The topic of this thread was a micro-optimization. If we treat
> the variable as anything normal requiring synchronization under SMP, any
> of the standard solutions (atomic ops, etc.) work. If we want to get

And cause cache ping, possible contention, ...

> fancy, we can disable preemption, use my atomic_irq ops, or just not
> care.

Right. Without preemption it is safe to do
c = smp_get_cpuid();
...
x = ++local_cache[c]
..

y = ++different_local_cache[c];
..

With preemption this turns into a problem that is easier to solve
with a lock or by not having per-cpu caches in the first place --
eg by writing worse code. After all, those are just micro-optimizations
and a few percent here, a few percent there, nobody will notice it. -(

--
---------------------------------------------------------
Victor Yodaiken
Finite State Machine Labs: The RTLinux Company.
http://www.fsmlabs.com http://www.rtlinux.com

2002-02-23 20:00:23

[permalink] [raw]

Subject: Re: [PATCH] only irq-safe atomic ops

Andrew Morton wrote:

>Robert Love wrote:
>
>>...
>>
>>Question: if (from below) you are going to use atomic operations, why
>>make it per-CPU at all? Just have one counter and atomic_inc and
>>atomic_read it. You won't need a spin lock.
>>
>
>Oh that works fine. But then it's a global counter, so each time
>a CPU marks a page dirty, the counter needs to be pulled out of
>another CPU's cache. Which is not a thing I *need* to do.
>
>As I said, it's a micro-issue. But it's a requirement which
>may pop up elsewhere.
>
>
I can tell you that Irix has just such a global counter for the amount of
delayed allocate pages - and it gets to be a major point of cache contention
once you get to larger cpu counts. So avoiding that from the start would
be good.

Steve

2002-02-23 20:29:20

[permalink] [raw]

Subject: Re: [PATCH] only irq-safe atomic ops

Stephen Lord wrote:
>
> Andrew Morton wrote:
>
> >Robert Love wrote:
> >
> >>...
> >>
> >>Question: if (from below) you are going to use atomic operations, why
> >>make it per-CPU at all? Just have one counter and atomic_inc and
> >>atomic_read it. You won't need a spin lock.
> >>
> >
> >Oh that works fine. But then it's a global counter, so each time
> >a CPU marks a page dirty, the counter needs to be pulled out of
> >another CPU's cache. Which is not a thing I *need* to do.
> >
> >As I said, it's a micro-issue. But it's a requirement which
> >may pop up elsewhere.
> >
> >
> I can tell you that Irix has just such a global counter for the amount of
> delayed allocate pages - and it gets to be a major point of cache contention
> once you get to larger cpu counts. So avoiding that from the start would
> be good.

Ah, good info. Thanks. I'll fix it with a big "FIXME" comment for now,
fix it for real when Rusty's per-CPU infrastructure appears.

-

2002-02-23 20:56:46

by Andi Kleen

[permalink] [raw]

Subject: Re: [PATCH] only irq-safe atomic ops

Andrew Morton <[email protected]> writes:
> > >
> > I can tell you that Irix has just such a global counter for the amount of
> > delayed allocate pages - and it gets to be a major point of cache contention
> > once you get to larger cpu counts. So avoiding that from the start would
> > be good.
>
> Ah, good info. Thanks. I'll fix it with a big "FIXME" comment for now,
> fix it for real when Rusty's per-CPU infrastructure appears.

Just curious -- how do you want to fix it for real?
As far as I can see a delalloc counter needs to be exact to avoid OOM
deadlocks, but making it per CPU would require doing the accounting inexact.

The only way I can imagine is to use the 'cookie jar' trick by grabbing
some amount per CPU and only going back to the global counter when the
local jar is used up, but that seems to have some drawbacks too, like
keeping a lot of pages free with higher CPU counts.

-Andi

2002-02-23 21:17:18

[permalink] [raw]

Subject: Re: [PATCH] only irq-safe atomic ops

Andrew Morton wrote:

>Andi Kleen wrote:
>
>>Andrew Morton <[email protected]> writes:
>>
>>>>I can tell you that Irix has just such a global counter for the amount of
>>>>delayed allocate pages - and it gets to be a major point of cache contention
>>>>once you get to larger cpu counts. So avoiding that from the start would
>>>>be good.
>>>>
>>>Ah, good info. Thanks. I'll fix it with a big "FIXME" comment for now,
>>>fix it for real when Rusty's per-CPU infrastructure appears.
>>>
>>Just curious -- how do you want to fix it for real?
>>As far as I can see a delalloc counter needs to be exact to avoid OOM
>>deadlocks, but making it per CPU would require doing the accounting inexact.
>>
>
>The counter is used only for making writeback decisions. It is
>completely analogous to nr_buffers_type[BUF_DIRTY], except it counts
>pages. So it does not need to be exact for readers.
>
>If we needed exact reader-accounting for the number of dirty pages in the
>machine then we'd need a ton of new locking in fun places like __free_pte(),
>and that still doesn't account for pages which are only pte-dirty, and it's
>not obvious what we'd do with reader-exact dirty page info anyway?
>
>
You do want to avoid a leak in one direction or the other, the os would
start to think it
had lots of dirty pages, but not be able to find them, or think there is
no shortage
when in fact there was.

Steve

2002-02-23 21:08:26

[permalink] [raw]

Subject: Re: [PATCH] only irq-safe atomic ops

Andi Kleen wrote:
>
> Andrew Morton <[email protected]> writes:
> > > >
> > > I can tell you that Irix has just such a global counter for the amount of
> > > delayed allocate pages - and it gets to be a major point of cache contention
> > > once you get to larger cpu counts. So avoiding that from the start would
> > > be good.
> >
> > Ah, good info. Thanks. I'll fix it with a big "FIXME" comment for now,
> > fix it for real when Rusty's per-CPU infrastructure appears.
>
> Just curious -- how do you want to fix it for real?
> As far as I can see a delalloc counter needs to be exact to avoid OOM
> deadlocks, but making it per CPU would require doing the accounting inexact.

The counter is used only for making writeback decisions. It is
completely analogous to nr_buffers_type[BUF_DIRTY], except it counts
pages. So it does not need to be exact for readers.

If we needed exact reader-accounting for the number of dirty pages in the
machine then we'd need a ton of new locking in fun places like __free_pte(),
and that still doesn't account for pages which are only pte-dirty, and it's
not obvious what we'd do with reader-exact dirty page info anyway?

-

2002-02-23 21:44:42

[permalink] [raw]

Subject: Re: [PATCH] only irq-safe atomic ops

Stephen Lord wrote:
>
> ...
> >If we needed exact reader-accounting for the number of dirty pages in the
> >machine then we'd need a ton of new locking in fun places like __free_pte(),
> >and that still doesn't account for pages which are only pte-dirty, and it's
> >not obvious what we'd do with reader-exact dirty page info anyway?
> >
> >
> You do want to avoid a leak in one direction or the other, the os would
> start to think it
> had lots of dirty pages, but not be able to find them, or think there is
> no shortage
> when in fact there was.

Oh absolutely. Even a one-page-per-hour leak would be catastrophic.

But there is a problem. If CPUA is always setting pages dirty,
and CPUB is always setting them clean, CPUA's counter will eventually
overflow, and CPUB's will underflow. Drat.

One fix for this would be special-case over- and under-flow handling:

if (TestSetPageDirty(page)) {
preempt_disable();
nr_dirty_pages[smp_processor_id()]++;
if (nr_dirty_pages[smp_processor_id()] > 2,000,000) {
fit_it_up();
}
preempt_enable();
}

void fix_it_up()
{
spin_lock(&global_counter_lock);
global_counter += 1,000,000;
nr_dirty_pages[smp_processor_id()] -= 1,000,000;
spin_unlock(&global_counter_lock);
}

int approx_total_dirty_pages()
{
int ret;

spin_lock(&global_counter_lock);
ret = global_counter;
for (all cpus) {
ret += nr_dirty_pages[cpu];
}
spin_unlock(&global_counter_lock);
return ret;
}

Or something like that.

Then again, maybe the underflows and overflows don't matter, because the
sum of all counters is always correct. Unless there's a counter roll-over
during the summation. No. Even then it's OK.

hmmm.

-

2002-02-23 21:57:34

by Roman Zippel

[permalink] [raw]

Subject: Re: [PATCH] only irq-safe atomic ops

Hi,

[email protected] wrote:

> Right. Without preemption it is safe to do
> c = smp_get_cpuid();
> ...
> x = ++local_cache[c]
> ..
>
> y = ++different_local_cache[c];
> ..

Just add:
smp_put_cpuid();

Is that so much worse?

bye, Roman

2002-02-23 22:00:33

by John Levon

[permalink] [raw]

Subject: Re: [PATCH] only irq-safe atomic ops

On Sat, Feb 23, 2002 at 12:06:48PM -0700, [email protected] wrote:

> With preemption this turns into a problem that is easier to solve
> with a lock or by not having per-cpu caches in the first place --

whilst you're correct (pre-emption has made everything harder), there's
not much point moaning about it, since it's in linus' kernel now. So
why bother ?

regards
john

--
"Oh dear, more knobs."
- David Chase

2002-02-23 22:12:16

[permalink] [raw]

Subject: Re: [PATCH] only irq-safe atomic ops

Roman Zippel wrote:
>
> Hi,
>
> [email protected] wrote:
>
> > Right. Without preemption it is safe to do
> > c = smp_get_cpuid();
> > ...
> > x = ++local_cache[c]
> > ..
> >
> > y = ++different_local_cache[c];
> > ..
>
> Just add:
> smp_put_cpuid();
>
> Is that so much worse?
>

ooh. me likee.

#define smp_get_cpuid() ({ preempt_disable(); smp_processor_id(); })
#define smp_put_cpuid() preempt_enable()

Does rml likee?

-

2002-02-23 22:09:56

[permalink] [raw]

Subject: Re: [PATCH] only irq-safe atomic ops

Andrew Morton wrote:

>Stephen Lord wrote:
>
>>You do want to avoid a leak in one direction or the other, the os would
>>start to think it
>>had lots of dirty pages, but not be able to find them, or think there is
>>no shortage
>>when in fact there was.
>>
>
>Oh absolutely. Even a one-page-per-hour leak would be catastrophic.
>
>But there is a problem. If CPUA is always setting pages dirty,
>and CPUB is always setting them clean, CPUA's counter will eventually
>overflow, and CPUB's will underflow. Drat.
>
>One fix for this would be special-case over- and under-flow handling:
>
> if (TestSetPageDirty(page)) {
> preempt_disable();
> nr_dirty_pages[smp_processor_id()]++;
> if (nr_dirty_pages[smp_processor_id()] > 2,000,000) {
> fit_it_up();
> }
> preempt_enable();
> }
>
> void fix_it_up()
> {
> spin_lock(&global_counter_lock);
> global_counter += 1,000,000;
> nr_dirty_pages[smp_processor_id()] -= 1,000,000;
> spin_unlock(&global_counter_lock);
> }
>
> int approx_total_dirty_pages()
> {
> int ret;
>
> spin_lock(&global_counter_lock);
> ret = global_counter;
> for (all cpus) {
> ret += nr_dirty_pages[cpu];
> }
> spin_unlock(&global_counter_lock);
> return ret;
> }
>
>Or something like that.
>
>Then again, maybe the underflows and overflows don't matter, because the
>sum of all counters is always correct. Unless there's a counter roll-over
>during the summation. No. Even then it's OK.
>
>hmmm.
>
>-
>

As I was plumbing in a sink ;-) the thought also occurred that you could
have allocate
and free counters per cpu. The fix up code could do leveling between
them. Are you
sure you only want to look at the actual value infrequently though?
Every time you
put a page into delalloc state you need to do something similar to
balance_dirty(),
if you only poll the value periodically then it could get way out of
balance before
you noticed. It is very easy to dirty memory this way, cat /dev/zero >
xxxx runs
very quickly.

I would almost say you need to get a 'reservation' of a delalloc page
before you
grab the memory. There are extra memory costs associated with getting the
page out of this state, and if you do not hold threads back from
dirtying pages
you can end up in a situation where you cannot clean up again.

Steve

2002-02-23 22:25:33

[permalink] [raw]

Subject: Re: [PATCH] only irq-safe atomic ops

On Sat, Feb 23, 2002 at 02:10:25PM -0800, Andrew Morton wrote:
> Roman Zippel wrote:
> >
> > Hi,
> >
> > [email protected] wrote:
> >
> > > Right. Without preemption it is safe to do
> > > c = smp_get_cpuid();
> > > ...
> > > x = ++local_cache[c]
> > > ..
> > >
> > > y = ++different_local_cache[c];
> > > ..
> >
> > Just add:
> > smp_put_cpuid();
> >
> > Is that so much worse?
> >
>
> ooh. me likee.

Cool.
Me likee code with unmatched smp_get_cpuid/smp_put_cpuid.
Much nicer to write
x = ++local_cache[smp_getcpuid()];
smp_put_cuid();
than boring old
x = ++ local_cache[c];

Is this part of some scheme to make the GPL support model actually
pay?

c = smp_get_cpuid(); // disables preemption

...
f(); // oops, me forgotee, this function also references cpuid
..
x = ++local_cache[c]; // live dangerously
smp_put_cpuid(); // G_d knows what that does now.

Oh, wait, I know - reference counts for get_cpuid! How hard can that
be? See how simple it is? One simple step at a time.

--
---------------------------------------------------------
Victor Yodaiken
Finite State Machine Labs: The RTLinux Company.
http://www.fsmlabs.com http://www.rtlinux.com

2002-02-23 22:37:13

[permalink] [raw]

Subject: Re: [PATCH] only irq-safe atomic ops

Stephen Lord wrote:
>
> ...
> As I was plumbing in a sink ;-) the thought also occurred that you could
> have allocate and free counters per cpu. The fix up code could do leveling between
> them.

yup. If fixup code is needed.

> Are you sure you only want to look at the actual value infrequently though?
> Every time you put a page into delalloc state you need to do something similar to
> balance_dirty(),

balance_dirty_pages() :)

> if you only poll the value periodically then it could get way out of
> balance before you noticed. It is very easy to dirty memory this way,
> cat /dev/zero > xxxx runs very quickly.

At present, I'm calling balance_dirty_pages() once per 32 pages within
generic_file_write(). And also once on the way out of generic_file_write().
Even if a zillion threads were writing at the same time, someone will do
an allocation and the VM will then send a pdflush thread off to perform
some writeback. I think this is all OK.

One possible wart with delayed allocate is that we need to go off
and instantiate disk mappings at writepage() time, which may cause
extra memory squeezes. I'm not particularly fussed about that at
this time. It's a VM problem...

balance_dirty_pages does:

if (dirty_pages > sync_threshold)
sync_writeback();
else if (dirty_pages > async_threshold)
async_writeback();
else if (dirty_pages > background_threshold)
tell_a_pdflush_thread_to_do_some_writeback();

Current code is at http://www.zip.com.au/~akpm/linux/patches/2.5/2.5.5/,
but I haven't released it yet :)

That code is presently shoving half-megabyte BIOs direct from
pagecache into the block layer. There are no buffer_heads used
at all in the write(2) path. One such BIO saves 128 buffer_head
allocations, 128 bio_allocs and a ton of request merging.

Unfortunately I seem to have found a bug in existing ext2, a bug
in existing block_write_full_page, a probable bug in the aic7xxx
driver, and an oops in the aic7xxx driver. So progress has slowed
down a bit :(

>
> I would almost say you need to get a 'reservation' of a delalloc page
> before you
> grab the memory. There are extra memory costs associated with getting the
> page out of this state, and if you do not hold threads back from
> dirtying pages
> you can end up in a situation where you cannot clean up again.
>

I reserve the space in the filesystem at ->prepare_write() time. If
there's no space, I force a writeback to cause all the outstanding
worst-case reservations to be collapsed into real disk mappings. If
there's still no space, prepare_write returns -ENOSPC.

I terms of memory use, I think balance_dirty_pages takes care of
everything. If there are too many dirty pages, the caller of
write(2) performs synchronous writeout, just as happens with the
legacy buffer_head-based writeout.

Early days yet.

-

2002-02-23 22:42:53

[permalink] [raw]

Subject: Re: [PATCH] only irq-safe atomic ops

[email protected] wrote:
>
> Is this part of some scheme to make the GPL support model actually
> pay?

No, no, no. It's all the uncommented code which brings in the dollars.

> c = smp_get_cpuid(); // disables preemption
>
> ...
> f(); // oops, me forgotee, this function also references cpuid
> ..
> x = ++local_cache[c]; // live dangerously
> smp_put_cpuid(); // G_d knows what that does now.
>

preempt_disable() nests. It's not a problem.

-

2002-02-23 22:45:53

[permalink] [raw]

Subject: Re: [PATCH] only irq-safe atomic ops

On Sat, Feb 23, 2002 at 10:00:04PM +0000, John Levon wrote:
> On Sat, Feb 23, 2002 at 12:06:48PM -0700, [email protected] wrote:
>
> > With preemption this turns into a problem that is easier to solve
> > with a lock or by not having per-cpu caches in the first place --
>
> whilst you're correct (pre-emption has made everything harder), there's
> not much point moaning about it, since it's in linus' kernel now. So
> why bother ?
>

everyone needs a hobby.

--
---------------------------------------------------------
Victor Yodaiken
Finite State Machine Labs: The RTLinux Company.
http://www.fsmlabs.com http://www.rtlinux.com

2002-02-23 22:49:53

[permalink] [raw]

Subject: Re: [PATCH] only irq-safe atomic ops

On Sat, Feb 23, 2002 at 02:40:58PM -0800, Andrew Morton wrote:
> [email protected] wrote:
> >
> > Is this part of some scheme to make the GPL support model actually
> > pay?
>
> No, no, no. It's all the uncommented code which brings in the dollars.
>
> > c = smp_get_cpuid(); // disables preemption
> >
> > ...
> > f(); // oops, me forgotee, this function also references cpuid
> > ..
> > x = ++local_cache[c]; // live dangerously
> > smp_put_cpuid(); // G_d knows what that does now.
> >
>
> preempt_disable() nests. It's not a problem.

Ah, got the reference count already! Good progress.
Is it irq safe yet?

BTW: Robert, whats a good kernel version that you recommend with
preemption patch? I want to rerun some tests.

--
---------------------------------------------------------
Victor Yodaiken
Finite State Machine Labs: The RTLinux Company.
http://www.fsmlabs.com http://www.rtlinux.com

2002-02-23 23:07:35

by Mike Fedyk

[permalink] [raw]

Subject: Re: [PATCH] only irq-safe atomic ops

On Sat, Feb 23, 2002 at 02:34:49PM -0800, Andrew Morton wrote:
> Unfortunately I seem to have found a bug in existing ext2, a bug
> in existing block_write_full_page, a probable bug in the aic7xxx
> driver, and an oops in the aic7xxx driver. So progress has slowed
> down a bit :(

Are these bugs in 2.4 also?

2002-02-23 23:13:36

[permalink] [raw]

Subject: Re: [PATCH] only irq-safe atomic ops

On Sat, 2002-02-23 at 17:10, Andrew Morton wrote:

> ooh. me likee.
>
> #define smp_get_cpuid() ({ preempt_disable(); smp_processor_id(); })
> #define smp_put_cpuid() preempt_enable()
>
> Does rml likee?

Yah, that works.

All we need to do is wrap the smp_processor_id references (however many
- it does not matter) in preempt_disable ... preempt_enable. This does
that cleanly.

Robert Love

2002-02-23 23:46:12

[permalink] [raw]

Subject: Re: [PATCH] only irq-safe atomic ops

On Sat, 2002-02-23 at 18:13, Robert Love wrote:
> On Sat, 2002-02-23 at 17:10, Andrew Morton wrote:
>
> > ooh. me likee.
> >
> > #define smp_get_cpuid() ({ preempt_disable(); smp_processor_id(); })
> > #define smp_put_cpuid() preempt_enable()
> >
> > Does rml likee?
>
> Yah, that works.

OK, I still likee, but I was just thinking, if we are going to add have
to add something why not consider the irq-safe atomic ops? It is
certainly the most optimal.

Robert Love

2002-02-23 23:49:14

[permalink] [raw]

Subject: Re: [PATCH] only irq-safe atomic ops

Mike Fedyk wrote:
>
> On Sat, Feb 23, 2002 at 02:34:49PM -0800, Andrew Morton wrote:
> > Unfortunately I seem to have found a bug in existing ext2, a bug
> > in existing block_write_full_page, a probable bug in the aic7xxx
> > driver, and an oops in the aic7xxx driver. So progress has slowed
> > down a bit :(
>
> Are these bugs in 2.4 also?

The test case is:

- 120 megabyte filesystem, 1k blocksize ext2. SMP.
- run `dbench 64 2>/dev/null >/dev/null', as root (dunno
if rootness matters, but it affects ext2 allocation policy).

This really hammers the out-of-space handling.

We get a stream of `__block_prepare_write: zeroing uptodate buffer'
messages. I added that message. I shall work out whether there's
a bug, or if the message just gets killed. This happens in 2.4 also.

We get a handful of `VFS: brelse: Trying to free free buffer' messages,
coming from this code:

/* Next simple case - plain lookup or failed read of indirect block */
if (!create || err == -EIO) {
cleanup:
while (partial > chain) {
brelse(partial->bh);
partial--;
}

in ext2_get_block(). This only happens in 2.5. It happens on uniprocessor.
Possibly a bug introduced by the changed ext2 locking in 2.5. I have not
investigated further.

With my I/O scheduling changes the SCSI driver gets all upset about
data overruns or such. This may be my fault. It's a 2.5 issue.

During recovery from the data overrun, the SCSI driver oopses
over manipulation of a dodgy-looking timer struct.

-

2002-02-23 23:58:28

[permalink] [raw]

Subject: Re: [PATCH] only irq-safe atomic ops

Robert Love wrote:
>
> On Sat, 2002-02-23 at 18:13, Robert Love wrote:
> > On Sat, 2002-02-23 at 17:10, Andrew Morton wrote:
> >
> > > ooh. me likee.
> > >
> > > #define smp_get_cpuid() ({ preempt_disable(); smp_processor_id(); })
> > > #define smp_put_cpuid() preempt_enable()
> > >
> > > Does rml likee?
> >
> > Yah, that works.
>
> OK, I still likee, but I was just thinking, if we are going to add have
> to add something why not consider the irq-safe atomic ops? It is
> certainly the most optimal.
>

For the situation Victor described:

const int cpu = smp_get_cpuid();

per_cpu_array_foo[cpu] = per_cpu_array_bar[cpu] +
per_cpu_array_zot[cpu];

smp_put_cpuid();

It's a nice interface - it says "pin down and return the current
CPU ID".

-

2002-02-24 01:08:58

[permalink] [raw]

Subject: Re: [PATCH] only irq-safe atomic ops

On Sat, 2002-02-23 at 20:05, [email protected] wrote:

> Go for it! Only 900 odd calls to smp_processor_id in the kernel
> source last time I checked.

... and surprisingly few of them need explicit protection.

Robert Love

2002-02-24 01:07:28

[permalink] [raw]

Subject: Re: [PATCH] only irq-safe atomic ops

On Sat, Feb 23, 2002 at 03:56:36PM -0800, Andrew Morton wrote:
> Robert Love wrote:
> >
> For the situation Victor described:
>
> const int cpu = smp_get_cpuid();
>
> per_cpu_array_foo[cpu] = per_cpu_array_bar[cpu] +
> per_cpu_array_zot[cpu];
>
> smp_put_cpuid();
>
> It's a nice interface - it says "pin down and return the current
> CPU ID".

Go for it! Only 900 odd calls to smp_processor_id in the kernel
source last time I checked.

--
---------------------------------------------------------
Victor Yodaiken
Finite State Machine Labs: The RTLinux Company.
http://www.fsmlabs.com http://www.rtlinux.com

2002-02-25 13:02:14

[permalink] [raw]

Subject: Re: [PATCH] only irq-safe atomic ops

Andrew Morton wrote:

>
>
>Unfortunately I seem to have found a bug in existing ext2, a bug
>in existing block_write_full_page, a probable bug in the aic7xxx
>driver, and an oops in the aic7xxx driver. So progress has slowed
>down a bit :(
>

Try this for the aic driver:

---
/export/xfs1/snapshots-2.5/linus-tree/linux/drivers/scsi/aic7xxx/aic7xxx_lin
ux.c Sun Feb 10 19:50:15 2002
+++ linux/drivers/scsi/aic7xxx/aic7xxx_linux.c Sun Jan 27 21:08:28 2002
@@ -1646,6 +1646,7 @@
scb->platform_data->xfer_len = 0;
ahc_set_residual(scb, 0);
ahc_set_sense_residual(scb, 0);
+ scb->sg_count = 0;
if (cmd->use_sg != 0) {
struct ahc_dma_seg *sg;
struct scatterlist *cur_seg;

You will need to use ignore white space on this, I had to use cut and
paste to
get it into email.

We hit this once bio got introduced and we maxed out the request size
for the
driver. Justin has the code in his next aic version.

Steve

2002-02-25 13:12:55

by Jens Axboe

[permalink] [raw]

Subject: Re: [PATCH] only irq-safe atomic ops

On Mon, Feb 25 2002, Stephen Lord wrote:
> Andrew Morton wrote:
>
> >
> >
> >Unfortunately I seem to have found a bug in existing ext2, a bug
> >in existing block_write_full_page, a probable bug in the aic7xxx
> >driver, and an oops in the aic7xxx driver. So progress has slowed
> >down a bit :(
> >
>
> Try this for the aic driver:

Steve,

DaveM's alternate fix has been in the 2.4 and 2.5 kernels for some time,
so given that Andrew is testing 2.5.5 this can't be the problem.

> We hit this once bio got introduced and we maxed out the request size
> for the driver. Justin has the code in his next aic version.

Just for the record, this was/is not a bio bug, it was an aic7xxx bug
that could trigger with or without bio.

--
Jens Axboe

2002-02-25 13:17:06

[permalink] [raw]

Subject: Re: [PATCH] only irq-safe atomic ops

Jens Axboe wrote:

>On Mon, Feb 25 2002, Stephen Lord wrote:
>
>>Andrew Morton wrote:
>>
>>>
>>>Unfortunately I seem to have found a bug in existing ext2, a bug
>>>in existing block_write_full_page, a probable bug in the aic7xxx
>>>driver, and an oops in the aic7xxx driver. So progress has slowed
>>>down a bit :(
>>>
>>Try this for the aic driver:
>>
>
>Steve,
>
>DaveM's alternate fix has been in the 2.4 and 2.5 kernels for some time,
>so given that Andrew is testing 2.5.5 this can't be the problem.
>

OK, I was not aware that went in, thanks.

>
>
>>We hit this once bio got introduced and we maxed out the request size
>>for the driver. Justin has the code in his next aic version.
>>
>
>Just for the record, this was/is not a bio bug, it was an aic7xxx bug
>that could trigger with or without bio.
>
Yep, bio just made it easier to get larger requests.

Steve

2002-02-25 19:44:00

[permalink] [raw]

Subject: Re: [PATCH] only irq-safe atomic ops

Stephen Lord wrote:
>
> Yep, bio just made it easier to get larger requests.
>

Which promptly go kersplat when you feed them into
submit_bio():

BUG_ON(bio_sectors(bio) > q->max_sectors);

Given that I'm hand-rolling a monster bio, I need to know
when to wrap it up and send it off, to avoid creating a bio
which is larger than the target device will accept. I'm currently
using the below patch. Am I right that this is missing API
functionality, or did I miss something?

Also, I could not find a way of querying the size of the vector
at *bi_io_vec. This is also information which would be helpful
when building large scatter/gather lists.

--- 2.5.5/drivers/block/ll_rw_blk.c~mpio-10-biobits Mon Feb 25 00:28:21 2002
+++ 2.5.5-akpm/drivers/block/ll_rw_blk.c Mon Feb 25 00:28:21 2002
@@ -1350,6 +1350,28 @@ static void end_bio_bh_io_sync(struct bi
}

/**
+ * bio_max_bytes: return the maximum number of bytes which can be
+ * placed in a single bio for a particular device.
+ *
+ * @dev: the device's kdev_t
+ *
+ * Each device has a maximum permissible queue size, and bios may
+ * not cover more data than that.
+ *
+ * Returns -ve on error.
+ */
+int bio_max_bytes(kdev_t dev)
+{
+ request_queue_t *q;
+ int ret = -1;
+
+ q = blk_get_queue(dev);
+ if (q)
+ ret = (q->max_sectors << 9);
+ return ret;
+}
+
+/**
* submit_bio: submit a bio to the block device layer for I/O
* @rw: whether to %READ or %WRITE, or maybe to %READA (read ahead)
* @bio: The &struct bio which describes the I/O
--- 2.5.5/include/linux/bio.h~mpio-10-biobits Mon Feb 25 00:28:21 2002
+++ 2.5.5-akpm/include/linux/bio.h Mon Feb 25 00:28:21 2002
@@ -204,5 +204,6 @@ extern struct bio *bio_copy(struct bio *
extern inline void bio_init(struct bio *);

extern int bio_ioctl(kdev_t, unsigned int, unsigned long);
+extern int bio_max_bytes(kdev_t dev);

#endif /* __LINUX_BIO_H */

-

2002-02-25 19:49:31

[permalink] [raw]

Subject: Re: [PATCH] only irq-safe atomic ops

On Mon, 2002-02-25 at 13:42, Andrew Morton wrote:
> Stephen Lord wrote:
> >
> > Yep, bio just made it easier to get larger requests.
> >
>
> Which promptly go kersplat when you feed them into
> submit_bio():
>
> BUG_ON(bio_sectors(bio) > q->max_sectors);
>
> Given that I'm hand-rolling a monster bio, I need to know
> when to wrap it up and send it off, to avoid creating a bio
> which is larger than the target device will accept. I'm currently
> using the below patch. Am I right that this is missing API
> functionality, or did I miss something?
>

I don't run into that one, but probably because I limit xfs to
use BIO_MAX_SECTORS, take a look at ll_rw_kio to see how that
splits things up. This of course does not take into account
any further restriction in an underlying queue.

Steve

--

Steve Lord voice: +1-651-683-3511
Principal Engineer, Filesystem Software email: [email protected]

2002-02-25 20:09:44