2002-08-22 02:14:00

by Andrew Morton

[permalink] [raw]
Subject: MM patches against 2.5.31


I've uploaded a rollup of pending fixes and feature work
against 2.5.31 to

http://www.zip.com.au/~akpm/linux/patches/2.5/2.5.31/2.5.31-mm1/

The rolled up patch there is suitable for ongoing testing and
development. The individual patches are in the broken-out/
directory and should all be documented.


broken-out/linus.patch
Incremental BK patch from Linus' tree

broken-out/page_reserved.patch
Test PageReserved in pagevec_release()

broken-out/scsi_hack.patch
Fix block-highmem for scsi

broken-out/page_cache_release_lru_fix.patch
Fix a race between __page_cache_release() and shrink_cache().

broken-out/page_cache_release_fix.patch
Fix __page_cache_release() bugs

broken-out/mvm.patch
Fix vmalloc bugs

broken-out/pte-chain-fix.patch
Fix a VM lockup on uniprocessors

broken-out/func-fix.patch
gcc-2.91.66 does not support __func__

broken-out/ext3-htree.patch
Indexed directories for ext3

broken-out/rmap-mapping-BUG.patch
Fix a BUG_ON(page->mapping == NULL) in try_to_unmap()

broken-out/misc.patch
misc fixlets

broken-out/tlb-speedup.patch
Reduce typical global TLB invalidation frequency by 35%

broken-out/buffer-slab-align.patch
Don't align the buffer_head slab on hardware cacheline boundaries

broken-out/zone-rename.patch
Rename zone_struct->zone, zonelist_struct->zonelist. Remove zone_t,
zonelist_t.

broken-out/per-zone-lru.patch
Per-zone page LRUs

broken-out/per-zone-lock.patch
Per-zone LRU list locking

broken-out/l1-max-size.patch
Infrastructure for determining the maximum L1 cache size which the kernel
may have to support.

broken-out/zone-lock-alignment.patch
Pad struct zone to ensure that the lru and buddy locks are in separate
cachelines.

broken-out/put_page_cleanup.patch
Clean up put_page() and page_cache_release().

broken-out/anon-batch-free.patch
Batched freeing and de-LRUing of anonymous pages

broken-out/writeback-sync.patch
Writeback fixes and tuneups

broken-out/ext3-inode-allocation.patch
Fix an ext3 deadlock

broken-out/ext3-o_direct.patch
O_DIRECT support for ext3.

broken-out/jfs-bio.patch
Convert JFS to use direct-to-BIO I/O

broken-out/discontig-paddr_to_pfn.patch
Convert page pointers into pfns for i386 NUMA

broken-out/discontig-setup_arch.patch
Rework setup_arch() for i386 NUMA

broken-out/discontig-mem_init.patch
Restructure mem_init for i386 NUMA

broken-out/discontig-i386-numa.patch
discontigmem support for i386 NUMA

broken-out/cleanup-mem_map-1.patch
Clean up lots of open-coded uses of mem_map[]. For ia32 NUMA

broken-out/zone-pages-reporting.patch
Fix the boot-time reporting of each zone's available pages

broken-out/enospc-recovery-fix.patch
Fix the __block_write_full_page() error path.

broken-out/fix-faults.patch
Back out the initial work for atomic copy_*_user()

broken-out/bkl-consolidate.patch
Consolidate per-arch lock_kernel() implementations.

broken-out/might_sleep.patch
Infrastructure to detect sleep-inside-spinlock bugs

broken-out/spin-lock-check.patch
spinlock/rwlock checking infrastructure

broken-out/atomic-copy_user.patch
Support for atomic copy_*_user()

broken-out/kmap_atomic_reads.patch
Use kmap_atomic() for generic_file_read()

broken-out/kmap_atomic_writes.patch
Use kmap_atomic() for generic_file_write()

broken-out/config-PAGE_OFFSET.patch
Configurable kernel/user memory split

broken-out/throttling-fix.patch
Fix throttling of heavy write()rs.

broken-out/dirty-state-accounting.patch
Make the global dirty memory accounting more accurate

broken-out/rd-cleanup.patch
Cleanup and fix the ramdisk driver (doesn't work right yet)


2002-08-22 11:23:59

by Christian Ehrhardt

[permalink] [raw]
Subject: Re: MM patches against 2.5.31

On Wed, Aug 21, 2002 at 07:29:04PM -0700, Andrew Morton wrote:
>
> I've uploaded a rollup of pending fixes and feature work
> against 2.5.31 to
>
> http://www.zip.com.au/~akpm/linux/patches/2.5/2.5.31/2.5.31-mm1/
>
> The rolled up patch there is suitable for ongoing testing and
> development. The individual patches are in the broken-out/
> directory and should all be documented.

Sorry, but we still have the page release race in multiple places.
Look at the following (page starts with page_count == 1):

Processor 1:
    refill_inactive: lines 378-395.
    As page count == 1 we'll continue with line 401.

Processor 2:
    __pagevec_release: line 138 calls release_pages.
    release_pages: lines 100-111: put_page_test_zero brings the
    page count to 0 and we'll continue at line 114. Note that this
    may happen while another processor holds the lru lock, i.e.
    there is no point in checking for page count == 0 with the lru
    lock held, because the lru lock doesn't protect against
    decrements of the page count after the check.

Processor 1:
    line 401: page_cache_get resurrects the page, page count is
    now 1.
    lines 402-448.
    line 448 calls __pagevec_release.
    __pagevec_release: line 138 calls release_pages.
    release_pages: lines 100-111: put_page_test_zero brings the
    page count back to 0 (!!!), i.e. we continue at line 114.

Processor 2:
    lines 114-123.
    The page count == 0 check in line 123 is successful and the
    page is returned to the buddy allocator.

Processor 1:
    lines 114-123.
    The page count == 0 check in line 123 is successful, i.e. the
    page is returned to the buddy allocator a second time. ===> BOOM


Neither the lru lock nor any of the page count == 0 checks can
prevent this from happening.

regards Christian

--
THAT'S ALL FOLKS!

2002-08-22 15:58:38

by Steven Cole

[permalink] [raw]
Subject: Re: MM patches against 2.5.31

On Wed, 2002-08-21 at 20:29, Andrew Morton wrote:
> I've uploaded a rollup of pending fixes and feature work
> against 2.5.31 to
>
> http://www.zip.com.au/~akpm/linux/patches/2.5/2.5.31/2.5.31-mm1/
>
> The rolled up patch there is suitable for ongoing testing and
> development. The individual patches are in the broken-out/
> directory and should all be documented.

The good news: I ran my dbench 1..128 stress test and for the first
time since 2.5.31-vanilla there were _no_ BUG()s reported at all.

The other news: from dmesg:
kjournald starting. Commit interval 5 seconds
EXT3 FS 2.4-0.9.16, 02 Dec 2001 on sd(8,3), internal journal
EXT3-fs: mounted filesystem with ordered data mode.
kjournald: page allocation failure. order:0, mode:0x0

The kjournald failure message came out with dbench 48 running on an ext3
partition. The test continued with only this one instance of this
message.

Steven

2002-08-22 16:03:33

by Martin J. Bligh

[permalink] [raw]
Subject: Re: MM patches against 2.5.31

> kjournald: page allocation failure. order:0, mode:0x0

I've seen this before, but am curious how we ever passed
a gfpmask (aka mode) of 0 to __alloc_pages? Can't see anywhere
that does this?

Thanks,

M.

2002-08-22 19:45:11

by Steven Cole

[permalink] [raw]
Subject: Re: MM patches against 2.5.31

On Thu, 2002-08-22 at 10:06, Martin J. Bligh wrote:
> > kjournald: page allocation failure. order:0, mode:0x0
>
> I've seen this before, but am curious how we ever passed
> a gfpmask (aka mode) of 0 to __alloc_pages? Can't see anywhere
> that does this?
>
> Thanks,
>
> M.

I ran dbench 1..128 on 2.5.31-mm1 several more times with nothing
unusual happening, and then got this from pdflush with dbench 96.

pdflush: page allocation failure. order:0, mode:0x0

FWIW, this 2.5.31-mm1 kernel is SMP, HIGHMEM4G, no PREEMPT.

Steven

2002-08-26 01:37:27

by Andrew Morton

[permalink] [raw]
Subject: Re: MM patches against 2.5.31

Christian Ehrhardt wrote:
>
> On Wed, Aug 21, 2002 at 07:29:04PM -0700, Andrew Morton wrote:
> >
> > I've uploaded a rollup of pending fixes and feature work
> > against 2.5.31 to
> >
> > http://www.zip.com.au/~akpm/linux/patches/2.5/2.5.31/2.5.31-mm1/
> >
> > The rolled up patch there is suitable for ongoing testing and
> > development. The individual patches are in the broken-out/
> > directory and should all be documented.
>
> Sorry, but we still have the page release race in multiple places.
> Look at the following (page starts with page_count == 1):
>

So we do. It's a hugely improbable race, so there's no huge rush
to fix it. Looks like the same race is present in -ac kernels,
actually, if add_to_swap() fails. Also perhaps 2.4 is exposed if
swap_writepage() is synchronous, and page reclaim races with
zap_page_range. ho-hum.


What I'm inclined to do there is to change __page_cache_release()
to not attempt to free the page at all. Just let it sit on the
LRU until page reclaim encounters it. With the anon-free-via-pagevec
patch, very, very, very few pages actually get their final release in
__page_cache_release() - zero on uniprocessor, I expect.

And change pagevec_release() to take the LRU lock before dropping
the refcount on the pages.

That means there will need to be two flavours of pagevec_release():
one which expects the pages to become freeable (and takes the LRU
lock in anticipation of this). And one which doesn't expect the
pages to become freeable. The latter will leave the occasional
zero-count page on the LRU, as above.

Sound sane?
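
For concreteness, a rough sketch of what the two flavours might look like.
This is not code from the -mm tree; the function names and the shape of the
loops are made up for illustration, and only the usual 2.5 page/LRU
primitives (put_page_testzero, TestClearPageLRU, _pagemap_lru_lock, ...)
are assumed:

	/* Caller expects these pages to become freeable: take the LRU lock
	 * before dropping the refcounts, so the final put and the LRU
	 * removal cannot race with shrink_cache(). */
	static void release_pages_expect_free(struct page **pages, int nr)
	{
		int i;

		spin_lock_irq(&_pagemap_lru_lock);
		for (i = 0; i < nr; i++) {
			struct page *page = pages[i];

			if (put_page_testzero(page)) {
				if (TestClearPageLRU(page)) {
					if (PageActive(page))
						del_page_from_active_list(page);
					else
						del_page_from_inactive_list(page);
				}
				__free_pages_ok(page, 0);
			}
		}
		spin_unlock_irq(&_pagemap_lru_lock);
	}

	/* Caller does not expect the pages to become freeable: just drop
	 * the references.  The occasional page which does reach a zero
	 * count stays on the LRU and is reaped later by page reclaim. */
	static void release_pages_no_free(struct page **pages, int nr)
	{
		int i;

		for (i = 0; i < nr; i++)
			atomic_dec(&pages[i]->count);
	}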

2002-08-26 02:00:13

by Andrew Morton

[permalink] [raw]
Subject: Re: MM patches against 2.5.31

"Martin J. Bligh" wrote:
>
> > kjournald: page allocation failure. order:0, mode:0x0
>
> I've seen this before, but am curious how we ever passed
> a gfpmask (aka mode) of 0 to __alloc_pages? Can't see anywhere
> that does this?

Could be anywhere, really. A network interrupt doing GFP_ATOMIC
while kjournald is executing. A radix-tree node allocation
on the add-to-swap path perhaps. (The swapout failure messages
aren't supposed to come out, but mempool_alloc() stomps on the
caller's setting of PF_NOWARN.)

Or:

mnm:/usr/src/25> grep -r GFP_ATOMIC drivers/scsi/*.c | wc -l
89

2002-08-26 02:05:57

by Martin J. Bligh

[permalink] [raw]
Subject: Re: MM patches against 2.5.31

>> > kjournald: page allocation failure. order:0, mode:0x0
>>
>> I've seen this before, but am curious how we ever passed
>> a gfpmask (aka mode) of 0 to __alloc_pages? Can't see anywhere
>> that does this?
>
> Could be anywhere, really. A network interrupt doing GFP_ATOMIC
> while kjournald is executing. A radix-tree node allocation
> on the add-to-swap path perhaps. (The swapout failure messages
> aren't supposed to come out, but mempool_alloc() stomps on the
> caller's setting of PF_NOWARN.)
>
> Or:
>
> mnm:/usr/src/25> grep -r GFP_ATOMIC drivers/scsi/*.c | wc -l
> 89

No, GFP_ATOMIC is not 0:

#define __GFP_HIGH 0x20 /* Should access emergency pools? */
#define GFP_ATOMIC (__GFP_HIGH)

Looking at all the options:

#define __GFP_WAIT 0x10 /* Can wait and reschedule? */
#define __GFP_HIGH 0x20 /* Should access emergency pools? */
#define __GFP_IO 0x40 /* Can start low memory physical IO? */
#define __GFP_HIGHIO 0x80 /* Can start high mem physical IO? */
#define __GFP_FS 0x100 /* Can call down to low-level FS? */

What worries me is that 0 seems to mean "you can't do anything
to try and free it, but you can't access the emergency pools either".
Seems doomed to failure to me. And the standard sets we have are

#define GFP_NOHIGHIO ( __GFP_WAIT | __GFP_IO)
#define GFP_NOIO ( __GFP_WAIT)
#define GFP_NOFS ( __GFP_WAIT | __GFP_IO | __GFP_HIGHIO)
#define GFP_ATOMIC (__GFP_HIGH)
#define GFP_USER ( __GFP_WAIT | __GFP_IO | __GFP_HIGHIO | __GFP_FS)
#define GFP_HIGHUSER ( __GFP_WAIT | __GFP_IO | __GFP_HIGHIO | __GFP_FS | __GFP_HIGHMEM)
#define GFP_KERNEL ( __GFP_WAIT | __GFP_IO | __GFP_HIGHIO | __GFP_FS)
#define GFP_NFS ( __GFP_WAIT | __GFP_IO | __GFP_HIGHIO | __GFP_FS)
#define GFP_KSWAPD ( __GFP_WAIT | __GFP_IO | __GFP_HIGHIO | __GFP_FS)

So I think someone's screwed something up, and this is accidental.
Or I'm just totally misunderstanding this, which is perfectly
possible.

M.

2002-08-26 02:17:06

by Andrew Morton

[permalink] [raw]
Subject: Re: MM patches against 2.5.31

"Martin J. Bligh" wrote:
>
> >> > kjournald: page allocation failure. order:0, mode:0x0
> >>
> >> I've seen this before, but am curious how we ever passed
> >> a gfpmask (aka mode) of 0 to __alloc_pages? Can't see anywhere
> >> that does this?
> >
> > Could be anywhere, really. A network interrupt doing GFP_ATOMIC
> > while kjournald is executing. A radix-tree node allocation
> > on the add-to-swap path perhaps. (The swapout failure messages
> > aren't supposed to come out, but mempool_alloc() stomps on the
> > caller's setting of PF_NOWARN.)
> >
> > Or:
> >
> > mnm:/usr/src/25> grep -r GFP_ATOMIC drivers/scsi/*.c | wc -l
> > 89
>
> No, GFP_ATOMIC is not 0:
>

It's mempool_alloc(GFP_NOIO) or such. mempool_alloc() strips
__GFP_WAIT|__GFP_IO on the first attempt.

It also disables the printk, so maybe I just dunno ;) show_stack()
would tell.
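
For what it's worth, the arithmetic works out: with the 2.5.31 flag values
Martin quoted earlier, GFP_NOIO is just __GFP_WAIT, so stripping
__GFP_WAIT|__GFP_IO on the first attempt leaves a mask of 0. A trivial
standalone userspace check (not kernel code):

	#include <stdio.h>

	#define __GFP_WAIT	0x10	/* values as quoted in this thread */
	#define __GFP_IO	0x40
	#define GFP_NOIO	(__GFP_WAIT)

	int main(void)
	{
		unsigned int first_try = GFP_NOIO & ~(__GFP_WAIT | __GFP_IO);

		/* prints "mode on first attempt: 0x0" */
		printf("mode on first attempt: 0x%x\n", first_try);
		return 0;
	}

which matches the "mode:0x0" in the kjournald and pdflush messages.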

2002-08-26 03:03:25

by Steven Cole

[permalink] [raw]
Subject: Re: MM patches against 2.5.31

On Sun, 2002-08-25 at 20:32, Andrew Morton wrote:
> "Martin J. Bligh" wrote:
> >
> > >> > kjournald: page allocation failure. order:0, mode:0x0
> > >>
> > >> I've seen this before, but am curious how we ever passed
> > >> a gfpmask (aka mode) of 0 to __alloc_pages? Can't see anywhere
> > >> that does this?
> > >
> > > Could be anywhere, really. A network interrupt doing GFP_ATOMIC
> > > while kjournald is executing. A radix-tree node allocation
> > > on the add-to-swap path perhaps. (The swapout failure messages
> > > aren't supposed to come out, but mempool_alloc() stomps on the
> > > caller's setting of PF_NOWARN.)
> > >
> > > Or:
> > >
> > > mnm:/usr/src/25> grep -r GFP_ATOMIC drivers/scsi/*.c | wc -l
> > > 89
> >
> > No, GFP_ATOMIC is not 0:
> >
>
> It's mempool_alloc(GFP_NOIO) or such. mempool_alloc() strips
> __GFP_WAIT|__GFP_IO on the first attempt.
>
> It also disables the printk, so maybe I just dunno ;) show_stack()
> would tell.
>

The "kjournald: page allocation failure. order:0, mode:0x0" message and
"pdflush: page allocation failure. order:0, mode:0x0" occurred only once
each on my dual p3 scsi ext3 test box running 2.5.31-mm1. So, I added
something like this:
--- page_alloc.c.orig Thu Aug 22 17:27:32 2002
+++ page_alloc.c Thu Aug 22 17:29:24 2002
@@ -388,6 +388,8 @@
printk("%s: page allocation failure."
" order:%d, mode:0x%x\n",
current->comm, order, gfp_mask);
+ if (gfp_mask == 0)
+ BUG();
}
return NULL;
}

and continued testing on Friday with no repeats of the "page allocation failure"
messages. I obtained a second dual p3 ext3 test box (ide this time) and left both
boxes running 2.5.31-mm1 and the dbench 1..128 stress test scripted to rerun many
times over the weekend. Due to a couple of firewalls, I can't look at those boxes
from here, but I'll let you know what happened in about 10 to 11 hours.

Cheers,
Steven

2002-08-26 09:06:26

by Christian Ehrhardt

[permalink] [raw]
Subject: Re: MM patches against 2.5.31

On Sun, Aug 25, 2002 at 06:52:55PM -0700, Andrew Morton wrote:
> Christian Ehrhardt wrote:
> >
> > On Wed, Aug 21, 2002 at 07:29:04PM -0700, Andrew Morton wrote:
> > >
> > > I've uploaded a rollup of pending fixes and feature work
> > > against 2.5.31 to
> > >
> > > http://www.zip.com.au/~akpm/linux/patches/2.5/2.5.31/2.5.31-mm1/
> > >
> > > The rolled up patch there is suitable for ongoing testing and
> > > development. The individual patches are in the broken-out/
> > > directory and should all be documented.
> >
> > Sorry, but we still have the page release race in multiple places.
> > Look at the following (page starts with page_count == 1):
> >
>
> So we do. It's a hugely improbable race, so there's no huge rush
> to fix it.

Actually the race seems to happen in real life (it does explain the
pte.chain != NULL BUG) and it is not that improbable with preempt.

> Looks like the same race is present in -ac kernels,
> actually, if add_to_swap() fails. Also perhaps 2.4 is exposed if
> swap_writepage() is synchronous, and page reclaim races with
> zap_page_range. ho-hum.

I didn't check each kernel, but I agree that most of the recent kernels
have the potential for this race. Your tree just happened to be the
one where I found it first.

> What I'm inclined to do there is to change __page_cache_release()
> to not attempt to free the page at all. Just let it sit on the
> LRU until page reclaim encounters it. With the anon-free-via-pagevec
> patch, very, very, very few pages actually get their final release in
> __page_cache_release() - zero on uniprocessor, I expect.

It's not just a problem with __page_cache_release, but yes, it seems
to be an SMP-only race.

> And change pagevec_release() to take the LRU lock before dropping
> the refcount on the pages.
>
> That means there will need to be two flavours of pagevec_release():
> one which expects the pages to become freeable (and takes the LRU
> lock in anticipation of this). And one which doesn't expect the
> pages to become freeable. The latter will leave the occasional
> zero-count page on the LRU, as above.
>
> Sound sane?

So the rules would be:
* if you bring the page count to zero, call __free_pages_ok unless the
  page is on the lru.
* if someone (page reclaim) walking the lru finds a page with page count
  zero, remove it from the lru and call __free_pages_ok.

This requires that ANYTHING that ends up calling put_page_testzero
must happen under the lru lock. This doesn't sound like a good idea,
but it seems to be possible to do it race-free.

I'd actually go for the following (the patch below is in "it compiles"
state but otherwise untested; it is there to illustrate what I'm talking
about):

The basic idea is to move the logic into page_cache_get. The rules
would be:
1. If you bring the page count to zero (using put_page_testzero) it
   is your job to call __free_pages_ok eventually. Before doing so,
   make sure that the page is no longer on the lru.
2. You may only call page_cache_get to duplicate an existing reference,
   i.e. page_cache_get could be made to BUG_ON(page_count(page) == 0).
3. If you got a pointer to a page without holding a reference (this is
   only allowed to happen if we found the pointer on an lru list),
   call page_cache_get_lru UNDER the lru lock and just leave the page
   alone if that would resurrect the page. page_cache_get_lru would
   basically look like this (implementation details below):

	int page_cache_get_lru (struct page * page) {
		if (!atomic_inc_and_test_for_one (&page->count))
			return 1;
		atomic_dec (&page->count);
		return 0;
	}

Proof of correctness:
A page is called dead if its page count reached zero before (no matter
what the page count currently is). Once a page is dead there can be
at most two pointers to the page: One held by the lru and the other
one held by the thread freeing the page. Any thread accessing the page
via the lru list will first call page_cache_get_lru under the lru lock,
the thread freeing the page will not read the page count anymore.
As page_cache_get_lru will not resurrect the page there will never
be a page count != 0 visible outside the lru lock on a dead page.
This means that each thread trying to access the dead page via the lru
list will detect that the page is dead and leave it alone. It follows
that each page is freed at most once.
Suppose a page could be leaked under these rules. This would require
someone calling __put_page (or atomic_dec (&page->count)) to bring the
page count to zero on a not already dead page. However, the only place
where this happens is in page_cache_get_lru and it only happens if the
page is already dead.

Now let's look at the ugly part: implementation.
The basic problem is that we don't have an atomic_inc_and_test_for_one
function and it is unlikely that we'll get one on all architectures. The
solution (and this is the ugly part) is a change in the semantics of
page->count. The value -1 now means no reference, 0 means one reference etc.

This way we can define
put_page_testzero as atomic_add_negative (-1, &page->count); and
get_page_testzero as atomic_inc_and_test (&page->count);
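
To make the biased counting concrete, here is a small standalone userspace
sketch. It is not kernel code; it just mimics the macros above with C11
atomics, and the helper names mirror the patch:

	#include <assert.h>
	#include <stdatomic.h>
	#include <stdio.h>

	struct page { atomic_int count; };	/* stores real count - 1; -1 == no references */

	static void set_page_count(struct page *p, int v) { atomic_store(&p->count, v - 1); }
	static int page_count(struct page *p) { return atomic_load(&p->count) + 1; }

	/* true when this put took the real count to zero (atomic_add_negative) */
	static int put_page_testzero(struct page *p)
	{
		return atomic_fetch_sub(&p->count, 1) - 1 < 0;
	}

	/* true when this get resurrected a dead page, real count 0 -> 1
	 * (atomic_inc_and_test) */
	static int get_page_testzero(struct page *p)
	{
		return atomic_fetch_add(&p->count, 1) + 1 == 0;
	}

	int main(void)
	{
		struct page pg;

		set_page_count(&pg, 1);			/* one reference */
		assert(page_count(&pg) == 1);
		assert(put_page_testzero(&pg));		/* last put: caller must free */

		/* an lru walker grabbing the now-dead page sees the
		 * resurrection and must back off, cf. page_cache_get_lru */
		assert(get_page_testzero(&pg));
		atomic_fetch_sub(&pg.count, 1);		/* __put_page(): back off */

		printf("biased page count behaves as described\n");
		return 0;
	}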

Here's the promised (untested) patch against bk7 to illustrate what
I'm talking about:

diff -ur linux-2.5.31-bk7/include/linux/mm.h linux-2.5.31-cae/include/linux/mm.h
--- linux-2.5.31-bk7/include/linux/mm.h Sun Aug 25 18:30:38 2002
+++ linux-2.5.31-cae/include/linux/mm.h Sun Aug 25 21:40:57 2002
@@ -184,6 +184,11 @@
/*
* Methods to modify the page usage count.
*
+ * NOTE: Real page counts start at -1 for no reference. This is a hack
+ * to be able to implement get_page_testzero with the existing portable
+ * atomic functions. The value exposed via set_page_count and page_count
+ * is (1+page->count).
+ *
* What counts for a page usage:
* - cache mapping (page->mapping)
* - private data (page->private)
@@ -192,12 +197,25 @@
*
* Also, many kernel routines increase the page count before a critical
* routine so they can be sure the page doesn't go away from under them.
+ *
+ * A special Problem is the lru lists. Presence on one of these lists
+ * does not increase the page count. The FIRST thread that brings the
+ * page count back to zero is responsible to remove the page from the
+ * lru list and actually free it (__free_pages_ok). This means that we
+ * can only get a reference to a page that is on a lru list, if this
+ * page is not already dead, i.e. about to be removed from the lru list.
+ * To do this we call get_page_testzero which will increment the page
+ * count and return true if we just resurrected the page i.e. the real
+ * page->count is now zero indicating one user. In this case we drop
+ * the reference again using __put_page. Both calls must happen under
+ * the lru lock.
*/
#define get_page(p) atomic_inc(&(p)->count)
#define __put_page(p) atomic_dec(&(p)->count)
-#define put_page_testzero(p) atomic_dec_and_test(&(p)->count)
-#define page_count(p) atomic_read(&(p)->count)
-#define set_page_count(p,v) atomic_set(&(p)->count, v)
+#define put_page_testzero(p) atomic_add_negative(-1, &(p)->count)
+#define page_count(p) (1+atomic_read(&(p)->count))
+#define set_page_count(p,v) atomic_set(&(p)->count, v-1)
+#define get_page_testzero(p) atomic_inc_and_test(&(p)->count)
extern void FASTCALL(__page_cache_release(struct page *));
#define put_page(p) \
do { \
diff -ur linux-2.5.31-bk7/include/linux/pagemap.h linux-2.5.31-cae/include/linux/pagemap.h
--- linux-2.5.31-bk7/include/linux/pagemap.h Sun Aug 25 18:30:38 2002
+++ linux-2.5.31-cae/include/linux/pagemap.h Sun Aug 25 21:56:30 2002
@@ -22,7 +22,43 @@
#define PAGE_CACHE_MASK PAGE_MASK
#define PAGE_CACHE_ALIGN(addr) (((addr)+PAGE_CACHE_SIZE-1)&PAGE_CACHE_MASK)

-#define page_cache_get(x) get_page(x)
+/*
+ * Get a reference to the page. This function must not be called on
+ * a dead page, i.e. a page that has page count zero. If the page is
+ * still on a lru_list use page_cache_get_lru instead.
+ */
+static inline void page_cache_get (struct page * page)
+{
+ BUG_ON(page_count(page) == 0);
+ get_page(page);
+}
+
+/*
+ * Try to get a reference to page that we found on an lru list.
+ * The lru lists may contain pages with page count zero. We must
+ * not take a reference to such a page because it is already about
+ * to be freed (once it is of the lru lists). If we'd take a reference
+ * the page would eventually be freed twice.
+ *
+ * The return value is true if we sucessfully incremented the page count.
+ *
+ * This function must be called with the lru lock held.
+ */
+static inline int page_cache_get_lru (struct page * page)
+{
+ /*
+ * Yes there is a window where the page count is not zero
+ * even though the page is dead. This is one of the reasons
+ * why the caller must hold the lru lock. Due to the lru_lock
+ * only the thread that is about to free the page can have
+ * a reference to this page. This thread will not test the
+ * page count anymore.
+ */
+ if (!get_page_testzero (page))
+ return 1;
+ __put_page (page);
+ return 0;
+}

static inline void page_cache_release(struct page *page)
{
diff -ur linux-2.5.31-bk7/mm/swap.c linux-2.5.31-cae/mm/swap.c
--- linux-2.5.31-bk7/mm/swap.c Sun Aug 25 18:30:38 2002
+++ linux-2.5.31-cae/mm/swap.c Sun Aug 25 11:28:55 2002
@@ -77,7 +77,6 @@
void __page_cache_release(struct page *page)
{
unsigned long flags;
- BUG_ON(page_count(page) != 0);

spin_lock_irqsave(&_pagemap_lru_lock, flags);
if (TestClearPageLRU(page)) {
@@ -86,11 +85,8 @@
else
del_page_from_inactive_list(page);
}
- if (page_count(page) != 0)
- page = NULL;
spin_unlock_irqrestore(&_pagemap_lru_lock, flags);
- if (page)
- __free_pages_ok(page, 0);
+ __free_pages_ok(page, 0);
}

/*
@@ -131,8 +127,7 @@
else
del_page_from_inactive_list(page);
}
- if (page_count(page) == 0)
- pagevec_add(&pages_to_free, page);
+ pagevec_add(&pages_to_free, page);
}
if (lock_held)
spin_unlock_irq(&_pagemap_lru_lock);
diff -ur linux-2.5.31-bk7/mm/vmscan.c linux-2.5.31-cae/mm/vmscan.c
--- linux-2.5.31-bk7/mm/vmscan.c Sun Aug 25 18:30:38 2002
+++ linux-2.5.31-cae/mm/vmscan.c Sun Aug 25 21:44:07 2002
@@ -92,6 +92,10 @@
return page_count(page) - !!PagePrivate(page) == 2;
}

+/*
+ * The caller must hold a reference to each page in the list. We drop
+ * this reference if and only if we remove the page from the page_list.
+ */
static /* inline */ int
shrink_list(struct list_head *page_list, int nr_pages, zone_t *classzone,
unsigned int gfp_mask, int priority, int *max_scan)
@@ -295,24 +299,23 @@
spin_lock_irq(&_pagemap_lru_lock);
while (max_scan > 0 && nr_pages > 0) {
struct page *page;
+ struct list_head * curr;
int n = 0;

- while (n < nr_to_process && !list_empty(&inactive_list)) {
- page = list_entry(inactive_list.prev, struct page, lru);
+ curr = inactive_list.prev;
+ while (n < nr_to_process && (&inactive_list != curr)) {
+ page = list_entry(curr, struct page, lru);

- prefetchw_prev_lru_page(page, &inactive_list, flags);
+ prefetchw_prev_lru_page(page, curr, flags);
+ curr = curr->prev;

+ /* Is the page already dead ? */
+ if (!page_cache_get_lru (page))
+ continue;
if (!TestClearPageLRU(page))
BUG();
list_del(&page->lru);
- if (page_count(page) == 0) {
- /* It is currently in pagevec_release() */
- SetPageLRU(page);
- list_add(&page->lru, &inactive_list);
- continue;
- }
list_add(&page->lru, &page_list);
- page_cache_get(page);
n++;
}
spin_unlock_irq(&_pagemap_lru_lock);
@@ -381,15 +384,19 @@
LIST_HEAD(l_active); /* Pages to go onto the active_list */
struct page *page;
struct pagevec pvec;
+ struct list_head *curr;

lru_add_drain();
spin_lock_irq(&_pagemap_lru_lock);
- while (nr_pages && !list_empty(&active_list)) {
- page = list_entry(active_list.prev, struct page, lru);
- prefetchw_prev_lru_page(page, &active_list, flags);
+ curr = active_list.prev;
+ while (nr_pages && (&active_list != curr)) {
+ page = list_entry(curr, struct page, lru);
+ prefetchw_prev_lru_page(page, curr, flags);
+ curr = curr->prev;
+ if (!page_cache_get_lru (page))
+ continue;
if (!TestClearPageLRU(page))
BUG();
- page_cache_get(page);
list_move(&page->lru, &l_hold);
nr_pages--;
}

--
THAT'S ALL FOLKS!

2002-08-26 14:17:22

by Daniel Phillips

[permalink] [raw]
Subject: Re: MM patches against 2.5.31

On Monday 26 August 2002 11:10, Christian Ehrhardt wrote:
> + * A special Problem is the lru lists. Presence on one of these lists
> + * does not increase the page count.

Please remind me... why should it not?

--
Daniel

2002-08-26 15:25:35

by Christian Ehrhardt

[permalink] [raw]
Subject: Re: MM patches against 2.5.31

On Mon, Aug 26, 2002 at 04:22:50PM +0200, Daniel Phillips wrote:
> On Monday 26 August 2002 11:10, Christian Ehrhardt wrote:
> > + * A special Problem is the lru lists. Presence on one of these lists
> > + * does not increase the page count.
>
> Please remind me... why should it not?

Pages that are only on the lru but not referenced by anyone are of no
use and we want to free them immediately. If we leave them on the lru
list with a page count of 1, someone else will have to walk the lru
list and remove pages that are only on the lru. One could argue that
try_to_free_pages could do this but try_to_free_pages will process the
pages in lru order and push out other pages first.
The next suggestion that comes to mind is: Let's have some magic in
page_cache_release that will remove the page from the lru list if
it is actually dead. However, this raises the question: How do we detect
that a page is now dead? The answer is something along the lines of

	if (put_page_testzero (page)) {
		__free_pages_ok (page, 0);
		return;
	}
	spin_lock_irq(pagemap_lru_lock);
	if (PageLRU(page) && (page_count(page) == 1)) {
		lru_cache_del (page);
		spin_unlock_irq(pagemap_lru_lock);
		page_cache_release (page);
		return;
	}
	spin_unlock_irq(pagemap_lru_lock);
	return;

The sad truth is that this solution has all the same races that
we have now, plus it makes the fast path (decreasing the page count
to something non-zero) slower. One problem in the above would be
that the last reference might well not be due to the lru
cache, i.e. at the time we call PageLRU(page) the page might
have been freed by another processor.

I know the idea is appealing (see one of my earlier mails on the
subject ;-) ) but it doesn't solve the problem.

regards Christian Ehrhardt

--
THAT'S ALL FOLKS!

2002-08-26 17:52:15

by Linus Torvalds

[permalink] [raw]
Subject: Re: MM patches against 2.5.31

In article <[email protected]>,
Andrew Morton <[email protected]> wrote:
>
>What I'm inclined to do there is to change __page_cache_release()
>to not attempt to free the page at all. Just let it sit on the
>LRU until page reclaim encounters it. With the anon-free-via-pagevec
>patch, very, very, very few pages actually get their final release in
>__page_cache_release() - zero on uniprocessor, I expect.

If you do this, then I would personally suggest a conceptually different
approach: make the LRU list count towards the page count. That will
_automatically_ result in what you describe - if a page is on the LRU
list, then "freeing" it will always just decrement the count, and the
_real_ free comes from walking the LRU list and considering count==1 to
be trivially freeable.
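
For illustration only (this is not from any posted patch), the reclaim side
under that scheme might look like the fragment below, inside a
shrink_cache()-style walk with the lru lock held; count == 1 would mean the
LRU holds the only reference:

	if (page_count(page) == 1) {
		/* Only the LRU references this page, and new references
		 * can only be taken under the lru lock which we hold:
		 * it is trivially freeable. */
		if (!TestClearPageLRU(page))
			BUG();
		list_del(&page->lru);
		if (put_page_testzero(page))	/* drops the LRU's reference */
			__free_pages_ok(page, 0);
		continue;
	}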

That way you don't have to have separate functions for releasing
different kinds of pages (we've seen how nasty that was from a
maintainance standpoint already with the "put_page vs
page_cache_release" thing).

Ehh?

Linus

2002-08-26 18:12:38

by Daniel Phillips

[permalink] [raw]
Subject: Re: MM patches against 2.5.31

On Monday 26 August 2002 17:29, Christian Ehrhardt wrote:
> On Mon, Aug 26, 2002 at 04:22:50PM +0200, Daniel Phillips wrote:
> > On Monday 26 August 2002 11:10, Christian Ehrhardt wrote:
> > > + * A special Problem is the lru lists. Presence on one of these lists
> > > + * does not increase the page count.
> >
> > Please remind me... why should it not?
>
> Pages that are only on the lru but not reference by anyone are of no
> use and we want to free them immediatly. If we leave them on the lru
> list with a page count of 1, someone else will have to walk the lru
> list and remove pages that are only on the lru.

I don't understand this argument. Suppose lru list membership is worth a
page count of one. Then anyone who finds a page by way of the lru list can
safely put_page_testzero and remove the page from the lru list. Anyone who
finds a page by way of a page table can likewise put_page_testzero and clear
the pte, or remove the mapping and pass the page to Andrew's pagevec
machinery, which will eventually do the put_page_testzero. Anyone who
removes a page from a radix tree will also do a put_page_testzero. Exactly
one of those paths will result in the page count reaching zero, which tells
us nobody else holds a reference and it's time for __free_pages_ok. The page
is thus freed immediately as soon as there are no more references to it, and
does not hang around on the lru list.

Nobody has to lock against the page count. Each put_page_testzero caller
only locks the data structure from which it's removing the reference.

This seems so simple, what is the flaw?

--
Daniel

2002-08-26 19:09:27

by Andrew Morton

[permalink] [raw]
Subject: Re: MM patches against 2.5.31

Daniel Phillips wrote:
>
> On Monday 26 August 2002 17:29, Christian Ehrhardt wrote:
> > On Mon, Aug 26, 2002 at 04:22:50PM +0200, Daniel Phillips wrote:
> > > On Monday 26 August 2002 11:10, Christian Ehrhardt wrote:
> > > > + * A special Problem is the lru lists. Presence on one of these lists
> > > > + * does not increase the page count.
> > >
> > > Please remind me... why should it not?
> >
> > Pages that are only on the lru but not reference by anyone are of no
> > use and we want to free them immediatly. If we leave them on the lru
> > list with a page count of 1, someone else will have to walk the lru
> > list and remove pages that are only on the lru.
>
> I don't understand this argument. Suppose lru list membership is worth a
> page count of one. Then anyone who finds a page by way of the lru list can
> safely put_page_testzero and remove the page from the lru list. Anyone who
> finds a page by way of a page table can likewise put_page_testzero and clear
> the pte, or remove the mapping and pass the page to Andrew's pagevec
> machinery, which will eventually do the put_page_testzero. Anyone who
> removes a page from a radix tree will also do a put_page_testzero. Exactly
> one of those paths will result in the page count reaching zero, which tells
> us nobody else holds a reference and it's time for __free_pages_ok. The page
> is thus freed immediately as soon as there are no more references to it, and
> does not hang around on the lru list.
>
> Nobody has to lock against the page count. Each put_page_testzero caller
> only locks the data structure from which it's removing the reference.
>
> This seems so simple, what is the flaw?

The flaw is in doing the put_page_testzero() outside of any locking
which would prevent other CPUs from finding and "rescuing" the zero-refcount
page.

CPUA:
	if (put_page_testzero()) {
		/* Here's the window */
		spin_lock(lru_lock);
		list_del(page->lru);

CPUB:

	spin_lock(lru_lock);
	page = list_entry(lru);
	page_cache_get(page);		/* If this goes from 0->1, we die */
	...
	page_cache_release(page);	/* double free */


2.5.31-mm1 has tests which make this race enormously improbable [1],
but it's still there.

It's that `put' outside the lock which is the culprit. Normally, we
handle that with atomic_dec_and_lock() (inodes) or by manipulating
the refcount inside an area which has exclusion (page presence in
pagecache).

The sane, sensible and sucky way is to always take the lock:

page_cache_release(page)
{
	spin_lock(lru_lock);
	if (put_page_testzero(page)) {
		lru_cache_del(page);
		__free_pages_ok(page, 0);
	}
	spin_unlock(lru_lock);
}

Because this provides exclusion from another CPU discovering the page
via the LRU.

So taking the above as the design principle, how can we speed it up?
How to avoid taking the lock in every page_cache_release()? Maybe:

page_cache_release(page)
{
	if (page_count(page) == 1) {
		spin_lock(lru_lock);
		if (put_page_testzero(page)) {
			if (PageLRU(page))
				__lru_cache_del(page);
			__free_pages_ok(page);
		}
		spin_unlock(lru_lock);
	} else {
		atomic_dec(&page->count);
	}
}

This is nice and quick, but racy. Two concurrent page_cache_releases
will create a zero-ref unfreed page which is on the LRU. These are
rare, and can be mopped up in page reclaim.

The above code will also work for pages which aren't on the LRU. It will
take the lock unnecessarily for (say) slab pages. But if we put slab pages
on the LRU then I suspect there are so few non-LRU pages left that it isn't
worth bothering about this.
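
To flesh out the "mopped up in page reclaim" part: the shrink_cache() walk,
which already holds the lru lock, could free such stragglers when it meets
them. A sketch only, not the current code (2.5.31-mm1 instead puts a
zero-count page back on the inactive list, as the hunk in Christian's patch
shows):

	if (page_count(page) == 0) {
		/* It lost its last reference while still on the LRU;
		 * under this scheme nobody else will free it, so page
		 * reclaim does. */
		if (!TestClearPageLRU(page))
			BUG();
		list_del(&page->lru);
		__free_pages_ok(page, 0);
		continue;
	}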


[1] The race requires that the CPU running page_cache_release find a
five instruction window against the CPU running shrink_cache. And
that they be operating against the same page. And that the CPU
running __page_cache_release() then take an interrupt in a 3-4
instruction window. And that the interrupt take longer than the
runtime for shrink_list. And that the page be the first page in
the pagevec.

It's a heat-death-of-the-universe-race, but even if it were to be
ignored, the current code is too complex.

2002-08-26 19:44:41

by Christian Ehrhardt

[permalink] [raw]
Subject: Re: MM patches against 2.5.31

On Mon, Aug 26, 2002 at 12:24:50PM -0700, Andrew Morton wrote:
> The flaw is in doing the put_page_testzero() outside of any locking

Well, one could argue that doing the put_page_testzero outside of any
locking is a feature.

> [ ... ]
>
> 2.5.31-mm1 has tests which make this race enormously improbable [1],
> but it's still there.

Agreed. Both on the improbable and on the still there part.

> It's that `put' outside the lock which is the culprit. Normally, we
> handle that with atomic_dec_and_lock() (inodes) or by manipulating
> the refcount inside an area which has exclusion (page presence in
> pagecache).
>
> The sane, sensible and sucky way is to always take the lock:
>
> page_cache_release(page)
> {
> spin_lock(lru_lock);
> if (put_page_testzero(page)) {
> lru_cache_del(page);
> __free_pages_ok(page, 0);
> }
> spin_unlock(lru_lock);
> }

That would probably solve the problem.

> Because this provides exclusion from another CPU discovering the page
> via the LRU.
>
> So taking the above as the design principle, how can we speed it up?
> How to avoid taking the lock in every page_cache_release()? Maybe:
>
> page_cache_release(page)
> {
> if (page_count(page) == 1) {
> spin_lock(lru_lock);
> if (put_page_testzero(page)) {
> if (PageLRU(page))
> __lru_cache_del(page);
> __free_pages_ok(page);
> }
> spin_unlock(lru_lock);
> } else {
> atomic_dec(&page->count);
> }
> }

However, this is an incredibly bad idea if the page is NOT on the lru.
Think of two instances of page_cache_release racing against each other.
This could result in a leaked page which is not on the LRU.

> This is nice and quick, but racy. Two concurrent page_cache_releases
> will create a zero-ref unfreed page which is on the LRU. These are
> rare, and can be mopped up in page reclaim.
>
> The above code will also work for pages which aren't on the LRU. It will
> take the lock unnecessarily for (say) slab pages. But if we put slab pages
> on the LRU then I suspect there are so few non-LRU pages left that it isn't
> worth bothering about this.

No it will not work. See above.

> [1] The race requires that the CPU running page_cache_release find a
> five instruction window against the CPU running shrink_cache. And
> that they be operating against the same page. And that the CPU
> running __page_cache_release() then take an interrupt in a 3-4
> instruction window. And that the interrupt take longer than the
> runtime for shrink_list. And that the page be the first page in
> the pagevec.

The interrupt can also be a preemption which might easily take long
enough. But I agree that the race is now rare. The real problem is
that the locking rules don't guarantee that there are no other racy
paths that we both missed.

regards Christian

--
THAT'S ALL FOLKS!

2002-08-26 19:30:56

by Rik van Riel

[permalink] [raw]
Subject: Re: MM patches against 2.5.31

On Mon, 26 Aug 2002, Linus Torvalds wrote:

> If you do this, then I would personally suggest a conceptually different
> approach: make the LRU list count towards the page count. That will
> _automatically_ result in what you describe - if a page is on the LRU
> list, then "freeing" it will always just decrement the count, and the
> _real_ free comes from walking the LRU list and considering count==1 to
> be trivially freeable.

We can turn these into per-cpu "garbage collect" LRUs; if
we're holding the lock anyway when we decrement the count,
we can move the page to a place where it can be found
easily.

regards,

Rik
--
Bravely reimplemented by the knights who say "NIH".

http://www.surriel.com/ http://distro.conectiva.com/

2002-08-26 19:50:07

by Daniel Phillips

[permalink] [raw]
Subject: Re: MM patches against 2.5.31

On Monday 26 August 2002 21:24, Andrew Morton wrote:
> Daniel Phillips wrote:
> > On Monday 26 August 2002 17:29, Christian Ehrhardt wrote:
> > > On Mon, Aug 26, 2002 at 04:22:50PM +0200, Daniel Phillips wrote:
> > > > On Monday 26 August 2002 11:10, Christian Ehrhardt wrote:
> > > > > + * A special Problem is the lru lists. Presence on one of these lists
> > > > > + * does not increase the page count.
> > > >
> > > > Please remind me... why should it not?
> > >
> > > Pages that are only on the lru but not reference by anyone are of no
> > > use and we want to free them immediatly. If we leave them on the lru
> > > list with a page count of 1, someone else will have to walk the lru
> > > list and remove pages that are only on the lru.
> >
> > I don't understand this argument. Suppose lru list membership is worth a
> > page count of one. Then anyone who finds a page by way of the lru list can
> > safely put_page_testzero and remove the page from the lru list. Anyone who
> > finds a page by way of a page table can likewise put_page_testzero and clear
> > the pte, or remove the mapping and pass the page to Andrew's pagevec
> > machinery, which will eventually do the put_page_testzero. Anyone who
> > removes a page from a radix tree will also do a put_page_testzero. Exactly
> > one of those paths will result in the page count reaching zero, which tells
> > us nobody else holds a reference and it's time for __free_pages_ok. The page
> > is thus freed immediately as soon as there are no more references to it, and
> > does not hang around on the lru list.
> >
> > Nobody has to lock against the page count. Each put_page_testzero caller
> > only locks the data structure from which it's removing the reference.
> >
> > This seems so simple, what is the flaw?
>
> The flaw is in doing the put_page_testzero() outside of any locking
> which would prevent other CPUs from finding and "rescuing" the zero-recount
> page.
>
> CPUA:
> if (put_page_testzero()) {
> /* Here's the window */
> spin_lock(lru_lock);
> list_del(page->lru);

According to my assumption that lru list membership is (should be) worth one
page count, if testzero triggers here the page is not on the lru.

> CPUB:
>
> spin_lock(lru_lock);
> page = list_entry(lru);
> page_cache_get(page); /* If this goes from 0->1, we die */

It can't. You know that because you found the page on the lru, its count
must be at least one (again, according to the assumption above).

> ...
> page_cache_release(page); /* double free */

I'd like to jump in and chase more solutions with you, but the above doesn't
prove your point, so I'm not ready to reject this one yet.

--
Daniel

2002-08-26 19:56:55

by Christian Ehrhardt

[permalink] [raw]
Subject: Re: MM patches against 2.5.31

On Mon, Aug 26, 2002 at 07:56:52PM +0200, Daniel Phillips wrote:
> On Monday 26 August 2002 17:29, Christian Ehrhardt wrote:
> > On Mon, Aug 26, 2002 at 04:22:50PM +0200, Daniel Phillips wrote:
> > > On Monday 26 August 2002 11:10, Christian Ehrhardt wrote:
> > > > + * A special Problem is the lru lists. Presence on one of these lists
> > > > + * does not increase the page count.
> > >
> > > Please remind me... why should it not?
> >
> > Pages that are only on the lru but not reference by anyone are of no
> > use and we want to free them immediatly. If we leave them on the lru
> > list with a page count of 1, someone else will have to walk the lru
> > list and remove pages that are only on the lru.
>
> I don't understand this argument. Suppose lru list membership is worth a
> page count of one. Then anyone who finds a page by way of the lru list can

This does fix the double free problem but think of a typical anonymous
page at exit. The page is on the lru list and there is one reference held
by the pte. According to your scheme the pte reference would be freed
(obviously due to the exit) but the page would remain on the lru list.
However, there is no point in leaving the page on the lru list at all.

If you think about who is going to remove the page from the lru you'll
see the problem.

regards Christian

--
THAT'S ALL FOLKS!

2002-08-26 20:25:19

by Daniel Phillips

[permalink] [raw]
Subject: Re: MM patches against 2.5.31

On Monday 26 August 2002 22:00, Christian Ehrhardt wrote:
> On Mon, Aug 26, 2002 at 07:56:52PM +0200, Daniel Phillips wrote:
> > On Monday 26 August 2002 17:29, Christian Ehrhardt wrote:
> > > On Mon, Aug 26, 2002 at 04:22:50PM +0200, Daniel Phillips wrote:
> > > > On Monday 26 August 2002 11:10, Christian Ehrhardt wrote:
> > > > > + * A special Problem is the lru lists. Presence on one of these lists
> > > > > + * does not increase the page count.
> > > >
> > > > Please remind me... why should it not?
> > >
> > > Pages that are only on the lru but not reference by anyone are of no
> > > use and we want to free them immediatly. If we leave them on the lru
> > > list with a page count of 1, someone else will have to walk the lru
> > > list and remove pages that are only on the lru.
> >
> > I don't understand this argument. Suppose lru list membership is worth a
> > page count of one. Then anyone who finds a page by way of the lru list can
>
> This does fix the double free problem but think of a typical anonymous
> page at exit. The page is on the lru list and there is one reference held
> by the pte. According to your scheme the pte reference would be freed
> (obviously due to the exit) but the page would remain on the lru list.
> However, there is no point in leaving the page on the lru list at all.

If you want the page off the lru list at that point (which you probably do)
then you take the lru lock and put_page_testzero.

> If you think about who is going to remove the page from the lru you'll
> see the problem.

Nope, still don't see it. Whoever hits put_page_testzero frees the page,
secure in the knowledge that there are no other references to it.

--
Daniel

2002-08-26 20:54:42

by Christian Ehrhardt

[permalink] [raw]
Subject: Re: MM patches against 2.5.31

On Mon, Aug 26, 2002 at 10:09:38PM +0200, Daniel Phillips wrote:
> On Monday 26 August 2002 22:00, Christian Ehrhardt wrote:
> > On Mon, Aug 26, 2002 at 07:56:52PM +0200, Daniel Phillips wrote:
> > > On Monday 26 August 2002 17:29, Christian Ehrhardt wrote:
> > > > On Mon, Aug 26, 2002 at 04:22:50PM +0200, Daniel Phillips wrote:
> > > > > On Monday 26 August 2002 11:10, Christian Ehrhardt wrote:
> > > > > > + * A special Problem is the lru lists. Presence on one of these lists
> > > > > > + * does not increase the page count.
> > > > >
> > > > > Please remind me... why should it not?
> > > >
> > > > Pages that are only on the lru but not reference by anyone are of no
> > > > use and we want to free them immediatly. If we leave them on the lru
> > > > list with a page count of 1, someone else will have to walk the lru
> > > > list and remove pages that are only on the lru.
> > >
> > > I don't understand this argument. Suppose lru list membership is worth a
> > > page count of one. Then anyone who finds a page by way of the lru list can
> >
> > This does fix the double free problem but think of a typical anonymous
> > page at exit. The page is on the lru list and there is one reference held
> > by the pte. According to your scheme the pte reference would be freed
> > (obviously due to the exit) but the page would remain on the lru list.
> > However, there is no point in leaving the page on the lru list at all.
>
> If you want the page off the lru list at that point (which you probably do)
> then you take the lru lock and put_page_testzero.

Could you clarify what you mean by "at that point"? Especially, how
do you plan to test for "that point"? Besides, it is illegal to use
the page after put_page_testzero (unless put_page_testzero returns true).

> > If you think about who is going to remove the page from the lru you'll
> > see the problem.
>
> Nope, still don't see it. Whoever hits put_page_testzero frees the page,
> secure in the knowlege that there are no other references to it.

Well yes, but we cannot remove the page from the lru automatically
at page_cache_release time if we follow your proposal. If you think we can,
show me your implementation of page_cache_release and I'll show
you where the races are (unless you do everything under the lru_lock
of course).

regards Christian

--
THAT'S ALL FOLKS!

2002-08-26 21:16:32

by Andrew Morton

[permalink] [raw]
Subject: Re: MM patches against 2.5.31

Daniel Phillips wrote:
>
> ...
> > If you think about who is going to remove the page from the lru you'll
> > see the problem.
>
> Nope, still don't see it. Whoever hits put_page_testzero frees the page,
> secure in the knowlege that there are no other references to it.

Sure. But this requires that the caller of page_cache_release() has
previously removed the page from the LRU. We (used to) do that for truncate
and page reclaim. But we did not do that for anon pages.

For anon pages, we perform magical LRU removal when the page refcount
goes to zero.

The fact that we performed explicit removal in one place, and magical removal
in the other was unfortunate. I nuked the explicit removal and made it
all magical (explicit removal in truncate_complete_page() was wrong anyway - the
page could have been rescued and anonymised by concurrent pagefault and must
stay on the LRU).

Possibly, we could go back to explicit removal everywhere. Haven't
really looked at that, but I suspect we're back to a similar problem:
how do you unracily determine whether the page should be removed from
the LRU? Take ->page_table_lock and look at page_count(page)? Worried.

I like the magical-removal-just-before-free, and my gut feel is that
it'll provide a cleaner end result.

Making presence on the LRU contribute to page->count is attractive,
if only because it removes some irritating and expensive page_cache_gets
and puts from shrink_cache and refill_inactive. But for it to be useful,
we must perform explicit removal everywhere.

Making presence on the LRU contribute to page->count doesn't fundamentally
change anything of course - it offsets the current problems by one.

Then again, it would remove all page_cache_gets/releases from vmscan.c
and may thus make the race go away. That's a bit of a timebomb though.

2002-08-26 22:08:05

by Ed Tomlinson

[permalink] [raw]
Subject: Re: MM patches against 2.5.31

This seems to have been missed:

Linus Torvalds wrote:

> In article <[email protected]>,
> Andrew Morton <[email protected]> wrote:
>>
>>What I'm inclined to do there is to change __page_cache_release()
>>to not attempt to free the page at all. Just let it sit on the
>>LRU until page reclaim encounters it. With the anon-free-via-pagevec
>>patch, very, very, very few pages actually get their final release in
>>__page_cache_release() - zero on uniprocessor, I expect.
>
> If you do this, then I would personally suggest a conceptually different
> approach: make the LRU list count towards the page count. That will
> _automatically_ result in what you describe - if a page is on the LRU
> list, then "freeing" it will always just decrement the count, and the
> _real_ free comes from walking the LRU list and considering count==1 to
> be trivially freeable.
>
> That way you don't have to have separate functions for releasing
> different kinds of pages (we've seen how nasty that was from a
> maintainance standpoint already with the "put_page vs
> page_cache_release" thing).
>
> Ehh?

If every structure locks before removing its reference (i.e. before testing
and/or removing an lru reference we take zone->lru_lock, for slabs we take
cachep->spinlock, etc.), this works. It's a bit of an audit task to make
sure the various locks are taken (and documented), though.

By letting the actual free be lazy, as Linus suggests, things should simplify nicely.

comments,
Ed Tomlinson

2002-08-26 23:43:25

by Andrew Morton

[permalink] [raw]
Subject: Re: MM patches against 2.5.31

Ed Tomlinson wrote:
>
> This seems to have been missed:

Still thinking about it.

> Linus Torvalds wrote:
>
> > In article <[email protected]>,
> > Andrew Morton <[email protected]> wrote:
> >>
> >>What I'm inclined to do there is to change __page_cache_release()
> >>to not attempt to free the page at all. Just let it sit on the
> >>LRU until page reclaim encounters it. With the anon-free-via-pagevec
> >>patch, very, very, very few pages actually get their final release in
> >>__page_cache_release() - zero on uniprocessor, I expect.
> >
> > If you do this, then I would personally suggest a conceptually different
> > approach: make the LRU list count towards the page count. That will
> > _automatically_ result in what you describe - if a page is on the LRU
> > list, then "freeing" it will always just decrement the count, and the
> > _real_ free comes from walking the LRU list and considering count==1 to
> > be trivially freeable.
> >
> > That way you don't have to have separate functions for releasing
> > different kinds of pages (we've seen how nasty that was from a
> > maintainance standpoint already with the "put_page vs
> > page_cache_release" thing).
> >
> > Ehh?
>
> If every structure locks before removing its reference (ie before testing and/or
> removing a lru reference we take zone->lru_lock, for slabs take cachep->spinlock
> etc) Its a bit of an audit task to make sure the various locks are taken (and
> documented) though.
>
> By leting the actual free be lazy as Linus suggests things should simplify nicely.

Well we wouldn't want to leave tons of free pages on the LRU - the
VM would needlessly reclaim pagecache before finding the free pages. And
higher-order page allocations could suffer.

If we go for explicit lru removal in truncate and zap_pte_range
then this approach may be best. Still thinking about it.

2002-08-27 00:09:55

by Rik van Riel

[permalink] [raw]
Subject: Re: MM patches against 2.5.31

On Mon, 26 Aug 2002, Andrew Morton wrote:

> Well we wouldn't want to leave tons of free pages on the LRU - the VM
> would needlessly reclaim pagecache before finding the free pages. And
> higher-order page allocations could suffer.

We did this with the swap cache in 2.4.<7 and it was an
absolute disaster.

regards,

Rik
--
Bravely reimplemented by the knights who say "NIH".

http://www.surriel.com/ http://distro.conectiva.com/

2002-08-27 03:38:14

by Benjamin LaHaise

[permalink] [raw]
Subject: Re: MM patches against 2.5.31

On Mon, Aug 26, 2002 at 02:31:57PM -0700, Andrew Morton wrote:
> I like the magical-removal-just-before-free, and my gut feel is that
> it'll provide a cleaner end result.

For the record, I'd rather see explicit removal everywhere. We received
a number of complaints along the lines of "I run my app immediately after
system startup, and it's fast, but the second time it's slower" due to
the lazy page reclaim in early 2.4. Until there's a way to make LRU
scanning faster than page allocation, it can't be lazy.

-ben

2002-08-27 04:22:12

by Andrew Morton

[permalink] [raw]
Subject: Re: MM patches against 2.5.31

Benjamin LaHaise wrote:
>
> On Mon, Aug 26, 2002 at 02:31:57PM -0700, Andrew Morton wrote:
> > I like the magical-removal-just-before-free, and my gut feel is that
> > it'll provide a cleaner end result.
>
> For the record, I'd rather see explicite removal everwhere. We received
> a number of complaints along the lines of "I run my app immediately after
> system startup, and it's fast, but the second time it's slower" due to
> the lazy page reclaim in early 2.4. Until there's a way to make LRU
> scanning faster than page allocation, it can't be lazy.
>

I think that's what Rik was referring to.

But here, "explicit removal" refers to running lru_cache_del() prior
to the final put_page, rather than within the context of the final
put_page. So it's a different thing.

2002-08-27 09:18:03

by Christian Ehrhardt

[permalink] [raw]
Subject: Re: MM patches against 2.5.31

On Mon, Aug 26, 2002 at 12:24:50PM -0700, Andrew Morton wrote:
> The flaw is in doing the put_page_testzero() outside of any locking
> which would prevent other CPUs from finding and "rescuing" the zero-recount
> page.
>
> CPUA:
> if (put_page_testzero()) {
> /* Here's the window */
> spin_lock(lru_lock);
> list_del(page->lru);
>
> CPUB:
>
> spin_lock(lru_lock);
> page = list_entry(lru);
> page_cache_get(page); /* If this goes from 0->1, we die */
> ...
> page_cache_release(page); /* double free */

So what we want CPU B to do instead is

	spin_lock(lru_lock);
	page = list_entry(lru);

	START ATOMIC
	page_cache_get(page);
	res = (page_count (page) == 1);
	END ATOMIC

	if (res) {
		atomic_dec (&page->count);
		continue;	/* with next page */
	}
	...
	page_cache_release (page);

I.e. we want to detect _atomically_ that we just raised the page count
from zero to one. My patch actually has a solution that implements the
needed atomic operation above by means of the atomic functions that we
currently have on all archs (it's called get_page_testzero and
should probably be called get_page_testone).
The more I think about this the more I think this is the way to go.
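
For reference, with the -1-biased page count from the earlier patch the
START/END ATOMIC block above collapses into one existing primitive:
atomic_inc_and_test() returns true exactly when the increment made the
stored value 0, i.e. when the real count just went from 0 to 1. Sketch
only (get_page_testone is the suggested rename of the patch's
get_page_testzero):

	#define get_page_testone(p)	atomic_inc_and_test(&(p)->count)

	/* CPU B's loop body, under the lru lock, then becomes: */
	if (get_page_testone(page)) {
		/* we resurrected a dead page: back off, the thread
		 * that brought the count to zero will free it */
		__put_page(page);
		continue;	/* with next page */
	}
	...
	page_cache_release(page);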

regards Christian

--
THAT'S ALL FOLKS!

2002-08-27 17:05:22

by Daniel Phillips

[permalink] [raw]
Subject: Re: MM patches against 2.5.31

On Monday 26 August 2002 22:58, Christian Ehrhardt wrote:
> On Mon, Aug 26, 2002 at 10:09:38PM +0200, Daniel Phillips wrote:
> > On Monday 26 August 2002 22:00, Christian Ehrhardt wrote:
> > > On Mon, Aug 26, 2002 at 07:56:52PM +0200, Daniel Phillips wrote:
> > > > On Monday 26 August 2002 17:29, Christian Ehrhardt wrote:
> > > > > On Mon, Aug 26, 2002 at 04:22:50PM +0200, Daniel Phillips wrote:
> > > > > > On Monday 26 August 2002 11:10, Christian Ehrhardt wrote:
> > > > > > > + * A special Problem is the lru lists. Presence on one of these lists
> > > > > > > + * does not increase the page count.
> > > > > >
> > > > > > Please remind me... why should it not?
> > > > >
> > > > > Pages that are only on the lru but not referenced by anyone are of no
> > > > > use and we want to free them immediately. If we leave them on the lru
> > > > > list with a page count of 1, someone else will have to walk the lru
> > > > > list and remove pages that are only on the lru.
> > > >
> > > > I don't understand this argument. Suppose lru list membership is worth a
> > > > page count of one. Then anyone who finds a page by way of the lru list can
> > >
> > > This does fix the double free problem but think of a typical anonymous
> > > page at exit. The page is on the lru list and there is one reference held
> > > by the pte. According to your scheme the pte reference would be freed
> > > (obviously due to the exit) but the page would remain on the lru list.
> > > However, there is no point in leaving the page on the lru list at all.
> >
> > If you want the page off the lru list at that point (which you probably do)
> > then you take the lru lock and put_page_testzero.
>
> Could you clarify what you mean with "at that point"? Especially how
> do you plan to test for "this point". Besides it is illegal to use
> the page after put_page_testzero (unless put_page_testzero returns true).

> > > If you think about who is going to remove the page from the lru you'll
> > > see the problem.
> >
> > Nope, still don't see it. Whoever hits put_page_testzero frees the page,
> > secure in the knowledge that there are no other references to it.
>
> Well yes, but we cannot remove the page from the lru atomically
> at page_cache_release time if we follow your proposal. If you think we can,
> show me your implementation of page_cache_release and I'll show
> you where the races are (unless you do everything under the lru_lock
> of course).

void page_cache_release(struct page *page)
{
	spin_lock(&pagemap_lru_lock);
	if (PageLRU(page) && page_count(page) == 2) {
		__lru_cache_del(page);
		atomic_dec(&page->count);
	}
	spin_unlock(&pagemap_lru_lock);
	if (put_page_testzero(page))
		__free_pages_ok(page, 0);
}

This allows the following benign race, with initial page count = 3:

	spin_lock(&pagemap_lru_lock);
	if (PageLRU(page) && page_count(page) == 2)	/* false */
	spin_unlock(&pagemap_lru_lock);
				spin_lock(&pagemap_lru_lock);
				if (PageLRU(page) && page_count(page) == 2)	/* false */
				spin_unlock(&pagemap_lru_lock);
	if (put_page_testzero(page))
		__free_pages_ok(page, 0);
				if (put_page_testzero(page))
					__free_pages_ok(page, 0);

Neither holder of a page reference sees the count at 2, and so the page
is left on the lru with count = 1. This won't happen often and such
pages will be recovered from the cold end of the list in due course.

The important question is: can this code ever remove a page from the lru
erroneously, leaving somebody holding a reference to a non-lru page? In
other words, can the test PageLRU(page) && page_count(page) == 2 return
a false positive? Well, when this test is true we can account for both
references: the one we own, and the one the lru list owns. Since
we hold the lru lock, the latter won't change. Nobody else has the
right to increment the page count, since they must inherit that right
from somebody who holds a reference, and there are none.

We could also do this:

void page_cache_release(struct page *page)
{
	if (page_count(page) == 2) {
		spin_lock(&pagemap_lru_lock);
		if (PageLRU(page) && page_count(page) == 2) {
			__lru_cache_del(page);
			atomic_dec(&page->count);
		}
		spin_unlock(&pagemap_lru_lock);
	}
	if (put_page_testzero(page))
		__free_pages_ok(page, 0);
}

Which avoids taking the lru lock sometimes in exchange for widening the
hole through which pages can end up with count = 1 on the lru list.

Let's run this through your race detector and see what happens.

--
Daniel

2002-08-27 19:16:57

by Andrew Morton

[permalink] [raw]
Subject: Re: MM patches against 2.5.31

Christian Ehrhardt wrote:
>
> ...
> So what we want CPUB to do instead is
>
> spin_lock(lru_lock);
> page = list_entry(lru)
>
> START ATOMIC
> page_cache_get(page);
> res = (page_count (page) == 1)
> END ATOMIC
>
> if (res) {
> atomic_dec (&page->count);
> continue; /* with next page */
> }
> ...
> page_cache_release (page);
>
> I.e. we want to detect _atomically_ that we just raised the page count
> from zero to one. My patch actually has a solution that implements the
> needed atomic operation above by means of the atomic functions that we
> currently have on all archs (it's called get_page_testzero and
> should probably be called get_page_testone).
> The more I think about this the more I think this is the way to go.
>

Yes, I think that would provide a minimal fix to the problem.
(I'd prefer a solution in which presence on the LRU contributes
to page->count, because that means I can dump a load of expensive
page_cache_get-inside-lru-lock instances, but whatever)

You had:

-#define put_page_testzero(p) atomic_dec_and_test(&(p)->count)
-#define page_count(p) atomic_read(&(p)->count)
-#define set_page_count(p,v) atomic_set(&(p)->count, v)
+#define put_page_testzero(p) atomic_add_negative(-1, &(p)->count)
+#define page_count(p) (1+atomic_read(&(p)->count))
+#define set_page_count(p,v) atomic_set(&(p)->count, v-1)
+#define get_page_testzero(p) atomic_inc_and_test(&(p)->count)

So the page count is actually offset by -1, and that is hidden by
the macros. Fair enough.
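
Worked through with the offset representation, following the macros just
quoted (values picked for illustration only):

	true count 2 -> stored 1:  page_count() returns 1 + 1 = 2
	true count 1 -> stored 0:  put_page_testzero() adds -1, giving -1,
				   which is negative, so the caller knows it
				   just dropped the last reference
	true count 0 -> stored -1: get_page_testzero() increments to 0 and
				   atomic_inc_and_test() returns true, i.e.
				   we detected the 0 -> 1 "rescue" and can
				   back out instead of double-freeing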

atomic_add_negative() is not implemented on quite a number of
architectures (sparc64, mips, ppc, sh, cris, 68k, alpha..), so
some legwork is needed there. Looks to be pretty simple though;
alpha, ppc and others already have atomic_add_return().
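
For the architectures that only provide atomic_add_return(), the missing
primitive presumably reduces to something like this (a sketch, not a patch
from this thread):

	/* generic fallback: true if the result of the addition is negative */
	static inline int atomic_add_negative(int i, atomic_t *v)
	{
		return atomic_add_return(i, v) < 0;
	}

with put_page_testzero() then expanding to atomic_add_negative(-1,
&(p)->count) exactly as in the quoted diff.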

2002-08-28 13:10:28

by Christian Ehrhardt

[permalink] [raw]
Subject: Re: MM patches against 2.5.31

On Tue, Aug 27, 2002 at 06:48:50PM +0200, Daniel Phillips wrote:
> On Monday 26 August 2002 22:58, Christian Ehrhardt wrote:
> > > Nope, still don't see it. Whoever hits put_page_testzero frees the page,
> > > secure in the knowlege that there are no other references to it.
> >
> > Well yes, but we cannot remove the page from the lru atomically
> > at page_cache_release time if we follow your proposal. If you think we can,
> > show me your implementation of page_cache_release and I'll show
> > you where the races are (unless you do everything under the lru_lock
> > of course).
>
> void page_cache_release(struct page *page)
> {
> spin_lock(&pagemap_lru_lock);
> if (PageLRU(page) && page_count(page) == 2) {
> __lru_cache_del(page);
> atomic_dec(&page->count);
> }
> spin_unlock(&pagemap_lru_lock);
> if (put_page_testzero(page))
> __free_pages_ok(page, 0);
> }
>
> This allows the following benign race, with initial page count = 3:
> [ ...]
> Neither holder of a page reference sees the count at 2, and so the page
> is left on the lru with count = 1. This won't happen often and such
> pages will be recovered from the cold end of the list in due course.

Ok, agreed. I think this will work but taking the lru lock each time
is probably not a good idea.

> We could also do this:
>
> void page_cache_release(struct page *page)
> {
> if (page_count(page) == 2) {
> spin_lock(&pagemap_lru_lock);
> if (PageLRU(page) && page_count(page) == 2) {
> __lru_cache_del(page);
> atomic_dec(&page->count);
> }
> spin_unlock(&pagemap_lru_lock);
> }
> if (put_page_testzero(page))
> __free_pages_ok(page, 0);
> }
>
> Which avoids taking the lru lock sometimes in exchange for widening the
> hole through which pages can end up with count = 1 on the lru list.

This sounds like something that is worth trying. I missed that one.


Side note: The BUG in __pagevec_lru_del seems strange. refill_inactive
or shrink_cache could have removed the page from the lru before
__pagevec_lru_del acquired the lru lock.

regards Christian

--
THAT'S ALL FOLKS!

2002-08-28 17:13:15

by Daniel Phillips

[permalink] [raw]
Subject: Re: MM patches against 2.5.31

On Wednesday 28 August 2002 15:14, Christian Ehrhardt wrote:
> Side note: The BUG in __pagevec_lru_del seems strange. refill_inactive
> or shrink_cache could have removed the page from the lru before
> __pagevec_lru_del acquired the lru lock.

It's suspect all right. If there's a chain of assumptions that proves
the page is always on the lru at that point, I haven't seen it yet.

--
Daniel

2002-08-28 17:27:03

by Andrew Morton

[permalink] [raw]
Subject: Re: MM patches against 2.5.31

Daniel Phillips wrote:
>
> On Wednesday 28 August 2002 15:14, Christian Ehrhardt wrote:
> > Side note: The BUG in __pagevec_lru_del seems strange. refill_inactive
> > or shrink_cache could have removed the page from the lru before
> > __pagevec_lru_del acquired the lru lock.
>
> It's suspect all right. If there's a chain of assumptions that proves
> the page is always on the lru at that point, I haven't seen it yet.

Yeah. __pagevec_lru_del is only used by invalidate_inode_pages.
A very simple solution is to just delete it.

untested code:

include/linux/pagevec.h | 7 -------
mm/filemap.c | 10 +++++-----
mm/swap.c | 28 ----------------------------
3 files changed, 5 insertions(+), 40 deletions(-)

--- 2.5.32/mm/filemap.c~pagevec_lru_del Wed Aug 28 09:51:51 2002
+++ 2.5.32-akpm/mm/filemap.c Wed Aug 28 09:51:51 2002
@@ -116,10 +116,10 @@ void invalidate_inode_pages(struct inode
struct list_head *head, *curr;
struct page * page;
struct address_space *mapping = inode->i_mapping;
- struct pagevec lru_pvec;
+ struct pagevec pvec;

head = &mapping->clean_pages;
- pagevec_init(&lru_pvec);
+ pagevec_init(&pvec);
write_lock(&mapping->page_lock);
curr = head->next;

@@ -143,8 +143,8 @@ void invalidate_inode_pages(struct inode

__remove_from_page_cache(page);
unlock_page(page);
- if (!pagevec_add(&lru_pvec, page))
- __pagevec_lru_del(&lru_pvec);
+ if (!pagevec_add(&pvec, page))
+ __pagevec_release(&pvec);
continue;
unlock:
unlock_page(page);
@@ -152,7 +152,7 @@ unlock:
}

write_unlock(&mapping->page_lock);
- pagevec_lru_del(&lru_pvec);
+ pagevec_release(&pvec);
}

static int do_invalidatepage(struct page *page, unsigned long offset)
--- 2.5.32/include/linux/pagevec.h~pagevec_lru_del Wed Aug 28 09:51:51 2002
+++ 2.5.32-akpm/include/linux/pagevec.h Wed Aug 28 09:51:51 2002
@@ -18,7 +18,6 @@ void __pagevec_release(struct pagevec *p
void __pagevec_release_nonlru(struct pagevec *pvec);
void __pagevec_free(struct pagevec *pvec);
void __pagevec_lru_add(struct pagevec *pvec);
-void __pagevec_lru_del(struct pagevec *pvec);
void lru_add_drain(void);
void pagevec_deactivate_inactive(struct pagevec *pvec);

@@ -69,9 +68,3 @@ static inline void pagevec_lru_add(struc
if (pagevec_count(pvec))
__pagevec_lru_add(pvec);
}
-
-static inline void pagevec_lru_del(struct pagevec *pvec)
-{
- if (pagevec_count(pvec))
- __pagevec_lru_del(pvec);
-}
--- 2.5.32/mm/swap.c~pagevec_lru_del Wed Aug 28 09:51:51 2002
+++ 2.5.32-akpm/mm/swap.c Wed Aug 28 09:51:58 2002
@@ -214,34 +214,6 @@ void __pagevec_lru_add(struct pagevec *p
}

/*
- * Remove the passed pages from the LRU, then drop the caller's refcount on
- * them. Reinitialises the caller's pagevec.
- */
-void __pagevec_lru_del(struct pagevec *pvec)
-{
- int i;
- struct zone *zone = NULL;
-
- for (i = 0; i < pagevec_count(pvec); i++) {
- struct page *page = pvec->pages[i];
- struct zone *pagezone = page_zone(page);
-
- if (pagezone != zone) {
- if (zone)
- spin_unlock_irq(&zone->lru_lock);
- zone = pagezone;
- spin_lock_irq(&zone->lru_lock);
- }
- if (!TestClearPageLRU(page))
- BUG();
- del_page_from_lru(zone, page);
- }
- if (zone)
- spin_unlock_irq(&zone->lru_lock);
- pagevec_release(pvec);
-}
-
-/*
* Perform any setup for the swap system
*/
void __init swap_setup(void)


2002-08-28 20:36:12

by Daniel Phillips

[permalink] [raw]
Subject: Re: MM patches against 2.5.31

Going right back to basics, what do you suppose is wrong with the 2.4
strategy of always doing the lru removal in free_pages_ok?

--
Daniel

2002-08-28 21:01:22

by Andrew Morton

[permalink] [raw]
Subject: Re: MM patches against 2.5.31

Daniel Phillips wrote:
>
> Going right back to basics, what do you suppose is wrong with the 2.4
> strategy of always doing the lru removal in free_pages_ok?

That's equivalent to what we have at present, which is:

	if (put_page_testzero(page)) {
		/* window here */
		lru_cache_del(page);
		__free_pages_ok(page, 0);
	}

versus:

	spin_lock(lru lock);
	page = list_entry(lru, ...);
	if (page_count(page) == 0)
		continue;
	/* window here */
	page_cache_get(page);
	page_cache_release(page);	/* double-free */

2002-08-28 22:20:24

by Daniel Phillips

[permalink] [raw]
Subject: Re: MM patches against 2.5.31

On Wednesday 28 August 2002 23:03, Andrew Morton wrote:
> Daniel Phillips wrote:
> >
> > Going right back to basics, what do you suppose is wrong with the 2.4
> > strategy of always doing the lru removal in free_pages_ok?
>
> That's equivalent to what we have at present, which is:
>
> if (put_page_testzero(page)) {
> /* window here */
> lru_cache_del(page);
> __free_pages_ok(page, 0);
> }
>
> versus:
>
> spin_lock(lru lock);
> page = list_entry(lru, ...);
> if (page_count(page) == 0)
> continue;
> /* window here */
> page_cache_get(page);
> page_cache_release(page); /* double-free */

Indeed it is. In 2.4.19 we have:

(vmscan.c: shrink_cache)			(page_alloc.c: __free_pages)

365	if (unlikely(!page_count(page)))
366		continue;
						444	if (!PageReserved(page) && put_page_testzero(page))
	[many twisty paths, all different]
511	/* effectively free the page here */
512	page_cache_release(page);
						445		__free_pages_ok(page, order);
						[free it again just to make sure]

So there's no question that the race is lurking in 2.4. I noticed several
more paths besides the one above that look suspicious as well. The bottom
line is, 2.4 needs a fix along the lines of my suggestion or Christian's,
something that can actually be proved.

It's a wonder that this problem manifests so rarely in practice.

--
Daniel

2002-08-28 22:37:57

by Andrew Morton

[permalink] [raw]
Subject: Re: MM patches against 2.5.31

Daniel Phillips wrote:
>
> ...
> So there's no question that the race is lurking in 2.4. I noticed several
> more paths besides the one above that look suspicious as well. The bottom
> line is, 2.4 needs a fix along the lines of my suggestion or Christian's,
> something that can actually be proved.
>
> It's a wonder that this problem manifests so rarely in practice.

I sort-of glanced through the 2.4 paths and it appears that in all of the
places where it could do a page_cache_get/release, that would never happen
because of other parts of the page state.

Like: it can't be in pagecache, so we won't run writepage, and
it can't have buffers, so we won't run try_to_release_page().

Of course, I might have missed a path. And, well, generally: ugh.

2002-08-28 23:13:49

by Daniel Phillips

[permalink] [raw]
Subject: Re: MM patches against 2.5.31

On Thursday 29 August 2002 00:39, Andrew Morton wrote:
> Daniel Phillips wrote:
> >
> > ...
> > So there's no question that the race is lurking in 2.4. I noticed several
> > more paths besides the one above that look suspicious as well. The bottom
> > line is, 2.4 needs a fix along the lines of my suggestion or Christian's,
> > something that can actually be proved.
> >
> > It's a wonder that this problem manifests so rarely in practice.
>
> I sort-of glanced through the 2.4 paths and it appears that in all of the
> places where it could do a page_cache_get/release, that would never happen
> because of other parts of the page state.
>
> Like: it can't be in pagecache, so we won't run writepage, and
> it can't have buffers, so we won't run try_to_release_page().
>
> Of course, I might have missed a path. And, well, generally: ugh.

I think it is happening. I just went sifting through the archives
on 'oops' and '2.4'. The first one I found was:

2.4.18-xfs (xfs related?) oops report

which fits the description nicely.

The race I showed actually causes the page->count to go negative, avoiding
a double free on a technicality. That doesn't make me feel much better about
it. Have you got a BUG_ON(!page_count(page)) in put_page_testzero? I think
we might see some action.

--
Daniel

2002-08-30 22:57:10

by Daniel Phillips

[permalink] [raw]
Subject: [RFC] [PATCH] Include LRU in page count

The executive summary is 'Houston, we have a bug'. The long version of this
is that all current linux kernels are now thought to suffer from a rarely
occurring page double-free bug that results from a race between shrink_cache
and page_cache_release, where both attempt to free a page from the lru at
the same time.

Andrew's succinct description of the race is:

	if (put_page_testzero(page)) {
		/* window here */
		lru_cache_del(page);
		__free_pages_ok(page, 0);
	}

versus:

	spin_lock(lru lock);
	page = list_entry(lru, ...);
	if (page_count(page) == 0)
		continue;
	/* window here */
	page_cache_get(page);
	page_cache_release(page);	/* double-free */

This rare race happened to become not so rare in 2.5 recently, and was
analyzed by Christian Ehrhardt, who also proposed a solution based on a new
approach to locking, essentially put_page_testone. We went on to check 2.4
for the race, and while it's quite difficult to prove the race exists there,
after a year or two of plugging holes, it's even more difficult to prove it's
not there. In fact, sifting through recent kernel archives turns up
unexplained bug reports which fit the description of this race, for example:

http://marc.theaimsgroup.com/?l=linux-xfs&m=103055330831850&w=2
(2.4.18-xfs (xfs related?) oops report)

I proposed an alternate solution using the traditional put_page_testzero
primitive, which relies on assigning a page count of one for membership on
the lru list. A slightly racy heuristic is used for efficient lru list
removal. The resulting incarnation of page_cache_release is:

static inline void page_cache_release(struct page *page)
{
	if (page_count(page) == 2 && spin_trylock(&pagemap_lru_lock)) {
		if (PageLRU(page) && page_count(page) == 2)
			__lru_cache_del(page);
		spin_unlock(&pagemap_lru_lock);
	}
	put_page(page);
}

which permits the following benign race, starting with page count of 3:

	if (page_count(page) == 2	/* false */
				if (page_count(page) == 2	/* false */
	put_page(page);
				put_page(page);

Neither holder of a page reference sees the count at 2, and so the page
is left on the lru with count = 1. This won't happen often (I haven't seen
it happen yet) and such pages will be recovered from the cold end of the list
in due course. There are a number of other variations of this race, but they
all have the same effect and are all rare.

The important question is: can this code ever remove a page from the lru
erroneously, leaving somebody holding a reference to a non-lru page? In
other words, can the test PageLRU(page) && page_count(page) == 2 return a
false positive? Well, when this test is true we can account for both
references: the one we own, and the one the lru list owns. Since
we hold the lru lock, the latter won't change. Nobody else has the right to
increment the page count, since they must inherit that right from somebody
who holds a reference, and there are none.

The spin_trylock is efficient - no contention ever, just some cache
pingponging at worst - and permits page_cache_release to be used even by a
process that holds the lru lock.
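
For instance, a caller that already holds the lru lock (shrink_cache being
the obvious one) simply fails the trylock and falls through to put_page();
at worst the page briefly becomes an orphan on the lru with count 1, to be
reaped later. A sketch of that situation (illustrative only):

	spin_lock(&pagemap_lru_lock);
	...
	/* the trylock inside page_cache_release fails: no deadlock and no
	 * recursion on the lru lock; if this was the last real reference
	 * the page just stays on the lru until shrink_cache finds it */
	page_cache_release(page);
	...
	spin_unlock(&pagemap_lru_lock);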

The attached patch against 2.4.19 implements these ideas, along with some
proc statistics to help verify that the intended effect is being achieved.
The patch supports both Marcelo's current design (hopefully with no semantic
changes at all) and the new lru counting strategy, selected by setting
LRU_PLUS_CACHE to one to get the existing strategy or two to get the new one.

With the patch, there remain only three instances of __lru_cache_del: the one
in page_cache_release, the one in shrink_cache and the one in truncate. The
latter looks suspicious since it's not clear why anonymous pages can't be
allowed to appear on the lru. However, that is not the focus of the current
effort. The atomic tests for lru membership appear to be obsoleted by the
new regimen, since we are protected in every case by the lru lock, but I did
not disturb them.

Due to laziness, the correctness of the swapcache logic in swapfile.c has not
yet been verified. I've done some light testing on my 2GB 2x1GHz PIII box,
using the following pair of scripts inherited from akpm:

file: doitlots
-----------------------
#!/bin/sh

doit()
{
	( cat $1 | wc -l )
}

count=0

while [ $count != 500 ]
do
	doit doitlots > /dev/null

	count=$(expr $count + 1)
done
echo done
-----------------------

file: forklots
-----------------------
echo >foocount
./doitlots >>foocount &
./doitlots >>foocount &
./doitlots >>foocount &
./doitlots >>foocount &
./doitlots >>foocount &
./doitlots >>foocount &
./doitlots >>foocount &
./doitlots >>foocount &
./doitlots >>foocount &

count=0

while [ $count != 10 ]
do
	count=$(wc foocount | cut -b -8)
done
-----------------------

The patch computes the following statistics and makes them available via
/proc/mmstats:

page_cache_release ... Total calls to page_cache_release
page_cache_release_test_lru ... How often the lru heuristic triggers
page_cache_release_del_lru ... How often the heuristic is correct
shrink_cache_del_lru ... How often an orphan is left on the lru
put_page_testzero ... Total page releases
lru_cache_del ... Total lru deletes
truncate_lru_cache_del ... Total lru deletes during truncate

Now the benchmark results...

LRU_PLUS_CACHE = 2, with LRU_DEL_HEURISTIC:

time sh forklots
24.81user 28.15system 0:26.50elapsed 199%CPU

cat /proc/mmstats
page_cache_release 8555807
page_cache_release_test_lru 771550
page_cache_release_del_lru 771535
shrink_cache_del_lru 0
put_page_testzero 9209139
lru_cache_del 771547
truncate_lru_cache_del 12

According to the statistics, we took the lru lock to see if the page is
really on the lru about 9% of the time. In just .002% (15) of those cases
the heuristic was wrong, the lock was taken and no page removed from the lru.
No orphan pages were ever found on the lru. In this case, all the
lru_cache_dels are accounted for by the 771,535 that were removed via the
page_cache_release heuristic, and the 12 that were converted to anonymous
pages via truncate.
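
For reference, the quoted percentages follow directly from the counters
above:

	771550 / 8555807 ~= 9.0%	releases that took the lru lock
	    15 / 771550  ~= 0.002%	lock taken but nothing removed
	771535 + 12      =  771547	lru deletes fully accounted for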

LRU_PLUS_CACHE = 2, without LRU_DEL_HEURISTIC:

time sh forklots
25.45user 28.70system 0:27.28elapsed 198%CPU

cat /proc/mmstats
page_cache_release 8611106
page_cache_release_test_lru 0
page_cache_release_del_lru 0
shrink_cache_del_lru 614573
put_page_testzero 9832924
lru_cache_del 11
truncate_lru_cache_del 11

Omitting the heuristic forces shrink_cache to pick up the orphaned lru pages,
which costs about 3% in terms of wall clock time on this test. At least this
exercises the orphan recovery path in shrink_cache.

LRU_PLUS_CACHE = 2, with LRU_DEL_HEURISTIC, no statistics:

time sh forklots
24.74user 28.02system 0:26.39elapsed 199%CPU

The statistics cost about .5%, which is basically noise.

Vanilla 2.4.19:

time sh forklots
24.15user 28.44system 0:26.31elapsed 199%CPU

Vanilla 2.4.19 may or may not have an advantage of .3%. On the other hand,
removing the atomic test for lru membership probably gives a slight advantage
to the new version.

The patch is against 2.4.19, apply with patch -p1:

--- 2.4.19.clean/fs/proc/proc_misc.c Fri Aug 2 20:39:45 2002
+++ 2.4.19/fs/proc/proc_misc.c Fri Aug 30 21:32:20 2002
@@ -539,6 +539,28 @@
entry->proc_fops = f;
}

+struct mmstats mmstats;
+
+char *mmnames[] = {
+ "page_cache_release",
+ "page_cache_release_test_lru",
+ "page_cache_release_del_lru",
+ "shrink_cache_del_lru",
+ "put_page_testzero",
+ "lru_cache_del",
+ "truncate_lru_cache_del",
+ NULL,
+};
+
+static int read_mmstats(char *page, char **start, off_t off, int count, int *eof, void *data)
+{
+ char **names = mmnames;
+ unsigned *stats = (unsigned *) &mmstats, len = 0;
+ for (; *names; *names++, *stats++) if (strlen(*names))
+ len += sprintf(page + len, "%s %u\n", *names, *stats);
+ return proc_calc_metrics(page, start, off, count, eof, len);
+}
+
void __init proc_misc_init(void)
{
struct proc_dir_entry *entry;
@@ -546,6 +568,7 @@
char *name;
int (*read_proc)(char*,char**,off_t,int,int*,void*);
} *p, simple_ones[] = {
+ {"mmstats", read_mmstats},
{"loadavg", loadavg_read_proc},
{"uptime", uptime_read_proc},
{"meminfo", meminfo_read_proc},
--- 2.4.19.clean/include/linux/mm.h Fri Aug 2 20:39:45 2002
+++ 2.4.19/include/linux/mm.h Fri Aug 30 21:41:54 2002
@@ -35,6 +35,20 @@
* mmap() functions).
*/

+extern struct mmstats {
+ unsigned release;
+ unsigned release_test_page;
+ unsigned release_free_page;
+ unsigned vmscan_free_page;
+ unsigned testzero;
+ unsigned lru_cache_del;
+ unsigned truncate_lru_cache_del;
+} mmstats;
+
+extern char *mmnames[];
+
+#define mmstat(field) mmstats.field++
+
/*
* This struct defines a memory VMM memory area. There is one of these
* per VM-area/task. A VM area is any part of the process virtual memory
@@ -180,6 +194,15 @@
} mem_map_t;

/*
+ * There is only one 'core' page-freeing function.
+ */
+extern void FASTCALL(__free_pages(struct page *page, unsigned int order));
+extern void FASTCALL(free_pages(unsigned long addr, unsigned int order));
+
+#define __free_page(page) __free_pages((page), 0)
+#define free_page(addr) free_pages((addr),0)
+
+/*
* Methods to modify the page usage count.
*
* What counts for a page usage:
@@ -191,12 +214,34 @@
* Also, many kernel routines increase the page count before a critical
* routine so they can be sure the page doesn't go away from under them.
*/
-#define get_page(p) atomic_inc(&(p)->count)
-#define put_page(p) __free_page(p)
-#define put_page_testzero(p) atomic_dec_and_test(&(p)->count)
#define page_count(p) atomic_read(&(p)->count)
#define set_page_count(p,v) atomic_set(&(p)->count, v)

+extern void FASTCALL(__free_pages(struct page *page, unsigned int order));
+
+static inline void get_page(struct page *page)
+{
+ atomic_inc(&page->count);
+}
+
+static inline void put_page_nofree(struct page *page)
+{
+ BUG_ON(!page_count(page));
+ atomic_dec(&page->count);
+}
+
+static inline int put_page_testzero(struct page *page)
+{
+ mmstat(testzero);
+ BUG_ON(!page_count(page));
+ return atomic_dec_and_test(&page->count);
+}
+
+static inline void put_page(struct page *page)
+{
+ __free_page(page);
+}
+
/*
* Various page->flags bits:
*
@@ -451,15 +496,6 @@
*/
#define get_free_page get_zeroed_page

-/*
- * There is only one 'core' page-freeing function.
- */
-extern void FASTCALL(__free_pages(struct page *page, unsigned int order));
-extern void FASTCALL(free_pages(unsigned long addr, unsigned int order));
-
-#define __free_page(page) __free_pages((page), 0)
-#define free_page(addr) free_pages((addr),0)
-
extern void show_free_areas(void);
extern void show_free_areas_node(pg_data_t *pgdat);

@@ -518,11 +554,6 @@

extern struct address_space swapper_space;
#define PageSwapCache(page) ((page)->mapping == &swapper_space)
-
-static inline int is_page_cache_freeable(struct page * page)
-{
- return page_count(page) - !!page->buffers == 1;
-}

extern int can_share_swap_page(struct page *);
extern int remove_exclusive_swap_page(struct page *);
--- 2.4.19.clean/include/linux/pagemap.h Mon Feb 25 14:38:13 2002
+++ 2.4.19/include/linux/pagemap.h Fri Aug 30 22:17:21 2002
@@ -27,9 +27,38 @@
#define PAGE_CACHE_SIZE PAGE_SIZE
#define PAGE_CACHE_MASK PAGE_MASK
#define PAGE_CACHE_ALIGN(addr) (((addr)+PAGE_CACHE_SIZE-1)&PAGE_CACHE_MASK)
+#define LRU_PLUS_CACHE 2
+#define LRU_DEL_HEURISTIC

#define page_cache_get(x) get_page(x)
-#define page_cache_release(x) __free_page(x)
+
+static inline void page_cache_release(struct page *page)
+{
+ mmstat(release);
+#if LRU_PLUS_CACHE==2
+#ifdef LRU_DEL_HEURISTIC
+ if (page_count(page) == 2 && spin_trylock(&pagemap_lru_lock)) {
+ mmstat(release_test_page);
+ if (PageLRU(page) && page_count(page) == 2) {
+ mmstat(release_free_page);
+ __lru_cache_del(page);
+ }
+ spin_unlock(&pagemap_lru_lock);
+ }
+#endif
+#endif
+ put_page(page);
+}
+
+static inline int has_buffers(struct page *page)
+{
+ return !!page->buffers;
+}
+
+static inline int is_page_cache_freeable(struct page *page)
+{
+ return page_count(page) - has_buffers(page) == LRU_PLUS_CACHE;
+}

static inline struct page *page_cache_alloc(struct address_space *x)
{
--- 2.4.19.clean/mm/filemap.c Fri Aug 2 20:39:46 2002
+++ 2.4.19/mm/filemap.c Fri Aug 30 21:40:33 2002
@@ -198,10 +198,12 @@
if (page->buffers && !try_to_free_buffers(page, 0))
goto unlock;

- if (page_count(page) != 1)
+ if (page_count(page) != LRU_PLUS_CACHE)
goto unlock;

+#if LRU_PLUS_CACHE==1
__lru_cache_del(page);
+#endif
__remove_inode_page(page);
UnlockPage(page);
page_cache_release(page);
@@ -234,8 +236,10 @@
static void truncate_complete_page(struct page *page)
{
/* Leave it on the LRU if it gets converted into anonymous buffers */
- if (!page->buffers || do_flushpage(page, 0))
+ if (!page->buffers || do_flushpage(page, 0)) {
+ mmstat(truncate_lru_cache_del);
lru_cache_del(page);
+ }

/*
* We remove the page from the page cache _after_ we have
@@ -345,7 +349,7 @@
* The page is locked and we hold the pagecache_lock as well
* so both page_count(page) and page->buffers stays constant here.
*/
- if (page_count(page) == 1 + !!page->buffers) {
+ if (is_page_cache_freeable(page)) {
/* Restart after this page */
list_del(head);
list_add_tail(head, curr);
--- 2.4.19.clean/mm/page_alloc.c Fri Aug 2 20:39:46 2002
+++ 2.4.19/mm/page_alloc.c Fri Aug 30 21:55:03 2002
@@ -82,9 +82,12 @@
/* Yes, think what happens when other parts of the kernel take
* a reference to a page in order to pin it for io. -ben
*/
+#if LRU_PLUS_CACHE==2
+ BUG_ON(PageLRU(page));
+#else
if (PageLRU(page))
lru_cache_del(page);
-
+#endif
if (page->buffers)
BUG();
if (page->mapping)
--- 2.4.19.clean/mm/swap.c Wed Nov 7 01:44:20 2001
+++ 2.4.19/mm/swap.c Fri Aug 30 21:32:20 2002
@@ -60,6 +60,9 @@
if (!TestSetPageLRU(page)) {
spin_lock(&pagemap_lru_lock);
add_page_to_inactive_list(page);
+#if LRU_PLUS_CACHE==2
+ get_page(page);
+#endif
spin_unlock(&pagemap_lru_lock);
}
}
@@ -79,6 +82,10 @@
} else {
del_page_from_inactive_list(page);
}
+ mmstat(lru_cache_del);
+#if LRU_PLUS_CACHE==2
+ put_page_nofree(page);
+#endif
}
}

--- 2.4.19.clean/mm/swapfile.c Fri Aug 2 20:39:46 2002
+++ 2.4.19/mm/swapfile.c Fri Aug 30 21:32:20 2002
@@ -239,7 +239,7 @@
if (p->swap_map[SWP_OFFSET(entry)] == 1) {
/* Recheck the page count with the pagecache lock held.. */
spin_lock(&pagecache_lock);
- if (page_count(page) - !!page->buffers == 2)
+ if (page_count(page) - has_buffers(page) == LRU_PLUS_CACHE + 1)
retval = 1;
spin_unlock(&pagecache_lock);
}
@@ -263,16 +263,16 @@
if (!PageLocked(page))
BUG();
switch (page_count(page)) {
- case 3:
- if (!page->buffers)
+ case LRU_PLUS_CACHE + 2:
+ if (!has_buffers(page))
break;
/* Fallthrough */
- case 2:
+ case LRU_PLUS_CACHE + 1:
if (!PageSwapCache(page))
break;
retval = exclusive_swap_page(page);
break;
- case 1:
+ case LRU_PLUS_CACHE:
if (PageReserved(page))
break;
retval = 1;
@@ -280,6 +280,11 @@
return retval;
}

+static inline int only_cached(struct page *page)
+{
+ return page_count(page) - has_buffers(page) == LRU_PLUS_CACHE + 1;
+}
+
/*
* Work out if there are any other processes sharing this
* swap cache page. Free it if you can. Return success.
@@ -294,7 +299,7 @@
BUG();
if (!PageSwapCache(page))
return 0;
- if (page_count(page) - !!page->buffers != 2) /* 2: us + cache */
+ if (!only_cached(page))
return 0;

entry.val = page->index;
@@ -307,7 +312,7 @@
if (p->swap_map[SWP_OFFSET(entry)] == 1) {
/* Recheck the page count with the pagecache lock held.. */
spin_lock(&pagecache_lock);
- if (page_count(page) - !!page->buffers == 2) {
+ if (only_cached(page)) {
__delete_from_swap_cache(page);
SetPageDirty(page);
retval = 1;
@@ -343,7 +348,7 @@
if (page) {
page_cache_get(page);
/* Only cache user (+us), or swap space full? Free it! */
- if (page_count(page) - !!page->buffers == 2 || vm_swap_full()) {
+ if (only_cached(page) || vm_swap_full()) {
delete_from_swap_cache(page);
SetPageDirty(page);
}
--- 2.4.19.clean/mm/vmscan.c Fri Aug 2 20:39:46 2002
+++ 2.4.19/mm/vmscan.c Fri Aug 30 21:40:30 2002
@@ -89,7 +89,7 @@
mm->rss--;
UnlockPage(page);
{
- int freeable = page_count(page) - !!page->buffers <= 2;
+ int freeable = page_count(page) - has_buffers(page) <= LRU_PLUS_CACHE + 1;
page_cache_release(page);
return freeable;
}
@@ -356,20 +356,30 @@
BUG_ON(PageActive(page));

list_del(entry);
+#if LRU_PLUS_CACHE==2
+ BUG_ON(!page_count(page));
+ if (unlikely(page_count(page) == 1)) {
+ mmstat(vmscan_free_page);
+ BUG_ON(!TestClearPageLRU(page)); // side effect abuse!!
+ put_page(page);
+ continue;
+ }
+#endif
list_add(entry, &inactive_list);

+#if LRU_PLUS_CACHE==1
/*
* Zero page counts can happen because we unlink the pages
* _after_ decrementing the usage count..
*/
if (unlikely(!page_count(page)))
continue;
-
+#endif
if (!memclass(page_zone(page), classzone))
continue;

/* Racy check to avoid trylocking when not worthwhile */
- if (!page->buffers && (page_count(page) != 1 || !page->mapping))
+ if (!page->buffers && (page_count(page) != LRU_PLUS_CACHE || !page->mapping))
goto page_mapped;

/*
@@ -434,9 +444,9 @@
*/
spin_lock(&pagemap_lru_lock);
UnlockPage(page);
+#if LRU_PLUS_CACHE==1
__lru_cache_del(page);
-
- /* effectively free the page here */
+#endif
page_cache_release(page);

if (--nr_pages)
@@ -505,10 +515,10 @@
swap_free(swap);
}

+#if LRU_PLUS_CACHE==1
__lru_cache_del(page);
+#endif
UnlockPage(page);
-
- /* effectively free the page here */
page_cache_release(page);

if (--nr_pages)

2002-08-31 16:10:25

by Christian Ehrhardt

[permalink] [raw]
Subject: Re: [RFC] [PATCH] Include LRU in page count

On Sat, Aug 31, 2002 at 01:03:07AM +0200, Daniel Phillips wrote:
> This rare race happened to become not so rare in 2.5 recently, and was
> analyzed by Christian Ehrhardt, who also proposed a solution based on a new
> approach to locking, essentially put_page_testone. We went on to check 2.4

Just a little correction: The key function implemented in my solution
is an atomic GET_page_testone which is called if the page might have
a zero refcount. The idea I had in mind is to distinguish heavy and weak
references to pages. Your solution is probably the better way to go.

> I proposed an alternate solution using the traditional put_page_testzero
> primitive, which relies on assigning a page count of one for membership on
> the lru list. A slightly racy heuristic is used for efficient lru list
> removal. The resulting incarnation of page_cache_release is:
>
> static inline void page_cache_release(struct page *page)
> {
> if (page_count(page) == 2 && spin_trylock(&pagemap_lru_lock)) {
> if (PageLRU(page) && page_count(page) == 2)
> __lru_cache_del(page);
> spin_unlock(&pagemap_lru_lock);
> }
> put_page(page);
> }

Just saw that this can still race e.g. with lru_cache_add (not
hard to fix though):

| void lru_cache_add(struct page * page)
| {
| if (!TestSetPageLRU(page)) {

Window is here: Once we set the PageLRU bit page_cache_release
assumes that there is a reference held by the lru cache.

| spin_lock(&pagemap_lru_lock);
| add_page_to_inactive_list(page);
|#if LRU_PLUS_CACHE==2
| get_page(page);
|#endif

But only starting at this point the reference actually exists.

| spin_unlock(&pagemap_lru_lock);
| }
|}

Solution: Change the PageLRU bit inside the lock. Looks like
lru_cache_add is the only place that doesn't hold the lru lock to
change the LRU flag and it's probably not a good idea even without
the patch.

Two more comments: I don't think it is a good idea to use
put_page_nofree in __lru_cache_del. This is probably safe now but
it adds an additional rule that lru_cache_del can't be called without
holding a second reference to the page.
Also there may be lru-only pages on the active list, i.e. refill_inactive
should have this hunk as well:

> +#if LRU_PLUS_CACHE==2
> + BUG_ON(!page_count(page));
> + if (unlikely(page_count(page) == 1)) {
> + mmstat(vmscan_free_page);
> + BUG_ON(!TestClearPageLRU(page)); // side effect abuse!!
> + put_page(page);
> + continue;
> + }
> +#endif

regards Christian Ehrhardt

--
THAT'S ALL FOLKS!

2002-08-31 17:37:55

by Andrew Morton

[permalink] [raw]
Subject: Re: [RFC] [PATCH] Include LRU in page count

Christian Ehrhardt wrote:
>
> ...
> Solution: Change the PageLRU bit inside the lock. Looks like
> lru_cache_add is the only place that doesn't hold the lru lock to
> change the LRU flag and it's probably not a good idea even without
> the patch.
>

2.4.20-pre already does that:

void lru_cache_add(struct page * page)
{
	if (!PageLRU(page)) {
		spin_lock(&pagemap_lru_lock);
		if (!TestSetPageLRU(page))
			add_page_to_inactive_list(page);
		spin_unlock(&pagemap_lru_lock);
	}
}

there was an oopsing race with activate_page()...

2002-08-31 19:58:29

by Daniel Phillips

[permalink] [raw]
Subject: Re: [RFC] [PATCH] Include LRU in page count

On Saturday 31 August 2002 18:14, Christian Ehrhardt wrote:
> On Sat, Aug 31, 2002 at 01:03:07AM +0200, Daniel Phillips wrote:
> > static inline void page_cache_release(struct page *page)
> > {
> > if (page_count(page) == 2 && spin_trylock(&pagemap_lru_lock)) {
> > if (PageLRU(page) && page_count(page) == 2)
> > __lru_cache_del(page);
> > spin_unlock(&pagemap_lru_lock);
> > }
> > put_page(page);
> > }
>
> Just saw that this can still race e.g. with lru_cache_add (not
> hard to fix though):
>
> | void lru_cache_add(struct page * page)
> | {
> | if (!TestSetPageLRU(page)) {
>
> Window is here: Once we set the PageLRU bit page_cache_release
> assumes that there is a reference held by the lru cache.
>
> | spin_lock(&pagemap_lru_lock);
> | add_page_to_inactive_list(page);
> |#if LRU_PLUS_CACHE==2
> | get_page(page);
> |#endif
>
> But only starting at this point the reference actually exists.
>
> | spin_unlock(&pagemap_lru_lock);
> | }
> |}
>
> Solution: Change the PageLRU bit inside the lock. Looks like
> lru_cache_add is the only place that doesn't hold the lru lock to
> change the LRU flag and it's probably not a good idea even without
> the patch.

Thanks. I sniffed at that very thing a little yesterday and decided that
lru_cache_add must always be called on a page that isn't on the lru. It's
really hard to prove that though - just look at the acres of sewage in
filemap.c and the various creepy-crawly things that call put_dirty_page.

Coding it so the race isn't even possible is a much better idea. In the
process of getting rid of the TestSet and TestClear bitops, I'd
inadvertently done that already, this way:

void lru_cache_add(struct page * page)
{
	spin_lock(&pagemap_lru_lock);
	BUG_ON(PageLRU(page));
	SetPageLRU(page);
	get_page(page);
	add_page_to_inactive_list(page);
	spin_unlock(&pagemap_lru_lock);
}

So at least we get an easy-to-understand BUG instead of a much harder
to find side effect. That said, the assumption that a page is never
on the lru at the time of this call is too fragile at best and likely
wrong - see shmem_writepage->add_to_page_cache_locked.

So we want:

void lru_cache_add(struct page * page)
{
	spin_lock(&pagemap_lru_lock);
	if (likely(!PageLRU(page))) {
		add_page_to_inactive_list(page);
		SetPageLRU(page);
		get_page(page);
	}
	spin_unlock(&pagemap_lru_lock);
}

> Two more comments: I don't think it is a good idea to use
> put_page_nofree in __lru_cache_del. This is probably safe now but
> it adds an additional rule that lru_cache_del can't be called without
> holding a second reference to the page.

That was intentional. We always do hold an additional reference where
this is used, and if that changes in the future the author better at
least take a look at the rules. Another way of looking at it: since
all the code paths that use this rely on the page staying around for
further operations after the lru delete, it is a bug for sure if the
count goes to zero during the lru delete.

I'd call that one a documentation bug.

> Also there may be lru only pages on the active list, i.e. refill
> inactive should have this hunk as well:
>
> > +#if LRU_PLUS_CACHE==2
> > + BUG_ON(!page_count(page));
> > + if (unlikely(page_count(page) == 1)) {
> > + mmstat(vmscan_free_page);
> > + BUG_ON(!TestClearPageLRU(page)); // side effect abuse!!
> > + put_page(page);
> > + continue;
> > + }
> > +#endif

If we have orphans on the active list, we'd probably better just count
them and figure out what we're doing wrong to put them there in the first
place. In time they will migrate to the inactive list and get cleaned
up.

Yesterday's patch had a pretty big performance bug: at the bottom of the
shrink_cache loop where a cache page is finally removed, I had:

	UnlockPage(page);
	page_cache_release(page);

but we hold the lru lock there, so the page always becomes an orphan, to
be picked up later. The sensible thing to do is:

	__lru_cache_del(page);
	UnlockPage(page);
	put_page(page);

Besides that, I cleaned up yesterday's side-effecty BUG_ON, and removed the
TestSet/Clear to just Set/ClearPageLRU. It's tempting to do a lot more,
but with some effort I resisted that.

Manfred suggested an approach to de-racing this race using
atomic_dec_and_lock, which needs to be compared to the current approach.
More on that later.

--
Daniel

--- 2.4.19.clean/fs/proc/proc_misc.c Fri Aug 2 20:39:45 2002
+++ 2.4.19/fs/proc/proc_misc.c Fri Aug 30 21:32:20 2002
@@ -539,6 +539,28 @@
entry->proc_fops = f;
}

+struct mmstats mmstats;
+
+char *mmnames[] = {
+ "page_cache_release",
+ "page_cache_release_test_lru",
+ "page_cache_release_del_lru",
+ "shrink_cache_del_lru",
+ "put_page_testzero",
+ "lru_cache_del",
+ "truncate_lru_cache_del",
+ NULL,
+};
+
+static int read_mmstats(char *page, char **start, off_t off, int count, int *eof, void *data)
+{
+ char **names = mmnames;
+ unsigned *stats = (unsigned *) &mmstats, len = 0;
+ for (; *names; *names++, *stats++) if (strlen(*names))
+ len += sprintf(page + len, "%s %u\n", *names, *stats);
+ return proc_calc_metrics(page, start, off, count, eof, len);
+}
+
void __init proc_misc_init(void)
{
struct proc_dir_entry *entry;
@@ -546,6 +568,7 @@
char *name;
int (*read_proc)(char*,char**,off_t,int,int*,void*);
} *p, simple_ones[] = {
+ {"mmstats", read_mmstats},
{"loadavg", loadavg_read_proc},
{"uptime", uptime_read_proc},
{"meminfo", meminfo_read_proc},
--- 2.4.19.clean/include/linux/mm.h Fri Aug 2 20:39:45 2002
+++ 2.4.19/include/linux/mm.h Sat Aug 31 18:56:17 2002
@@ -35,6 +35,20 @@
* mmap() functions).
*/

+extern struct mmstats {
+ unsigned release;
+ unsigned release_test_page;
+ unsigned release_free_page;
+ unsigned vmscan_free_page;
+ unsigned testzero;
+ unsigned lru_cache_del;
+ unsigned truncate_lru_cache_del;
+} mmstats;
+
+extern char *mmnames[];
+
+#define mmstat(field) mmstats.field++
+
/*
* This struct defines a memory VMM memory area. There is one of these
* per VM-area/task. A VM area is any part of the process virtual memory
@@ -180,6 +194,15 @@
} mem_map_t;

/*
+ * There is only one 'core' page-freeing function.
+ */
+extern void FASTCALL(__free_pages(struct page *page, unsigned int order));
+extern void FASTCALL(free_pages(unsigned long addr, unsigned int order));
+
+#define __free_page(page) __free_pages((page), 0)
+#define free_page(addr) free_pages((addr),0)
+
+/*
* Methods to modify the page usage count.
*
* What counts for a page usage:
@@ -191,12 +214,34 @@
* Also, many kernel routines increase the page count before a critical
* routine so they can be sure the page doesn't go away from under them.
*/
-#define get_page(p) atomic_inc(&(p)->count)
-#define put_page(p) __free_page(p)
-#define put_page_testzero(p) atomic_dec_and_test(&(p)->count)
#define page_count(p) atomic_read(&(p)->count)
#define set_page_count(p,v) atomic_set(&(p)->count, v)

+extern void FASTCALL(__free_pages(struct page *page, unsigned int order));
+
+static inline void get_page(struct page *page)
+{
+ atomic_inc(&page->count);
+}
+
+static inline void put_page_nofree(struct page *page)
+{
+ BUG_ON(!page_count(page));
+ atomic_dec(&page->count);
+}
+
+static inline int put_page_testzero(struct page *page)
+{
+ mmstat(testzero);
+ BUG_ON(!page_count(page));
+ return atomic_dec_and_test(&page->count);
+}
+
+static inline void put_page(struct page *page)
+{
+ __free_page(page);
+}
+
/*
* Various page->flags bits:
*
@@ -394,6 +439,8 @@
#define ClearPageActive(page) clear_bit(PG_active, &(page)->flags)

#define PageLRU(page) test_bit(PG_lru, &(page)->flags)
+#define SetPageLRU(page) set_bit(PG_lru, &(page)->flags)
+#define ClearPageLRU(page) clear_bit(PG_lru, &(page)->flags)
#define TestSetPageLRU(page) test_and_set_bit(PG_lru, &(page)->flags)
#define TestClearPageLRU(page) test_and_clear_bit(PG_lru, &(page)->flags)

@@ -451,15 +498,6 @@
*/
#define get_free_page get_zeroed_page

-/*
- * There is only one 'core' page-freeing function.
- */
-extern void FASTCALL(__free_pages(struct page *page, unsigned int order));
-extern void FASTCALL(free_pages(unsigned long addr, unsigned int order));
-
-#define __free_page(page) __free_pages((page), 0)
-#define free_page(addr) free_pages((addr),0)
-
extern void show_free_areas(void);
extern void show_free_areas_node(pg_data_t *pgdat);

@@ -518,11 +556,6 @@

extern struct address_space swapper_space;
#define PageSwapCache(page) ((page)->mapping == &swapper_space)
-
-static inline int is_page_cache_freeable(struct page * page)
-{
- return page_count(page) - !!page->buffers == 1;
-}

extern int can_share_swap_page(struct page *);
extern int remove_exclusive_swap_page(struct page *);
--- 2.4.19.clean/include/linux/pagemap.h Mon Feb 25 14:38:13 2002
+++ 2.4.19/include/linux/pagemap.h Sat Aug 31 18:58:43 2002
@@ -27,9 +27,38 @@
#define PAGE_CACHE_SIZE PAGE_SIZE
#define PAGE_CACHE_MASK PAGE_MASK
#define PAGE_CACHE_ALIGN(addr) (((addr)+PAGE_CACHE_SIZE-1)&PAGE_CACHE_MASK)
+#define LRU_PLUS_CACHE 2
+#define LRU_DEL_HEURISTIC

#define page_cache_get(x) get_page(x)
-#define page_cache_release(x) __free_page(x)
+
+static inline void page_cache_release(struct page *page)
+{
+ mmstat(release);
+#if LRU_PLUS_CACHE==2
+#ifdef LRU_DEL_HEURISTIC
+ if (page_count(page) == 2 && spin_trylock(&pagemap_lru_lock)) {
+ mmstat(release_test_page);
+ if (PageLRU(page) && page_count(page) == 2) {
+ mmstat(release_free_page);
+ __lru_cache_del(page);
+ }
+ spin_unlock(&pagemap_lru_lock);
+ }
+#endif
+#endif
+ put_page(page);
+}
+
+static inline int has_buffers(struct page *page)
+{
+ return !!page->buffers;
+}
+
+static inline int is_page_cache_freeable(struct page *page)
+{
+ return page_count(page) - has_buffers(page) == LRU_PLUS_CACHE;
+}

static inline struct page *page_cache_alloc(struct address_space *x)
{
--- 2.4.19.clean/include/linux/swap.h Fri Aug 2 20:39:46 2002
+++ 2.4.19/include/linux/swap.h Sat Aug 31 18:56:11 2002
@@ -168,17 +168,9 @@
* List add/del helper macros. These must be called
* with the pagemap_lru_lock held!
*/
-#define DEBUG_LRU_PAGE(page) \
-do { \
- if (!PageLRU(page)) \
- BUG(); \
- if (PageActive(page)) \
- BUG(); \
-} while (0)
-
#define add_page_to_active_list(page) \
do { \
- DEBUG_LRU_PAGE(page); \
+ BUG_ON(PageActive(page)); \
SetPageActive(page); \
list_add(&(page)->lru, &active_list); \
nr_active_pages++; \
@@ -186,13 +178,13 @@

#define add_page_to_inactive_list(page) \
do { \
- DEBUG_LRU_PAGE(page); \
list_add(&(page)->lru, &inactive_list); \
nr_inactive_pages++; \
} while (0)

#define del_page_from_active_list(page) \
do { \
+ BUG_ON(!PageActive(page)); \
list_del(&(page)->lru); \
ClearPageActive(page); \
nr_active_pages--; \
--- 2.4.19.clean/mm/filemap.c Fri Aug 2 20:39:46 2002
+++ 2.4.19/mm/filemap.c Fri Aug 30 21:40:33 2002
@@ -198,10 +198,12 @@
if (page->buffers && !try_to_free_buffers(page, 0))
goto unlock;

- if (page_count(page) != 1)
+ if (page_count(page) != LRU_PLUS_CACHE)
goto unlock;

+#if LRU_PLUS_CACHE==1
__lru_cache_del(page);
+#endif
__remove_inode_page(page);
UnlockPage(page);
page_cache_release(page);
@@ -234,8 +236,10 @@
static void truncate_complete_page(struct page *page)
{
/* Leave it on the LRU if it gets converted into anonymous buffers */
- if (!page->buffers || do_flushpage(page, 0))
+ if (!page->buffers || do_flushpage(page, 0)) {
+ mmstat(truncate_lru_cache_del);
lru_cache_del(page);
+ }

/*
* We remove the page from the page cache _after_ we have
@@ -345,7 +349,7 @@
* The page is locked and we hold the pagecache_lock as well
* so both page_count(page) and page->buffers stays constant here.
*/
- if (page_count(page) == 1 + !!page->buffers) {
+ if (is_page_cache_freeable(page)) {
/* Restart after this page */
list_del(head);
list_add_tail(head, curr);
--- 2.4.19.clean/mm/page_alloc.c Fri Aug 2 20:39:46 2002
+++ 2.4.19/mm/page_alloc.c Fri Aug 30 21:55:03 2002
@@ -82,9 +82,12 @@
/* Yes, think what happens when other parts of the kernel take
* a reference to a page in order to pin it for io. -ben
*/
+#if LRU_PLUS_CACHE==2
+ BUG_ON(PageLRU(page));
+#else
if (PageLRU(page))
lru_cache_del(page);
-
+#endif
if (page->buffers)
BUG();
if (page->mapping)
--- 2.4.19.clean/mm/swap.c Wed Nov 7 01:44:20 2001
+++ 2.4.19/mm/swap.c Sat Aug 31 18:56:07 2002
@@ -57,11 +57,31 @@
*/
void lru_cache_add(struct page * page)
{
+#if LRU_PLUS_CACHE==2
+ spin_lock(&pagemap_lru_lock);
+ if (likely(!PageLRU(page))) {
+ add_page_to_inactive_list(page);
+ SetPageLRU(page);
+ get_page(page);
+ }
+ spin_unlock(&pagemap_lru_lock);
+#else
if (!TestSetPageLRU(page)) {
spin_lock(&pagemap_lru_lock);
add_page_to_inactive_list(page);
spin_unlock(&pagemap_lru_lock);
}
+#endif
+}
+
+static inline void del_page_from_lru_list(struct page *page)
+{
+ BUG_ON(!PageLRU(page));
+ if (PageActive(page))
+ del_page_from_active_list(page);
+ else
+ del_page_from_inactive_list(page);
+ mmstat(lru_cache_del);
}

/**
@@ -73,13 +93,14 @@
*/
void __lru_cache_del(struct page * page)
{
- if (TestClearPageLRU(page)) {
- if (PageActive(page)) {
- del_page_from_active_list(page);
- } else {
- del_page_from_inactive_list(page);
- }
- }
+#if LRU_PLUS_CACHE==2
+ del_page_from_lru_list(page);
+ ClearPageLRU(page);
+ put_page_nofree(page);
+#else
+ if (TestClearPageLRU(page))
+ del_page_from_lru_list(page);
+#endif
}

/**
--- 2.4.19.clean/mm/swapfile.c Fri Aug 2 20:39:46 2002
+++ 2.4.19/mm/swapfile.c Fri Aug 30 21:32:20 2002
@@ -239,7 +239,7 @@
if (p->swap_map[SWP_OFFSET(entry)] == 1) {
/* Recheck the page count with the pagecache lock held.. */
spin_lock(&pagecache_lock);
- if (page_count(page) - !!page->buffers == 2)
+ if (page_count(page) - has_buffers(page) == LRU_PLUS_CACHE + 1)
retval = 1;
spin_unlock(&pagecache_lock);
}
@@ -263,16 +263,16 @@
if (!PageLocked(page))
BUG();
switch (page_count(page)) {
- case 3:
- if (!page->buffers)
+ case LRU_PLUS_CACHE + 2:
+ if (!has_buffers(page))
break;
/* Fallthrough */
- case 2:
+ case LRU_PLUS_CACHE + 1:
if (!PageSwapCache(page))
break;
retval = exclusive_swap_page(page);
break;
- case 1:
+ case LRU_PLUS_CACHE:
if (PageReserved(page))
break;
retval = 1;
@@ -280,6 +280,11 @@
return retval;
}

+static inline int only_cached(struct page *page)
+{
+ return page_count(page) - has_buffers(page) == LRU_PLUS_CACHE + 1;
+}
+
/*
* Work out if there are any other processes sharing this
* swap cache page. Free it if you can. Return success.
@@ -294,7 +299,7 @@
BUG();
if (!PageSwapCache(page))
return 0;
- if (page_count(page) - !!page->buffers != 2) /* 2: us + cache */
+ if (!only_cached(page))
return 0;

entry.val = page->index;
@@ -307,7 +312,7 @@
if (p->swap_map[SWP_OFFSET(entry)] == 1) {
/* Recheck the page count with the pagecache lock held.. */
spin_lock(&pagecache_lock);
- if (page_count(page) - !!page->buffers == 2) {
+ if (only_cached(page)) {
__delete_from_swap_cache(page);
SetPageDirty(page);
retval = 1;
@@ -343,7 +348,7 @@
if (page) {
page_cache_get(page);
/* Only cache user (+us), or swap space full? Free it! */
- if (page_count(page) - !!page->buffers == 2 || vm_swap_full()) {
+ if (only_cached(page) || vm_swap_full()) {
delete_from_swap_cache(page);
SetPageDirty(page);
}
--- 2.4.19.clean/mm/vmscan.c Fri Aug 2 20:39:46 2002
+++ 2.4.19/mm/vmscan.c Sat Aug 31 19:15:20 2002
@@ -89,7 +89,7 @@
mm->rss--;
UnlockPage(page);
{
- int freeable = page_count(page) - !!page->buffers <= 2;
+ int freeable = page_count(page) - has_buffers(page) <= LRU_PLUS_CACHE + 1;
page_cache_release(page);
return freeable;
}
@@ -356,20 +356,31 @@
BUG_ON(PageActive(page));

list_del(entry);
+#if LRU_PLUS_CACHE==2
+ BUG_ON(!page_count(page));
+ if (unlikely(page_count(page) == 1)) {
+ mmstat(vmscan_free_page);
+ BUG_ON(!PageLRU(page));
+ ClearPageLRU(page);
+ put_page(page);
+ continue;
+ }
+#endif
list_add(entry, &inactive_list);

+#if LRU_PLUS_CACHE==1
/*
* Zero page counts can happen because we unlink the pages
* _after_ decrementing the usage count..
*/
if (unlikely(!page_count(page)))
continue;
-
+#endif
if (!memclass(page_zone(page), classzone))
continue;

/* Racy check to avoid trylocking when not worthwhile */
- if (!page->buffers && (page_count(page) != 1 || !page->mapping))
+ if (!page->buffers && (page_count(page) != LRU_PLUS_CACHE || !page->mapping))
goto page_mapped;

/*
@@ -434,9 +445,9 @@
*/
spin_lock(&pagemap_lru_lock);
UnlockPage(page);
+#if LRU_PLUS_CACHE==1
__lru_cache_del(page);
-
- /* effectively free the page here */
+#endif
page_cache_release(page);

if (--nr_pages)
@@ -507,9 +518,7 @@

__lru_cache_del(page);
UnlockPage(page);
-
- /* effectively free the page here */
- page_cache_release(page);
+ put_page(page);

if (--nr_pages)
continue;

2002-08-31 20:10:25

by Andrew Morton

[permalink] [raw]
Subject: Re: [RFC] [PATCH] Include LRU in page count

Daniel Phillips wrote:
>
> Manfred suggested an approach to de-racing this race using
> atomic_dec_and_lock, which needs to be compared to the current approach.

Could simplify things, but not all architectures have an optimised
version. So ia64, mips, parisc and s390 would end up taking
the lru lock on every page_cache_release.
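
For comparison, the dec_and_lock variant would presumably look something
like the following under the existing rules, where the lru holds no
reference of its own (a sketch of the idea only, not Manfred's actual
proposal; the exact call placement is assumed):

	void page_cache_release(struct page *page)
	{
		/* only touch the lru lock when the count really hits zero */
		if (atomic_dec_and_lock(&page->count, &pagemap_lru_lock)) {
			if (PageLRU(page))
				__lru_cache_del(page);
			spin_unlock(&pagemap_lru_lock);
			__free_pages_ok(page, 0);
		}
	}

On architectures without an optimised atomic_dec_and_lock() the generic
fallback simply takes the lock before decrementing, which is the
per-release lru lock cost referred to above.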

2002-08-31 21:20:35

by Daniel Phillips

[permalink] [raw]
Subject: Re: [RFC] [PATCH] Include LRU in page count

On Saturday 31 August 2002 22:26, Andrew Morton wrote:
> Daniel Phillips wrote:
> > Manfred suggested an approach to de-racing this race using
> > atomic_dec_and_lock, which needs to be compared to the current approach.
>
> Could simplify things, but not all architectures have an optimised
> version. So ia64, mips, parisc and s390 would end up taking
> the lru lock on every page_cache_release.

As far as implementing it goes, some instances of page_cache_release are
already under the lru lock and would need an alternative, non-locking
version.

The current patch seems satisfactory performance-wise and if it's
also raceless as it's supposed to be, it gives us something that works,
and we can evaluate alternatives at our leisure. Right now I'm afraid
we have something that just works most of the time.

I think we're getting to the point where this needs to get some heavy
beating up, to see what happens.

--
Daniel

2002-08-31 22:29:22

by William Lee Irwin III

[permalink] [raw]
Subject: Re: [RFC] [PATCH] Include LRU in page count

On Sat, Aug 31, 2002 at 11:05:02PM +0200, Daniel Phillips wrote:
> The current patch seems satisfactory performance-wise and if it's
> also raceless as it's supposed to be, it gives us something that works,
> and we can evaluate alternatives at our leisure. Right now I'm afraid
> we have something that just works most of the time.
> I think we're getting to the point where this needs to get some heavy
> beating up, to see what happens.

It's not going to get much heavier than how I'm beating on it.
Although it seems my box is mighty bored during 64 simultaneous
tiobench 256's (i.e. 16384 tasks). I got bored & compiled a kernel:

make -j64 bzImage 304.60s user 848.70s system 694% cpu 2:46.05 total

The cpus are 95+% idle except for when I touch /proc/, where the task
fishing around /proc/ gets stuck spinning hard in the kernel for
anywhere from 30 minutes to several hours before killing it succeeds.
It didn't quite finish the run, as the tty deadlock happened again. The
VM doesn't appear to be oopsing, though I should slap on the OOM fixes.


Cheers,
Bill

2002-09-01 13:26:26

by Daniel Phillips

[permalink] [raw]
Subject: Re: [RFC] [PATCH] Include LRU in page count

On Sunday 01 September 2002 00:30, William Lee Irwin III wrote:
> On Sat, Aug 31, 2002 at 11:05:02PM +0200, Daniel Phillips wrote:
> > The current patch seems satisfactory performance-wise and if it's
> > also raceless as it's supposed to be, it gives us something that works,
> > and we can evaluate alternatives at our leisure. Right now I'm afraid
> > we have something that just works most of the time.
> > I think we're getting to the point where this needs to get some heavy
> > beating up, to see what happens.
>
> It's not going to get much heavier than how I'm beating on it.
> Although it seems my box is mighty bored during 64 simultaneous
> tiobench 256's (i.e. 16384 tasks). I got bored & compiled a kernel:
>
> make -j64 bzImage 304.60s user 848.70s system 694% cpu 2:46.05 total
>
> The cpus are 95+% idle except for when I touch /proc/, where the task
> fishing around /proc/ gets stuck spinning hard in the kernel for
> anywhere from 30 minutes to several hours before killing it succeeds.
> It didn't quite finish the run, as the tty deadlock happened again. The
> VM doesn't appear to be oopsing, though I should slap on the OOM fixes.

Is that good or bad? It sounds like the VM cycle is doing what it's
supposed to. I've been running lots_of_forks for about 5 hours and
things have reached a steady state with free memory slowly ratcheting
down to 1%, then, apparently in a sudden fit of scanning, jumping to 2.5%.
Pretty normal looking behaviour, though I don't much like the way the
free list refill is so sporadic. I can't think of any reason why that is
good. It doesn't cost a lot of cpu though. There's been a little bit of
(useless) swapping, so I guess I can consider that path at least somewhat
tested.

Every now and then, instead of free jumping up to 2.5% it jumps up to
over 10%. I doubt that's my fault. Even though I'm missing an if
(--nr_pages) at the orphan collection point, there aren't enough orphans
to cause that effect - about 3 million out of 12 billion lru_cache_dels.

The orphans appear during or shortly after the episodes of free list
filling, as you would expect, since page_cache_release fails the trylock
whenever shrink_cache is active. There are lots of parallel
page_cache_releases going on, but they don't seem to create any
noticeable number of orphans.

I think what's happening normally is this (two page_cache_release calls
on the same page; the more deeply indented lines are the second caller):

	if (page_count(page) == 2)			/* no, it's 3 */
	put_page(page);
			if (page_count(page) == 2)	/* now it is */
			if (PageLRU(page) && page_count(page) == 2)
				__lru_cache_del(page);

as opposed to this:

	if (page_count(page) == 2)			/* no, it's 3 */
			if (page_count(page) == 2)	/* no, still 3 */
	put_page(page);
			put_page(page);

which requires the two callers to hit the same 3 instruction window on the
same page. It's not going to happen often.
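
Put back into code form, the release path those interleavings come from looks
roughly like this (my sketch of the trylock flavour, reconstructed from the
checks above; whether __lru_cache_del() also drops the lru's own reference
here is an assumption):

	/*
	 * Sketch of page_cache_release() with the lru holding a counted
	 * reference: a page on the lru with no other users has count == 2.
	 * Losing the trylock is harmless - the page is simply left on the
	 * lru as an orphan (count == 1) for shrink_cache to collect.
	 */
	void page_cache_release(struct page *page)
	{
		if (page_count(page) == 2 && spin_trylock(&pagemap_lru_lock)) {
			/* recheck under the lock - the count may have moved */
			if (PageLRU(page) && page_count(page) == 2)
				__lru_cache_del(page);	/* assumed to drop the lru's reference */
			spin_unlock(&pagemap_lru_lock);
		}
		put_page(page);		/* drop the caller's reference */
	}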

Statm_pte_range looks like it's going to get even more confused about
what constitutes a shared page than it already is. The solution to
this is to examine the rmap.

--
Daniel

2002-09-01 21:25:31

by Daniel Phillips

[permalink] [raw]
Subject: Re: [RFC] [PATCH] Include LRU in page count

On Saturday 31 August 2002 22:26, Andrew Morton wrote:
> Daniel Phillips wrote:
> >
> > Manfred suggested an approach to de-racing this race using
> > atomic_dec_and_lock, which needs to be compared to the current approach.
>
> Could simplify things, but not all architectures have an optimised
> version. So ia64, mips, parisc and s390 would end up taking
> the lru lock on every page_cache_release.

I've put some more thought into this and I don't see any real problem with
the atomic_dec_and_lock strategy. The only efficiency drawback is the extra
locked cmpxchg on every page count dec, and that most likely tips the
efficiency advantage ever so slightly in favor of my current strategy.

Something else I like about the current strategy is the way the trylock
avoids contention - if shrink_cache is active it just leaves the page on the
lru for collection later. Sweeping up the orphans is efficient, since the
lru lock is batched by shrink_cache.
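
The sweep itself has the same shape as the refill_inactive hunk quoted
elsewhere in this thread. As a sketch of the inactive-list side (the function
name is made up, per-list counters are omitted, and the lru lock is assumed
to be held already by shrink_cache):

	/* Sketch: collect orphaned pages while scanning under pagemap_lru_lock. */
	static void sweep_lru_orphans(int max_scan)
	{
		while (max_scan-- && !list_empty(&inactive_list)) {
			struct page *page = list_entry(inactive_list.prev, struct page, lru);

			BUG_ON(!page_count(page));	/* zero-count pages never sit on the lru */
			if (unlikely(page_count(page) == 1)) {
				/* Orphan: only the lru's own reference is left. */
				BUG_ON(!TestClearPageLRU(page));
				list_del(&page->lru);
				put_page(page);		/* drops the lru reference and frees the page */
				continue;
			}
			/* Not an orphan: rotate it and let normal scanning deal with it. */
			list_del(&page->lru);
			list_add(&page->lru, &inactive_list);
		}
	}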

I think I may even be able to make this all work without holding the extra
count, but I'll treat that as a background project. The important thing is
that the page reaping cycle was never correct before; now it just might be.

I'm looking at your spinlock_irq now and thinking the _irq part could
possibly be avoided. Can you please remind me of the motivation for this -
was it originally intended to address the same race we've been working on
here?

I can see the advantage of being able to take the lru lock from interrupt
context, but we may be able to achieve the intended effect with a trylock.

--
Daniel

2002-09-01 21:53:58

by Andrew Morton

[permalink] [raw]
Subject: Re: [RFC] [PATCH] Include LRU in page count

Daniel Phillips wrote:
>
> ...
> I'm looking at your spinlock_irq now and thinking the _irq part could
> possibly be avoided. Can you please remind me of the motivation for this -
> was it originally intended to address the same race we've been working on
> here?
>

scalability, mainly. If the CPU holding the lock takes an interrupt,
all the other CPUs get to spin until the handler completes. I measured
a 30% reduction in contention from this.

Not a generally justifiable trick, but this is a heavily-used lock.
All the new games in refill_inactive() are there to minimise the
interrupt-off time.
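
The general shape of those games, as a sketch only (the real refill_inactive()
differs in detail; the batch size and helper name are invented, and the 2.4
lock name is used for concreteness):

	/*
	 * Sketch: bound the interrupt-off hold time on a hot lock by
	 * dropping and retaking it every few pages.
	 */
	#define LRU_IRQ_BATCH	32

	static void scan_active_list(int nr_pages)
	{
		int batch = LRU_IRQ_BATCH;

		spin_lock_irq(&pagemap_lru_lock);
		while (nr_pages-- && !list_empty(&active_list)) {
			struct page *page = list_entry(active_list.prev, struct page, lru);

			/* rotate; real code ages the page or moves it to the inactive list */
			list_del(&page->lru);
			list_add(&page->lru, &active_list);

			if (!--batch) {
				batch = LRU_IRQ_BATCH;
				/* let pending interrupts (and other CPUs) in */
				spin_unlock_irq(&pagemap_lru_lock);
				spin_lock_irq(&pagemap_lru_lock);
			}
		}
		spin_unlock_irq(&pagemap_lru_lock);
	}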

2002-09-01 22:23:54

by Daniel Phillips

[permalink] [raw]
Subject: Re: [RFC] [PATCH] Include LRU in page count

On Monday 02 September 2002 00:09, Andrew Morton wrote:
> Daniel Phillips wrote:
> >
> > ...
> > I'm looking at your spinlock_irq now and thinking the _irq part could
> > possibly be avoided. Can you please remind me of the motivation for this -
> > was it originally intended to address the same race we've been working on
> > here?
> >
>
> scalability, mainly. If the CPU holding the lock takes an interrupt,
> all the other CPUs get to spin until the handler completes. I measured
> a 30% reduction in contention from this.
>
> Not a generally justifiable trick, but this is a heavily-used lock.
> All the new games in refill_inactive() are there to minimise the
> interrupt-off time.

Per-cpu lru lists? Your solution is a lot simpler...

--
Daniel

2002-09-01 22:36:15

by Daniel Phillips

[permalink] [raw]
Subject: Re: [RFC] [PATCH] Include LRU in page count

On Monday 02 September 2002 00:09, Andrew Morton wrote:
> Daniel Phillips wrote:
> >
> > ...
> > I'm looking at your spinlock_irq now and thinking the _irq part could
> > possibly be avoided. Can you please remind me of the motivation for this -
> > was it originally intended to address the same race we've been working on
> > here?
>
> scalability, mainly. If the CPU holding the lock takes an interrupt,
> all the other CPUs get to spin until the handler completes. I measured
> a 30% reduction in contention from this.
>
> Not a generally justifiable trick, but this is a heavily-used lock.
> All the new games in refill_inactive() are there to minimise the
> interrupt-off time.

Ick. I hope you really chopped the lock hold time into itty-bitty pieces.

Note that I changed the spin_lock in page_cache_release to a trylock, maybe
it's worth checking out the effect on contention. With a little head
scratching we might be able to get rid of the spin_lock in lru_cache_add as
well. That leaves (I think) just the two big scan loops. I've always felt
it's silly to run more than one of either at the same time anyway.

--
Daniel

2002-09-01 22:52:18

by Andrew Morton

[permalink] [raw]
Subject: Re: [RFC] [PATCH] Include LRU in page count

Daniel Phillips wrote:
>
> On Monday 02 September 2002 00:09, Andrew Morton wrote:
> > Daniel Phillips wrote:
> > >
> > > ...
> > > I'm looking at your spinlock_irq now and thinking the _irq part could
> > > possibly be avoided. Can you please remind me of the motivation for this -
> > > was it originally intended to address the same race we've been working on
> > > here?
> >
> > scalability, mainly. If the CPU holding the lock takes an interrupt,
> > all the other CPUs get to spin until the handler completes. I measured
> > a 30% reduction in contention from this.
> >
> > Not a generally justifiable trick, but this is a heavily-used lock.
> > All the new games in refill_inactive() are there to minimise the
> > interrupt-off time.
>
> Ick. I hope you really chopped the lock hold time into itty-bitty pieces.

Max hold is around 7,000 cycles.

> Note that I changed the spin_lock in page_cache_release to a trylock, maybe
> it's worth checking out the effect on contention. With a little head
> scratching we might be able to get rid of the spin_lock in lru_cache_add as
> well. That leaves (I think) just the two big scan loops. I've always felt
> it's silly to run more than one of either at the same time anyway.

No way. Take a look at http://samba.org/~anton/linux/2.5.30/

That's 8-way power4, the workload is "dd from 7 disks
dd if=/dev/sd* of=/dev/null bs=1024k".

The CPU load in this situation was dominated by the VM. The LRU list and page
reclaim. Spending more CPU in lru_cache_add() than in copy_to_user() is
pretty gross.

I think it's under control now - I expect we're OK up to eight or sixteen
CPUs per node, but I'm still not happy with the level of testing. Most people
who have enough hardware to test this tend to have so much RAM that their
tests don't hit page reclaim.

slablru will probably change the picture too. There's a weird effect with
the traditional try_to_free_pages() code. The first thing it does is to run
kmem_cache_shrink(), which takes a semaphore and fruitlessly fiddles with the
slab; only then does it run page reclaim.

On the 4-way I measured 25% contention on the spinlock inside that semaphore.
What is happening is that the combination of the sleeping lock and the slab
operations effectively serialises entry into page reclaim.

slablru removes that little turnstile on entry to try_to_free_pages(), and we
will now see a lot more concurrency in shrink_foo().
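
Schematically, the turnstile looks like this (a sketch of the pattern only -
the semaphore and helper names are invented, not the real slab or reclaim
entry points):

	static DECLARE_MUTEX(slab_shrink_sem);		/* invented name */

	int old_style_try_to_free_pages(unsigned int gfp_mask)
	{
		/*
		 * Every caller funnels through the sleeping lock around a
		 * mostly fruitless slab shrink, so concurrent entries into
		 * page reclaim end up serialised before any of them reaches
		 * shrink_cache().
		 */
		down(&slab_shrink_sem);			/* the turnstile */
		shrink_all_slabs();			/* invented wrapper for the slab pass */
		up(&slab_shrink_sem);

		return run_page_reclaim(gfp_mask);	/* invented: the real reclaim work */
	}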

My approach was to keep the existing design and warm it up, rather than to
redesign. Is it "good enough" now? Dunno - nobody has run the tests
with slablru. But it's probably too late for a redesign (such as per-cpu LRU,
per-mapping lru, etc).

It would be great to make presence on the LRU contribute to page->count, because
that would permit the removal of a ton of page_cache_get/release operations inside
the LRU lock, perhaps doubling throughput in there. Guess I should get off my
lazy butt and see what you've done (will you for heaven's sake go and buy an IDE
disk and compile up a 2.5 kernel? :))

2002-09-01 23:28:48

by William Lee Irwin III

[permalink] [raw]
Subject: Re: [RFC] [PATCH] Include LRU in page count

On Sun, Sep 01, 2002 at 04:08:03PM -0700, Andrew Morton wrote:
> My approach was to keep the existing design and warm it up, rather than to
> redesign. Is it "good enough" now? Dunno - nobody has run the tests
> with slablru. But it's probably too late for a redesign (such as per-cpu LRU,
> per-mapping lru, etc).
>
> It would be great to make presence on the LRU contribute to
> page->count, because that would permit the removal of a ton of
> page_cache_get/release operations inside the LRU lock, perhaps
> doubling throughput in there. Guess I should get off my lazy butt
> and see what you've done (will you for heaven's sake go and buy an IDE
> disk and compile up a 2.5 kernel? :))


I've at least been testing (if not at least attempting to debug)
2.5.32-mm2, which includes slablru.


Cheers,
Bill

2002-09-01 23:35:06

by Daniel Phillips

[permalink] [raw]
Subject: Re: [RFC] [PATCH] Include LRU in page count

On Monday 02 September 2002 01:08, Andrew Morton wrote:
> Daniel Phillips wrote:
> It would be great to make presence on the LRU contribute to page->count, because
> that would permit the removal of a ton of page_cache_get/release operations inside
> the LRU lock, perhaps doubling throughput in there.

It's pretty much done, it just needs some torture testing:

http://people.nl.linux.org/~phillips/patches/lru.race-2.4.19

The handy /proc/mmstat ad hoc statistics are still in there. For the time being
the patch supports both the zero- and one-count flavors of lru contribution to
the page count, with the latter as the default.

> Guess I should get off my
> lazy butt and see what you've done (will you for heaven's sake go and buy an IDE
> disk and compile up a 2.5 kernel? :))

I was just thinking about that, time to go poke at the DAC960 again. Yes I do
remember Jens and Bill offered to help :-)

--
Daniel

2002-09-01 23:48:32

by Daniel Phillips

[permalink] [raw]
Subject: Re: [RFC] [PATCH] Include LRU in page count

On Monday 02 September 2002 01:08, Andrew Morton wrote:
> Daniel Phillips wrote:
> > Note that I changed the spin_lock in page_cache_release to a trylock, maybe
> > it's worth checking out the effect on contention. With a little head
> > scratching we might be able to get rid of the spin_lock in lru_cache_add as
> > well. That leaves (I think) just the two big scan loops. I've always felt
> > it's silly to run more than one of either at the same time anyway.
>
> No way. Take a look at http://samba.org/~anton/linux/2.5.30/
>
> That's 8-way power4, the workload is "dd from 7 disks
> dd if=/dev/sd* of=/dev/null bs=1024k".
>
> The CPU load in this situation was dominated by the VM. The LRU list and page
> reclaim. Spending more CPU in lru_cache_add() than in copy_to_user() is
> pretty gross.

Are we looking at the same thing? The cpu load there is dominated by cpu_idle,
89%. Anyway, if your point is that it makes sense to run shrink_cache or
refill_inactive in parallel, I don't see it because they'll serialize on the
lru lock anyway. What would make sense is to make shrink_cache nonblocking.

> My approach was to keep the existing design and warm it up, rather than to
> redesign.

Yup.

--
Daniel

2002-09-02 00:02:01

by Andrew Morton

[permalink] [raw]
Subject: Re: [RFC] [PATCH] Include LRU in page count

Daniel Phillips wrote:
>
> On Monday 02 September 2002 01:08, Andrew Morton wrote:
> > Daniel Phillips wrote:
> > > Note that I changed the spin_lock in page_cache_release to a trylock, maybe
> > > it's worth checking out the effect on contention. With a little head
> > > scratching we might be able to get rid of the spin_lock in lru_cache_add as
> > > well. That leaves (I think) just the two big scan loops. I've always felt
> > > it's silly to run more than one of either at the same time anyway.
> >
> > No way. Take a look at http://samba.org/~anton/linux/2.5.30/
> >
> > That's 8-way power4, the workload is "dd from 7 disks
> > dd if=/dev/sd* of=/dev/null bs=1024k".
> >
> > The CPU load in this situation was dominated by the VM. The LRU list and page
> > reclaim. Spending more CPU in lru_cache_add() than in copy_to_user() is
> > pretty gross.
>
> Are we looking at the same thing? The cpu load there is dominated by cpu_idle,
> 89%.

Apart from that, dummy ;) The absolute numbers are proportional
to IO bandwidth. And 7/8ths of a scsi disk per CPU isn't much.

> Anyway, if your point is that it makes sense to run shrink_cache or
> refill_inactive in parallel, I don't see it because they'll serialize on the
> lru lock anyway. What would make sense is to make shrink_cache nonblocking.

Well shrink_list() runs locklessly now. But there remains a significant
cost in shrink_cache(), a little of which will be due to the hopefully-removable
page_cache_get() inside the lock.

2002-09-02 00:45:37

by Daniel Phillips

[permalink] [raw]
Subject: Re: [RFC] [PATCH] Include LRU in page count

On Monday 02 September 2002 02:17, Andrew Morton wrote:
> Daniel Phillips wrote:
> >
> > On Monday 02 September 2002 01:08, Andrew Morton wrote:
> > > Daniel Phillips wrote:
> > > > Note that I changed the spin_lock in page_cache_release to a trylock, maybe
> > > > it's worth checking out the effect on contention. With a little head
> > > > scratching we might be able to get rid of the spin_lock in lru_cache_add as
> > > > well. That leaves (I think) just the two big scan loops. I've always felt
> > > > it's silly to run more than one of either at the same time anyway.
> > >
> > > No way. Take a look at http://samba.org/~anton/linux/2.5.30/
> > >
> > > That's 8-way power4, the workload is "dd from 7 disks
> > > dd if=/dev/sd* of=/dev/null bs=1024k".
> > >
> > > The CPU load in this situation was dominated by the VM. The LRU list and page
> > > reclaim. Spending more CPU in lru_cache_add() than in copy_to_user() is
> > > pretty gross.
> >
> > Are we looking at the same thing? The cpu load there is dominated by cpu_idle,
> > 89%.
>
> Apart from that, dummy ;) The absolute numbers are proportional
> to IO bandwidth. And 7/8ths of a scsi disk per CPU isn't much.

OK, fine, my point here is that there's no real excuse for contention on the
lru lock. Without departing radically from the current design the most you
can possibly spend on page reclaiming is 100% of one cpu, and we ought to
simply aim at getting the most bang for the buck out of that one cpu. It
follows that we want only one copy of shrink_cache or refill_inactive running
at a time, otherwise we're tossing away cycles spinning on locks and getting
nothing back.

If lru_cache_add is costly then we can kill both the lock contention and the
cacheline pingponging by feeding the new pages into a local list, then
grafting the whole list onto the inactive list when we start to scan. We
want to do something like that for nonsucky readahead as well.
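
As a sketch of the local-list idea (names and batch size are illustrative; the
counted lru reference, per-list counters and boot-time init of the per-CPU
lists are left out; Andrew notes below that 2.5.33 already does something
along these lines):

	#define LRU_ADD_BATCH	16

	static struct list_head lru_add_batch[NR_CPUS];	/* illustrative name */
	static int lru_add_count[NR_CPUS];

	void lru_cache_add(struct page *page)
	{
		/* assumes the caller can't be migrated off this cpu in here */
		int cpu = smp_processor_id();

		if (TestSetPageLRU(page))
			BUG();			/* already on an lru list */
		list_add(&page->lru, &lru_add_batch[cpu]);
		if (++lru_add_count[cpu] < LRU_ADD_BATCH)
			return;			/* common case: no lock taken at all */

		/* graft the whole batch onto the shared list in one go */
		spin_lock(&pagemap_lru_lock);
		list_splice(&lru_add_batch[cpu], &inactive_list);
		spin_unlock(&pagemap_lru_lock);
		INIT_LIST_HEAD(&lru_add_batch[cpu]);
		lru_add_count[cpu] = 0;
	}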

--
Daniel

2002-09-02 01:04:47

by Rik van Riel

[permalink] [raw]
Subject: Re: [RFC] [PATCH] Include LRU in page count

On Mon, 2 Sep 2002, Daniel Phillips wrote:

> Are we looking at the same thing? The cpu load there is dominated by
> cpu_idle, 89%. Anyway, if your point is that it makes sense to run
> shrink_cache or refill_inactive in parallel, I don't see it because
> they'll serialize on the lru lock anyway. What would make sense is to
> make shrink_cache nonblocking.

Working on it ;)

Rik
--
Bravely reimplemented by the knights who say "NIH".

http://www.surriel.com/ http://distro.conectiva.com/

2002-09-02 01:35:00

by Andrew Morton

[permalink] [raw]
Subject: Re: [RFC] [PATCH] Include LRU in page count

Daniel Phillips wrote:
>
> ...
> If lru_cache_add is costly then we can kill both the lock contention and the
> cacheline pingponging by feeding the new pages into a local list, then
> grafting the whole list onto the inactive list when we start to scan.

2.5.33 does this.

2002-09-02 17:18:53

by Christian Ehrhardt

[permalink] [raw]
Subject: Re: [RFC] [PATCH] Include LRU in page count

On Sat, Aug 31, 2002 at 09:47:29PM +0200, Daniel Phillips wrote:
> > Also there may be lru only pages on the active list, i.e. refill
> > inactive should have this hunk as well:
> >
> > > +#if LRU_PLUS_CACHE==2
> > > + BUG_ON(!page_count(page));
> > > + if (unlikely(page_count(page) == 1)) {
> > > + mmstat(vmscan_free_page);
> > > + BUG_ON(!TestClearPageLRU(page)); // side effect abuse!!
> > > + put_page(page);
> > > + continue;
> > > + }
> > > +#endif
>
> If we have orphans on the active list, we'd probably better just count
> them and figure out what we're doing wrong to put them there in the first
> place. In time they will migrate to the inactive list and get cleaned
> up.

Hm, think of your favourite memory hog being killed with lots of anon
pages on the active list while someone else holds the lru lock.
Won't all these anon pages legally end up orphaned on the active list
(due to the trylock in page_cache_release)?

regards Christian

--
THAT'S ALL FOLKS!

2002-09-02 18:16:35

by Daniel Phillips

[permalink] [raw]
Subject: Re: [RFC] [PATCH] Include LRU in page count

On Monday 02 September 2002 19:23, Christian Ehrhardt wrote:
> On Sat, Aug 31, 2002 at 09:47:29PM +0200, Daniel Phillips wrote:
> > > Also there may be lru only pages on the active list, i.e. refill
> > > inactive should have this hunk as well:
> > >
> > > > +#if LRU_PLUS_CACHE==2
> > > > + BUG_ON(!page_count(page));
> > > > + if (unlikely(page_count(page) == 1)) {
> > > > + mmstat(vmscan_free_page);
> > > > + BUG_ON(!TestClearPageLRU(page)); // side effect abuse!!
> > > > + put_page(page);
> > > > + continue;
> > > > + }
> > > > +#endif
> >
> > If we have orphans on the active list, we'd probably better just count
> > them and figure out what we're doing wrong to put them there in the first
> > place. In time they will migrate to the inactive list and get cleaned
> > up.
>
> Hm, think of your favourite memory hog being killed with lots of anon
> pages on the active list while someone else holds the lru lock.
> Won't all these anon pages legally end up orphaned on the active list
> (due to the trylock in page_cache_release)?

Some of them will, for one pass of refill_inactive. It seems hard to justify
saving a single pass through the active list executed in rare, pathological
circumstances, in return for adding even a simple test, executed commonly.

On a dual processor system, one of them should be scanning while the
other is oom_killing. It should work out fine.

--
Daniel

2002-09-05 04:36:02

by Daniel Phillips

[permalink] [raw]
Subject: [RFC] Alternative raceless page free

For completeness, I implemented the atomic_dec_and_lock version of raceless
page freeing suggested by Manfred Spraul. The atomic_dec_and_lock approach
eliminates the free race by ensuring that when a page's count drops to zero
the lru list lock is taken atomically, leaving no window where the page can
also be found and manipulated on the lru list.[1] Both this and the
extra-lru-count version are supported in the linked patch:

http://people.nl.linux.org/~phillips/patches/lru.race-2.4.19-2

The atomic_dec_and_lock version is slightly simpler, but was actually more
work to implement because of the need to locate and eliminate all uses of
page_cache_release where the lru lock is known to be held, as these will
deadlock. That had the side effect of eliminating a number of ifdefs vs the
lru count version, and rooting out some hidden redundancy.

The patch exposes __free_pages_ok, which must be called directly by the
atomic_dec_and_lock variant. In the process it got a less confusing name -
recover_pages. (The incumbent name is confusing because all the other 'free'
variants also manipulate the page count.)

It's a close call which version is faster. I suspect the atomic_dec_and_lock
version will not scale quite as well because of the bus-locked cmpxchg on the
page count (in the optimized version; the unoptimized version always takes the
spinlock), but neither version really lacks in the speed department.

I have a slight preference for the extra-lru-count version, because of the
trylock in page_cache_release. This means that nobody will have to spin when
shrink_cache is active. Instead, freed pages that collide with the lru lock
can just be left on the lru list to be picked up efficiently later. The
trylock also allows the lru lock to be acquired speculatively from interrupt
context, without a requirement that lru lock holders disable interrupts.
Both versions are provably correct, modulo implementation gaffes.

The linked patch defaults to the atomic_dec_and_lock version. To change to
the extra count version, define LRU_PLUS_CACHE as 2 instead of 1.

Christian, can you please run this one through your race detector?

[1] As a corollary, pages with zero count can never be found on the lru list,
so that is treated as a bug.

--
Daniel

2002-09-05 12:29:38

by Christian Ehrhardt

[permalink] [raw]
Subject: Re: [RFC] Alternative raceless page free

On Thu, Sep 05, 2002 at 06:42:12AM +0200, Daniel Phillips wrote:
> I have a slight preference for the extra-lru-count version, because of the
> trylock in page_cache_release. This means that nobody will have to spin when
> shrink_cache is active. Instead, freed pages that collide with the lru lock
> can just be left on the lru list to be picked up efficiently later. The
> trylock also allows the lru lock to be acquired speculatively from interrupt
> context, without a requirement that lru lock holders disable interrupts.
> Both versions are provably correct, modulo implementation gaffs.
>
> The linked patch defaults to atomic_dec_and_lock version. To change to
> the extra count version, define LRU_PLUS_CACHE as 2 instead of 1.
>
> Christian, can you please run this one through your race detector?

I don't see a problem in principle with this approach (except that there
may be architectures that don't have cmpxchg or similar). But this hunk

@@ -455,7 +458,7 @@
} else {
/* failed to drop the buffers so stop here */
UnlockPage(page);
- page_cache_release(page);
+ put_page(page);

spin_lock(&pagemap_lru_lock);
continue;

looks a bit suspicious. put_page is not allowed if the page is still
on the lru and there is no other reference to it. As we don't hold any
locks between UnlockPage and put_page there is no formal guarantee that
the above condition is met. I don't have another path that could race
with this one though and chances are that there actually is none.

regards Christian

--
THAT'S ALL FOLKS!

2002-09-05 15:36:54

by Daniel Phillips

[permalink] [raw]
Subject: Re: [RFC] Alternative raceless page free

On Thursday 05 September 2002 14:34, Christian Ehrhardt wrote:
> @@ -455,7 +458,7 @@
> } else {
> /* failed to drop the buffers so stop here */
> UnlockPage(page);
> - page_cache_release(page);
> + put_page(page);
>
> spin_lock(&pagemap_lru_lock);
> continue;
>
> looks a bit suspicious. put_page is not allowed if the page is still
> on the lru and there is no other reference to it. As we don't hold any
> locks between UnlockPage and put_page there is no formal guarantee that
> the above condition is met. I don't have another path that could race
> with this one though and chances are that there actually is none.

The corresponding get_page is just above, you must have overlooked it.

Besides that, we have a promise that the page still has buffers, worth
another count, and the page will never be freed here. That's fragile
though, and this particular piece of code can no doubt be considerably
simplified, while improving robustness and efficiency at the same time.
But that goes beyond the scope of this patch.

The whole try_to_free_buffers path is hairy, scary and disgusting. It
could use some critical analysis, if anybody has time.

--
Daniel

2002-09-05 15:59:58

by Christian Ehrhardt

[permalink] [raw]
Subject: Re: [RFC] Alternative raceless page free

On Thu, Sep 05, 2002 at 05:21:31PM +0200, Daniel Phillips wrote:
> On Thursday 05 September 2002 14:34, Christian Ehrhardt wrote:
> > @@ -455,7 +458,7 @@
> > } else {
> > /* failed to drop the buffers so stop here */
> > UnlockPage(page);
> > - page_cache_release(page);
> > + put_page(page);
> >
> > spin_lock(&pagemap_lru_lock);
> > continue;
> >
> > looks a bit suspicious. put_page is not allowed if the page is still
> > on the lru and there is no other reference to it. As we don't hold any
> > locks between UnlockPage and put_page there is no formal guarantee that
> > the above condition is met. I don't have another path that could race
> > with this one though and chances are that there actually is none.
>
> The corresponding get_page is just above, you must have overlooked it.

I see it. The scenario I'm thinking about is:

CPU1                                  CPU2
pick page from lru
Lock it
get_page
try_to_release_buffers fails
UnlockPage
/* Window here */
                                      Pick page from lru
                                      page_cache_get
                                      Lock it
                                      actually free the buffers
                                      UnlockPage
                                      page_cache_release
                                      doesn't remove the page from
                                      the lru because CPU 1 holds
                                      a reference
put_page drops the last reference
but doesn't remove the page from
the lru because it is put_page
and not page_cache_release.

---> Page gets freed while it is still on the lru. If the lru cache
holds a reference that's a non-issue.

CPU1 is the path in shrink_cache.
I admit that I don't have a real path that does what CPU2 does in
this scenario but doing what CPU2 does should be perfectly legal.

> Besides that, we have a promise that the page still has buffers, worth

Not after UnlockPage.

> another count, and the page will never be freed here. That's fragile
> though, and this particular piece of code can no doubt be considerably
> simplified, while improving robustness and efficiency at the same time.
> But that goes beyond the scope of this patch.

Well yes ;-) There are funny things going on, like accessing a page
after page_cache_release...

regards Christian

--
THAT'S ALL FOLKS!

2002-09-05 16:26:09

by Daniel Phillips

[permalink] [raw]
Subject: Re: [RFC] Alternative raceless page free

On Thursday 05 September 2002 18:04, Christian Ehrhardt wrote:
> On Thu, Sep 05, 2002 at 05:21:31PM +0200, Daniel Phillips wrote:
> > On Thursday 05 September 2002 14:34, Christian Ehrhardt wrote:
> > > @@ -455,7 +458,7 @@
> > > } else {
> > > /* failed to drop the buffers so stop here */
> > > UnlockPage(page);
> > > - page_cache_release(page);
> > > + put_page(page);
> > >
> > > spin_lock(&pagemap_lru_lock);
> > > continue;
> > >
> > > looks a bit suspicious. put_page is not allowed if the page is still
> > > on the lru and there is no other reference to it. As we don't hold any
> > > locks between UnlockPage and put_page there is no formal guarantee that
> > > the above condition is met. I don't have another path that could race
> > > with this one though and chances are that there actually is none.
> >
> > The corresponding get_page is just above, you must have overlooked it.

I misspoke. You're right, page_cache_release is the appropriate function
here, not put_page. Fixed, thanks.

--
Daniel

2002-09-05 16:47:04

by Daniel Phillips

[permalink] [raw]
Subject: Re: [RFC] Alternative raceless page free

On Thursday 05 September 2002 18:04, Christian Ehrhardt wrote:
> On Thu, Sep 05, 2002 at 05:21:31PM +0200, Daniel Phillips wrote:
> > ...this particular piece of code can no doubt be considerably
> > simplified, while improving robustness and efficiency at the same time.
> > But that goes beyond the scope of this patch.
>
> Well yes ;-) There's funny things going on like accessing a page
> after page_cache_release...

You're right about that one too, it should be put_page_nofree, since
freeing the page there is a bug for two reasons: !!page->mapping tells
us there should be a count on the page, and we continue to operate on
the page immediately below. Thanks again.

--
Daniel

2002-09-05 18:01:08

by Daniel Phillips

[permalink] [raw]
Subject: Re: [RFC] Alternative raceless page free, updated

Here's an updated patch with improved/corrected page count handling
in shrink_cache along the lines we discussed:

http://people.nl.linux.org/~phillips/patches/lru.race-2.4.19-3

--
Daniel