2006-01-18 10:40:32

by Nick Piggin

Subject: [patch 0/4] mm: de-skew page refcount

The following patchset (against 2.6.16-rc1 + migrate race fixes) uses the new
atomic ops to do away with the offset page refcounting and to simplify the
handling of the race that the offset was designed to cover.
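
Roughly, the "offset" convention being removed looks like this (a simplified
sketch of the before/after accessors, following the mm.h hunks in patch 1/4
below; today a free page sits at ->_count == -1, so every accessor skews by
one):

/* Old, skewed scheme: a free page has ->_count == -1. */
#define old_set_page_count(p, v)	atomic_set(&(p)->_count, (v) - 1)
#define old_page_count(p)		(atomic_read(&(p)->_count) + 1)
#define old_put_page_testzero(p)	atomic_add_negative(-1, &(p)->_count)
#define old_get_page_testone(p)		atomic_inc_and_test(&(p)->_count)

/* New, de-skewed scheme: ->_count holds the true reference count. */
#define new_set_page_count(p, v)	atomic_set(&(p)->_count, (v))
#define new_page_count(p)		atomic_read(&(p)->_count)
#define new_put_page_testzero(p)	atomic_dec_and_test(&(p)->_count)
#define new_get_page_unless_zero(p)	atomic_inc_not_zero(&(p)->_count)

The old_/new_ prefixes are only there to show the two versions side by side;
the real macros keep their usual names.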

This allows some nice optimisations, and in the page freeing path we end up
saving 2 atomic ops including a spin_lock_irqsave in the !PageLRU case, and 1
or 2 atomic ops in the PageLRU case.

Andrew's previous feedback has been incorporated (fewer BUG_ONs in fastpaths
and more detail in the changelogs).

Anyone spot any holes or races?

Thanks,
Nick


2006-01-18 10:40:43

by Nick Piggin

Subject: [patch 1/4] mm: page refcount use atomic primitives

The VM has an interesting race where a page refcount can drop to zero, but
it is still on the LRU lists for a short time. This was solved by testing
a 0->1 refcount transition when picking up pages from the LRU, and dropping
the refcount in that case.

Instead, use atomic_inc_not_zero to ensure we never pick up a 0 refcount
page from the LRU (ie. we guarantee that such a page will not be touched
by vmscan), so we need not hold ->lru_lock to stabilise PageLRU.

This ensures we can test PageLRU without taking the lru_lock, and allows
further optimisations (in later patches) -- we end up saving 2 atomic ops
including a spin_lock_irqsave in the !PageLRU case, and 1 or 2 atomic ops
in the PageLRU case.
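
The race being closed, as a rough interleaving sketch (CPU A is dropping the
last reference, CPU B is vmscan holding zone->lru_lock; the function names
match the patch below):

/*
 * CPU A (last put_page)                 CPU B (isolate_lru_pages)
 * -----------------------------------   -----------------------------------
 * put_page_testzero() -> count is 0
 *   (page is still on the LRU)
 *                                        page = lru_to_page(src);
 *                                        get_page_testone(page)
 *                                          -> sees the 0->1 transition, so
 *                                             must undo the reference and
 *                                             put the page back on the LRU
 * __page_cache_release():
 *   lock lru, remove from LRU, free
 *
 * With get_page_unless_zero(), CPU B's reference attempt simply fails once
 * the count has reached zero, so vmscan never touches a dying page at all.
 */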

Signed-off-by: Nick Piggin <[email protected]>

Index: linux-2.6/include/linux/mm.h
===================================================================
--- linux-2.6.orig/include/linux/mm.h
+++ linux-2.6/include/linux/mm.h
@@ -286,43 +286,44 @@ struct page {
*
* Also, many kernel routines increase the page count before a critical
* routine so they can be sure the page doesn't go away from under them.
- *
- * Since 2.6.6 (approx), a free page has ->_count = -1. This is so that we
- * can use atomic_add_negative(-1, page->_count) to detect when the page
- * becomes free and so that we can also use atomic_inc_and_test to atomically
- * detect when we just tried to grab a ref on a page which some other CPU has
- * already deemed to be freeable.
- *
- * NO code should make assumptions about this internal detail! Use the provided
- * macros which retain the old rules: page_count(page) == 0 is a free page.
*/

/*
* Drop a ref, return true if the logical refcount fell to zero (the page has
* no users)
*/
-#define put_page_testzero(p) \
- ({ \
- BUG_ON(page_count(p) == 0); \
- atomic_add_negative(-1, &(p)->_count); \
- })
+static inline int put_page_testzero(struct page *page)
+{
+ BUG_ON(atomic_read(&page->_count) == 0);
+ return atomic_dec_and_test(&page->_count);
+}

/*
* Grab a ref, return true if the page previously had a logical refcount of
* zero. ie: returns true if we just grabbed an already-deemed-to-be-free page
*/
-#define get_page_testone(p) atomic_inc_and_test(&(p)->_count)
+static inline int get_page_unless_zero(struct page *page)
+{
+ return atomic_inc_not_zero(&page->_count);
+}

-#define set_page_count(p,v) atomic_set(&(p)->_count, (v) - 1)
-#define __put_page(p) atomic_dec(&(p)->_count)
+static inline void set_page_count(struct page *page, int v)
+{
+ atomic_set(&page->_count, v);
+}
+
+static inline void __put_page(struct page *page)
+{
+ atomic_dec(&page->_count);
+}

extern void FASTCALL(__page_cache_release(struct page *));

static inline int page_count(struct page *page)
{
- if (PageCompound(page))
+ if (unlikely(PageCompound(page)))
page = (struct page *)page_private(page);
- return atomic_read(&page->_count) + 1;
+ return atomic_read(&page->_count);
}

static inline void get_page(struct page *page)
Index: linux-2.6/mm/vmscan.c
===================================================================
--- linux-2.6.orig/mm/vmscan.c
+++ linux-2.6/mm/vmscan.c
@@ -815,24 +815,20 @@ static int isolate_lru_pages(int nr_to_s
int scan = 0;

while (scan++ < nr_to_scan && !list_empty(src)) {
+ struct list_head *target;
page = lru_to_page(src);
prefetchw_prev_lru_page(page, src, flags);

- if (!TestClearPageLRU(page))
- BUG();
+ BUG_ON(!PageLRU(page));
list_del(&page->lru);
- if (get_page_testone(page)) {
- /*
- * It is being freed elsewhere
- */
- __put_page(page);
- SetPageLRU(page);
- list_add(&page->lru, src);
- continue;
- } else {
- list_add(&page->lru, dst);
+ target = src;
+ if (get_page_unless_zero(page)) {
+ ClearPageLRU(page);
+ target = dst;
nr_taken++;
- }
+ } /* else it is being freed elsewhere */
+
+ list_add(&page->lru, target);
}

*scanned = scan;
Index: linux-2.6/mm/swap.c
===================================================================
--- linux-2.6.orig/mm/swap.c
+++ linux-2.6/mm/swap.c
@@ -198,17 +198,18 @@ int lru_add_drain_all(void)
*/
void fastcall __page_cache_release(struct page *page)
{
- unsigned long flags;
- struct zone *zone = page_zone(page);
+ if (PageLRU(page)) {
+ unsigned long flags;

- spin_lock_irqsave(&zone->lru_lock, flags);
- if (TestClearPageLRU(page))
+ struct zone *zone = page_zone(page);
+ spin_lock_irqsave(&zone->lru_lock, flags);
+ if (!TestClearPageLRU(page))
+ BUG();
del_page_from_lru(zone, page);
- if (page_count(page) != 0)
- page = NULL;
- spin_unlock_irqrestore(&zone->lru_lock, flags);
- if (page)
- free_hot_page(page);
+ spin_unlock_irqrestore(&zone->lru_lock, flags);
+ }
+
+ free_hot_page(page);
}

EXPORT_SYMBOL(__page_cache_release);
@@ -234,27 +235,30 @@ void release_pages(struct page **pages,
pagevec_init(&pages_to_free, cold);
for (i = 0; i < nr; i++) {
struct page *page = pages[i];
- struct zone *pagezone;

if (!put_page_testzero(page))
continue;

- pagezone = page_zone(page);
- if (pagezone != zone) {
- if (zone)
- spin_unlock_irq(&zone->lru_lock);
- zone = pagezone;
- spin_lock_irq(&zone->lru_lock);
- }
- if (TestClearPageLRU(page))
+ if (PageLRU(page)) {
+ struct zone *pagezone = page_zone(page);
+ if (pagezone != zone) {
+ if (zone)
+ spin_unlock_irq(&zone->lru_lock);
+ zone = pagezone;
+ spin_lock_irq(&zone->lru_lock);
+ }
+ if (!TestClearPageLRU(page))
+ BUG();
del_page_from_lru(zone, page);
- if (page_count(page) == 0) {
- if (!pagevec_add(&pages_to_free, page)) {
+ }
+
+ if (!pagevec_add(&pages_to_free, page)) {
+ if (zone) {
spin_unlock_irq(&zone->lru_lock);
- __pagevec_free(&pages_to_free);
- pagevec_reinit(&pages_to_free);
- zone = NULL; /* No lock is held */
+ zone = NULL;
}
+ __pagevec_free(&pages_to_free);
+ pagevec_reinit(&pages_to_free);
}
}
if (zone)

2006-01-18 10:41:34

by Nick Piggin

Subject: [patch 4/4] mm: less atomic ops

In the page release paths, we can be sure that nobody will mess with our
page->flags because the refcount has dropped to 0. So no need for atomic
operations here.
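
For reference, the non-atomic bitop that the new __ClearPage* macros map to is
just a plain read-modify-write; a sketch of the generic C version (the real
implementations are per-arch):

static inline void __clear_bit(int nr, volatile unsigned long *addr)
{
	unsigned long mask = 1UL << (nr % BITS_PER_LONG);
	unsigned long *p = (unsigned long *)addr + nr / BITS_PER_LONG;

	/* No LOCK prefix / ll-sc loop: only safe while nobody else can
	 * touch page->flags, i.e. after the refcount has hit zero. */
	*p &= ~mask;
}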

Signed-off-by: Nick Piggin <[email protected]>

Index: linux-2.6/include/linux/page-flags.h
===================================================================
--- linux-2.6.orig/include/linux/page-flags.h
+++ linux-2.6/include/linux/page-flags.h
@@ -247,10 +247,12 @@ extern void __mod_page_state_offset(unsi
#define PageLRU(page) test_bit(PG_lru, &(page)->flags)
#define SetPageLRU(page) set_bit(PG_lru, &(page)->flags)
#define ClearPageLRU(page) clear_bit(PG_lru, &(page)->flags)
+#define __ClearPageLRU(page) __clear_bit(PG_lru, &(page)->flags)

#define PageActive(page) test_bit(PG_active, &(page)->flags)
#define SetPageActive(page) set_bit(PG_active, &(page)->flags)
#define ClearPageActive(page) clear_bit(PG_active, &(page)->flags)
+#define __ClearPageActive(page) __clear_bit(PG_active, &(page)->flags)

#define PageSlab(page) test_bit(PG_slab, &(page)->flags)
#define SetPageSlab(page) set_bit(PG_slab, &(page)->flags)
Index: linux-2.6/mm/swap.c
===================================================================
--- linux-2.6.orig/mm/swap.c
+++ linux-2.6/mm/swap.c
@@ -204,7 +204,7 @@ void fastcall __page_cache_release(struc
struct zone *zone = page_zone(page);
spin_lock_irqsave(&zone->lru_lock, flags);
BUG_ON(!PageLRU(page));
- ClearPageLRU(page);
+ __ClearPageLRU(page);
del_page_from_lru(zone, page);
spin_unlock_irqrestore(&zone->lru_lock, flags);
}
@@ -248,7 +248,7 @@ void release_pages(struct page **pages,
spin_lock_irq(&zone->lru_lock);
}
BUG_ON(!PageLRU(page));
- ClearPageLRU(page);
+ __ClearPageLRU(page);
del_page_from_lru(zone, page);
}

Index: linux-2.6/include/linux/mm_inline.h
===================================================================
--- linux-2.6.orig/include/linux/mm_inline.h
+++ linux-2.6/include/linux/mm_inline.h
@@ -32,7 +32,7 @@ del_page_from_lru(struct zone *zone, str
{
list_del(&page->lru);
if (PageActive(page)) {
- ClearPageActive(page);
+ __ClearPageActive(page);
zone->nr_active--;
} else {
zone->nr_inactive--;

2006-01-18 10:41:10

by Nick Piggin

Subject: [patch 3/3] mm: PageActive no testset

PG_active is protected by zone->lru_lock, it does not need TestSet/TestClear
operations.

Signed-off-by: Nick Piggin <[email protected]>

Index: linux-2.6/mm/vmscan.c
===================================================================
--- linux-2.6.orig/mm/vmscan.c
+++ linux-2.6/mm/vmscan.c
@@ -997,8 +997,9 @@ refill_inactive_zone(struct zone *zone,
prefetchw_prev_lru_page(page, &l_inactive, flags);
BUG_ON(PageLRU(page));
SetPageLRU(page);
- if (!TestClearPageActive(page))
- BUG();
+ BUG_ON(!PageActive(page));
+ ClearPageActive(page);
+
list_move(&page->lru, &zone->inactive_list);
pgmoved++;
if (!pagevec_add(&pvec, page)) {
Index: linux-2.6/mm/swap.c
===================================================================
--- linux-2.6.orig/mm/swap.c
+++ linux-2.6/mm/swap.c
@@ -356,8 +356,8 @@ void __pagevec_lru_add_active(struct pag
}
BUG_ON(PageLRU(page));
SetPageLRU(page);
- if (TestSetPageActive(page))
- BUG();
+ BUG_ON(PageActive(page));
+ SetPageActive(page);
add_page_to_active_list(zone, page);
}
if (zone)
Index: linux-2.6/include/linux/page-flags.h
===================================================================
--- linux-2.6.orig/include/linux/page-flags.h
+++ linux-2.6/include/linux/page-flags.h
@@ -251,8 +251,6 @@ extern void __mod_page_state_offset(unsi
#define PageActive(page) test_bit(PG_active, &(page)->flags)
#define SetPageActive(page) set_bit(PG_active, &(page)->flags)
#define ClearPageActive(page) clear_bit(PG_active, &(page)->flags)
-#define TestClearPageActive(page) test_and_clear_bit(PG_active, &(page)->flags)
-#define TestSetPageActive(page) test_and_set_bit(PG_active, &(page)->flags)

#define PageSlab(page) test_bit(PG_slab, &(page)->flags)
#define SetPageSlab(page) set_bit(PG_slab, &(page)->flags)

2006-01-18 10:40:54

by Nick Piggin

Subject: [patch 2/4] mm: PageLRU no testset

PG_lru is protected by zone->lru_lock. It does not need TestSet/TestClear
operations.

Signed-off-by: Nick Piggin <[email protected]>

Index: linux-2.6/mm/vmscan.c
===================================================================
--- linux-2.6.orig/mm/vmscan.c
+++ linux-2.6/mm/vmscan.c
@@ -775,9 +775,10 @@ int isolate_lru_page(struct page *page)
if (PageLRU(page)) {
struct zone *zone = page_zone(page);
spin_lock_irq(&zone->lru_lock);
- if (TestClearPageLRU(page)) {
+ if (PageLRU(page)) {
ret = 1;
get_page(page);
+ ClearPageLRU(page);
if (PageActive(page))
del_page_from_active_list(zone, page);
else
@@ -881,8 +882,8 @@ static void shrink_cache(struct zone *zo
*/
while (!list_empty(&page_list)) {
page = lru_to_page(&page_list);
- if (TestSetPageLRU(page))
- BUG();
+ BUG_ON(PageLRU(page));
+ SetPageLRU(page);
list_del(&page->lru);
if (PageActive(page))
add_page_to_active_list(zone, page);
@@ -994,8 +995,8 @@ refill_inactive_zone(struct zone *zone,
while (!list_empty(&l_inactive)) {
page = lru_to_page(&l_inactive);
prefetchw_prev_lru_page(page, &l_inactive, flags);
- if (TestSetPageLRU(page))
- BUG();
+ BUG_ON(PageLRU(page));
+ SetPageLRU(page);
if (!TestClearPageActive(page))
BUG();
list_move(&page->lru, &zone->inactive_list);
@@ -1023,8 +1024,8 @@ refill_inactive_zone(struct zone *zone,
while (!list_empty(&l_active)) {
page = lru_to_page(&l_active);
prefetchw_prev_lru_page(page, &l_active, flags);
- if (TestSetPageLRU(page))
- BUG();
+ BUG_ON(PageLRU(page));
+ SetPageLRU(page);
BUG_ON(!PageActive(page));
list_move(&page->lru, &zone->active_list);
pgmoved++;
Index: linux-2.6/mm/swap.c
===================================================================
--- linux-2.6.orig/mm/swap.c
+++ linux-2.6/mm/swap.c
@@ -203,8 +203,8 @@ void fastcall __page_cache_release(struc

struct zone *zone = page_zone(page);
spin_lock_irqsave(&zone->lru_lock, flags);
- if (!TestClearPageLRU(page))
- BUG();
+ BUG_ON(!PageLRU(page));
+ ClearPageLRU(page);
del_page_from_lru(zone, page);
spin_unlock_irqrestore(&zone->lru_lock, flags);
}
@@ -247,8 +247,8 @@ void release_pages(struct page **pages,
zone = pagezone;
spin_lock_irq(&zone->lru_lock);
}
- if (!TestClearPageLRU(page))
- BUG();
+ BUG_ON(!PageLRU(page));
+ ClearPageLRU(page);
del_page_from_lru(zone, page);
}

@@ -327,8 +327,8 @@ void __pagevec_lru_add(struct pagevec *p
zone = pagezone;
spin_lock_irq(&zone->lru_lock);
}
- if (TestSetPageLRU(page))
- BUG();
+ BUG_ON(PageLRU(page));
+ SetPageLRU(page);
add_page_to_inactive_list(zone, page);
}
if (zone)
@@ -354,8 +354,8 @@ void __pagevec_lru_add_active(struct pag
zone = pagezone;
spin_lock_irq(&zone->lru_lock);
}
- if (TestSetPageLRU(page))
- BUG();
+ BUG_ON(PageLRU(page));
+ SetPageLRU(page);
if (TestSetPageActive(page))
BUG();
add_page_to_active_list(zone, page);
Index: linux-2.6/include/linux/page-flags.h
===================================================================
--- linux-2.6.orig/include/linux/page-flags.h
+++ linux-2.6/include/linux/page-flags.h
@@ -244,10 +244,9 @@ extern void __mod_page_state_offset(unsi
#define __ClearPageDirty(page) __clear_bit(PG_dirty, &(page)->flags)
#define TestClearPageDirty(page) test_and_clear_bit(PG_dirty, &(page)->flags)

-#define SetPageLRU(page) set_bit(PG_lru, &(page)->flags)
#define PageLRU(page) test_bit(PG_lru, &(page)->flags)
-#define TestSetPageLRU(page) test_and_set_bit(PG_lru, &(page)->flags)
-#define TestClearPageLRU(page) test_and_clear_bit(PG_lru, &(page)->flags)
+#define SetPageLRU(page) set_bit(PG_lru, &(page)->flags)
+#define ClearPageLRU(page) clear_bit(PG_lru, &(page)->flags)

#define PageActive(page) test_bit(PG_active, &(page)->flags)
#define SetPageActive(page) set_bit(PG_active, &(page)->flags)

2006-01-18 16:13:53

by Marcelo Tosatti

Subject: Re: [patch 3/3] mm: PageActive no testset

Hi Nick,

On Wed, Jan 18, 2006 at 11:40:58AM +0100, Nick Piggin wrote:
> PG_active is protected by zone->lru_lock, it does not need TestSet/TestClear
> operations.

page->flags bits (including PG_active and PG_lru bits) are touched by
several codepaths which do not hold zone->lru_lock.

AFAICT zone->lru_lock guards access to the LRU list, and no more than
that.

Moreover, what about consistency of the rest of page->flags bits?

PPC for example implements test_and_set_bit() with:

lwarx reg, addr (load and create reservation for 32-bit addr)
or reg, BITOP_MASK(nr)
stwcx reg, addr (store word upon reservation validation, otherwise loop)

If you don't use atomic operations on page->flags, unrelated bits other
than the one you're working with can have their updates lost, given that
in reality page->flags is not protected by the lru_lock.
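
Concretely, the lost update would look like this (a sketch of the
interleaving; PG_foo stands in for any unrelated flag):

/*
 * CPU A: __ClearPageActive(page)        CPU B: SetPageFoo(page), atomic
 * -----------------------------------   -----------------------------------
 * load page->flags into a register
 *                                        atomically set PG_foo in memory
 * clear PG_active in the register
 * store the register back          <---- silently undoes CPU B's update
 */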

For example:

/*
 * shrink_list adds the number of reclaimed pages to sc->nr_reclaimed
 */
static int shrink_list(struct list_head *page_list, struct scan_control *sc)
{
	...
	BUG_ON(PageActive(page));
	...
activate_locked:
	SetPageActive(page);
	pgactivate++;
keep_locked:
	unlock_page(page);
keep:
	list_add(&page->lru, &ret_pages);
	BUG_ON(PageLRU(page));
}

And recently:

#ifdef CONFIG_MIGRATION
static inline void move_to_lru(struct page *page)
{
	list_del(&page->lru);
	if (PageActive(page)) {
		/*
		 * lru_cache_add_active checks that
		 * the PG_active bit is off.
		 */
		ClearPageActive(page);
		lru_cache_add_active(page);

Not relying on zone->lru_lock allows interesting optimizations
such as moving active/inactive pgflag bit setting from inside
__pagevec_lru_add/__pagevec_lru_add_active to the caller, and merging
the two.

Comments?


> Signed-off-by: Nick Piggin <[email protected]>
>
> Index: linux-2.6/mm/vmscan.c
> ===================================================================
> --- linux-2.6.orig/mm/vmscan.c
> +++ linux-2.6/mm/vmscan.c
> @@ -997,8 +997,9 @@ refill_inactive_zone(struct zone *zone,
> prefetchw_prev_lru_page(page, &l_inactive, flags);
> BUG_ON(PageLRU(page));
> SetPageLRU(page);
> - if (!TestClearPageActive(page))
> - BUG();
> + BUG_ON(!PageActive(page));
> + ClearPageActive(page);
> +
> list_move(&page->lru, &zone->inactive_list);
> pgmoved++;
> if (!pagevec_add(&pvec, page)) {
> Index: linux-2.6/mm/swap.c
> ===================================================================
> --- linux-2.6.orig/mm/swap.c
> +++ linux-2.6/mm/swap.c
> @@ -356,8 +356,8 @@ void __pagevec_lru_add_active(struct pag
> }
> BUG_ON(PageLRU(page));
> SetPageLRU(page);
> - if (TestSetPageActive(page))
> - BUG();
> + BUG_ON(PageActive(page));
> + SetPageActive(page);
> add_page_to_active_list(zone, page);
> }
> if (zone)
> Index: linux-2.6/include/linux/page-flags.h
> ===================================================================
> --- linux-2.6.orig/include/linux/page-flags.h
> +++ linux-2.6/include/linux/page-flags.h
> @@ -251,8 +251,6 @@ extern void __mod_page_state_offset(unsi
> #define PageActive(page) test_bit(PG_active, &(page)->flags)
> #define SetPageActive(page) set_bit(PG_active, &(page)->flags)
> #define ClearPageActive(page) clear_bit(PG_active, &(page)->flags)
> -#define TestClearPageActive(page) test_and_clear_bit(PG_active, &(page)->flags)
> -#define TestSetPageActive(page) test_and_set_bit(PG_active, &(page)->flags)
>
> #define PageSlab(page) test_bit(PG_slab, &(page)->flags)
> #define SetPageSlab(page) set_bit(PG_slab, &(page)->flags)
>

2006-01-18 16:38:55

by Linus Torvalds

Subject: Re: [patch 0/4] mm: de-skew page refcount



On Wed, 18 Jan 2006, Nick Piggin wrote:
>
> The following patchset (against 2.6.16-rc1 + migrate race fixes) uses the new
> atomic ops to do away with the offset page refcounting, and simplify the race
> that it was designed to cover.
>
> This allows some nice optimisations

Why?

The real downside is that "atomic_inc_not_zero()" is a lot more expensive
than checking for zero on x86 (and x86-64).

The reason it's offset is that on architectures that automatically test
the _result_ of an atomic op (ie x86[-64]), it's easy to see when
something _becomes_ negative or _becomes_ zero, and that's what

atomic_add_negative
atomic_inc_and_test

are optimized for (there's also "atomic_dec_and_test()" which reacts on
the count becoming zero, but that doesn't have a pairing: there's no way
to react to the count becoming one for the increment operation, so the
"atomic_dec_and_test()" is used for things where zero means "free it").

Nothing else can be done that fast on x86. Everything else requires an
insane "load, update, cmpxchg" sequence.

So I disagree with this patch series. It has real downsides. There's a
reason we have the offset.

I suspect that whatever "nice optimizations" you have are quite doable
without doing this count pessimization.

Linus

2006-01-18 17:06:08

by Nick Piggin

Subject: Re: [patch 0/4] mm: de-skew page refcount

On Wed, Jan 18, 2006 at 08:38:44AM -0800, Linus Torvalds wrote:
>
>
> On Wed, 18 Jan 2006, Nick Piggin wrote:
> >
> > The following patchset (against 2.6.16-rc1 + migrate race fixes) uses the new
> > atomic ops to do away with the offset page refcounting, and simplify the race
> > that it was designed to cover.
> >
> > This allows some nice optimisations
>
> Why?
>
> The real downside is that "atomic_inc_not_zero()" is a lot more expensive
> than checking for zero on x86 (and x86-64).
>
> The reason it's offset is that on architectures that automatically test
> the _result_ of an atomic op (ie x86[-64]), it's easy to see when
> something _becomes_ negative or _becomes_ zero, and that's what
>
> atomic_add_negative
> atomic_inc_and_test
>
> are optimized for (there's also "atomic_dec_and_test()" which reacts on
> the count becoming zero, but that doesn't have a pairing: there's no way
> to react to the count becoming one for the increment operation, so the
> "atomic_dec_and_test()" is used for things where zero means "free it").
>
> Nothing else can be done that fast on x86. Everything else requires an
> insane "load, update, cmpxchg" sequence.
>

Yes, I realise inc_not_zero isn't as fast as dec_and_test on x86(-64).
In this case, when the cacheline will already be exclusive, I bet it isn't
that much of a difference (it was in the noise in my testing).

> So I disagree with this patch series. It has real downsides. There's a
> reason we have the offset.
>

Yes, there is a reason, I detailed it in the changelog and got rid of it.

> "nice optimizations"

They're in this patchset.

Nick

2006-01-18 19:27:23

by Linus Torvalds

Subject: Re: [patch 0/4] mm: de-skew page refcount



On Wed, 18 Jan 2006, Nick Piggin wrote:
>
> > So I disagree with this patch series. It has real downsides. There's a
> > reason we have the offset.
>
> Yes, there is a reason, I detailed it in the changelog and got rid of it.

And I'm not applying it. I'd be crazy to replace good code by code that is
objectively _worse_.

The fact that you _document_ that it's worse doesn't make it any better.

The places that you improve (in the other patches) seem to have nothing at
all to do with the counter skew issue, so I don't see the point.

So let me repeat: WHY DID YOU MAKE THE CODE WORSE?

Linus

2006-01-19 14:00:47

by Nick Piggin

Subject: Re: [patch 0/4] mm: de-skew page refcount

On Wed, Jan 18, 2006 at 11:27:13AM -0800, Linus Torvalds wrote:
>
>
> On Wed, 18 Jan 2006, Nick Piggin wrote:
> >
> > > So I disagree with this patch series. It has real downsides. There's a
> > > reason we have the offset.
> >
> > Yes, there is a reason, I detailed it in the changelog and got rid of it.
>
> And I'm not applying it. I'd be crazy to replace good code by code that is
> objectively _worse_.
>

And you're not? Damn.

> The fact that you _document_ that it's worse doesn't make it any better.
>
> The places that you improve (in the other patches) seem to have nothing at
> all to do with the counter skew issue, so I don't see the point.
>

You know, I believe you're right. I needed the de-skewing patch for
something unrelated and it seemed that it opened the possibility for
the following optimisations (ie. because we no longer touch a page
after its refcount goes to zero).

But actually it doesn't matter that we might touch page_count, only
that we not clear PageLRU. So the enabler is simply moving the
TestClearPageLRU after the get_page_testone.

So I'll respin the patches without the de-skewing and the series
will become much smaller and neater.

> So let me repeat: WHY DID YOU MAKE THE CODE WORSE?
>

You've never bothered me about that until now...

Thanks for the feedback!

Nick

2006-01-19 14:50:18

by Nick Piggin

Subject: Re: [patch 3/3] mm: PageActive no testset

Hi Marcelo,

On Wed, Jan 18, 2006 at 12:13:46PM -0200, Marcelo Tosatti wrote:
> Hi Nick,
>
> On Wed, Jan 18, 2006 at 11:40:58AM +0100, Nick Piggin wrote:
> > PG_active is protected by zone->lru_lock, it does not need TestSet/TestClear
> > operations.
>
> page->flags bits (including PG_active and PG_lru bits) are touched by
> several codepaths which do not hold zone->lru_lock.
>
> AFAICT zone->lru_lock guards access to the LRU list, and no more than
> that.
>

Yep.

> Moreover, what about consistency of the rest of page->flags bits?
>

That's OK, set_bit and clear_bit are atomic as well, they just don't
imply memory barriers and can be implemented a bit more simply.

The test-set / test-clear operations also kind of imply that it is
being used for locking or without other synchronisation (usually).

> PPC for example implements test_and_set_bit() with:
>
> lwarx reg, addr (load and create reservation for 32-bit addr)
> or reg, BITOP_MASK(nr)
> stwcx reg, addr (store word upon reservation validation, otherwise loop)
>

Thanks,
Nick

2006-01-19 16:36:59

by Linus Torvalds

Subject: Re: [patch 0/4] mm: de-skew page refcount



On Thu, 19 Jan 2006, Nick Piggin wrote:
>
> But actually it doesn't matter that we might touch page_count, only
> that we not clear PageLRU. So the enabler is simply moving the
> TestClearPageLRU after the get_page_testone.

One side note on your patch: the pure bit _test_ operation is very cheap,
but the "bit change" operation is very expensive (and not really any less
expensive than the "test-and-change" one).

So the patch to avoid "test_and_clear_bit()" really helps only if the test
usually results in us not doing the clear. Is that the case? Hmm..

So I _think_ that at least in the case of "isolate_lru_page()", you'd
actually be better off doing the "test-and-clear" instead of separate
"test" and "clear-bit" ops, no? In that one, it would seem that 99+% of
the time, the bit is set (because we tested it just before getting the
lock).

No?

> I needed the de-skewing patch for something unrelated and it seemed that
> it opened the possibility for the following optimisations (ie. because
> we no longer touch a page after its refcount goes to zero).
>
> But actually it doesn't matter that we might touch page_count, only
> that we not clear PageLRU. So the enabler is simply moving the
> TestClearPageLRU after the get_page_testone.

Yes.

Now, that whole "we might touch the page count" thing does actually worry
me a bit. The locking rules are subtle (but they -seem- safe: before we
actually really put the page on the free-list in the freeing path, we'll
have locked the LRU list if it was on one).

But if you were to change _that_ one to a

atomic_add_unless(&page->counter, 1, -1);

I think that would be a real cleanup. And at that point I won't even
complain that "atomic_inc_test()" is faster - that "get_page_testone()"
thing is just fundamentally a bit scary, so I'd applaud it regardless.

(The difference: the "counter skewing" may be unexpected, but it's just a
simple trick. In contrast, the "touch the count after the page may be
already in the freeing stage" is a scary subtle thing. Even if I can't
see any actual bug in it, it just worries me in a way that offsetting a
counter by one does not..)

Linus

2006-01-19 17:07:04

by Nick Piggin

Subject: Re: [patch 0/4] mm: de-skew page refcount

On Thu, Jan 19, 2006 at 08:36:14AM -0800, Linus Torvalds wrote:
>
> So I _think_ that at least the case in "isolate_lru_page()", you'd
> actually be better off doing the "test-and-clear" instead of separate
> "test" and "clear-bit" ops, no? In that one, it would seem that 99+% of
> the time, the bit is set (because we tested it just before getting the
> lock).
>
> No?
>

Well in isolate_lru_page, the test operation is actually optional
(ie. it is the conditional for a BUG). And I have plans for making
some of those configurable....

But at least on the G5, test_and_clear can be noticeable (although
IIRC it was in the noise for _this_ particular case) because of the
memory barriers required.

>
> Now, that whole "we might touch the page count" thing does actually worry
> me a bit. The locking rules are subtle (but they -seem- safe: before we
> actually really put the page on the free-list in the freeing path, we'll
> have locked the LRU list if it was on one).
>

Yes, I think Andrew did his homework. I thought it through quite a bit
before sending the patches and again after your feedback. Subtle though,
no doubt.

> But if you were to change _that_ one to a
>
> atomic_add_unless(&page->counter, 1, -1);
>
> I think that would be a real cleanup. And at that point I won't even
> complain that "atomic_inc_and_test()" is faster - that "get_page_testone()"
> thing is just fundamentally a bit scary, so I'd applaud it regardless.
>

Hmm... this is what the de-skew patch _did_ (although it was wrapped
in a function called get_page_unless_zero), in fact the main aim was
to prevent this twiddling and the de-skewing was just a nice side effect
(I guess the patch title is misleading).

So I'm confused...

> (The difference: the "counter skewing" may be unexpected, but it's just a
> simple trick. In contrast, the "touch the count after the page may be
> already in the freeing stage" is a scary subtle thing. Even if I can't
> see any actual bug in it, it just worries me in a way that offsetting a
> counter by one does not..)
>

Don't worry, you'll be seeing that patch again -- it is required for
lockless pagecache.

Nick

2006-01-19 17:27:54

by Linus Torvalds

Subject: Re: [patch 0/4] mm: de-skew page refcount



On Thu, 19 Jan 2006, Nick Piggin wrote:
>
> Hmm... this is what the de-skew patch _did_ (although it was wrapped
> in a function called get_page_unless_zero), in fact the main aim was
> to prevent this twiddling and the de-skewing was just a nice side effect
> (I guess the patch title is misleading).
>
> So I'm confused...

The thing I minded was the _other_ changes, namely the de-skewing itself.
It seemed totally unnecessary to what you claimed was the point of the
patch.

So I objected to the patch on the grounds that it did what you claimed
badly. All the _optimization_ was totally independent of that de-skewing,
and the de-skewing was a potential un-optimization.

But if you do the optimizations as one independent set of patches, and
_then_ do the counter thing as a "simplify logic" patch, I don't see that
as a problem.

Side note: I may be crazy, but for me when merging, one of the biggest
things is "does this pass my 'makes sense' detector". I look less at the
end result, than I actually look at the _change_. See?

That's why two separate patches that do the same thing as one combined
patch may make sense, even if the _combined_ one does not (it could go the
other way too, obviously).

Linus

2006-01-19 17:38:26

by Nick Piggin

Subject: Re: [patch 0/4] mm: de-skew page refcount

On Thu, Jan 19, 2006 at 09:27:14AM -0800, Linus Torvalds wrote:
>
>
> On Thu, 19 Jan 2006, Nick Piggin wrote:
> >
> > Hmm... this is what the de-skew patch _did_ (although it was wrapped
> > in a function called get_page_unless_zero), in fact the main aim was
> > to prevent this twiddling and the de-skewing was just a nice side effect
> > (I guess the patch title is misleading).
> >
> > So I'm confused...
>
> The thing I minded was the _other_ changes, namely the de-skewing itself.
> It seemed totally unnecessary to what you claimed was the point of the
> patch.
>
> So I objected to the patch on the grounds that it did what you claimed
> badly. All the _optimization_ was totally independent of that de-skewing,
> and the de-skewing was a potential un-optimization.
>

No longer confused...

> But if you do the optimizations as one independent set of patches, and
> _then_ do the counter thing as a "simplify logic" patch, I don't see that
> as a problem.
>
> Side note: I may be crazy, but for me when merging, one of the biggest
> things is "does this pass my 'makes sense' detector". I look less at the
> end result, than I actually look at the _change_. See?
>
> That's why two separate patches that do the same thing as one combined
> patch may make sense, even if the _combined_ one does not (it could go the
> other way too, obviously).
>

I agree, and the patches really are cleaner this way too, so again,
thanks for the input on them.

I'll resend soonish (with a trimmed cc list).

Nick

2006-01-19 17:54:12

by Nikita Danilov

Subject: Re: [patch 2/4] mm: PageLRU no testset

Nick Piggin writes:
> PG_lru is protected by zone->lru_lock. It does not need TestSet/TestClear
> operations.
>
> Signed-off-by: Nick Piggin <[email protected]>
>
> Index: linux-2.6/mm/vmscan.c
> ===================================================================
> --- linux-2.6.orig/mm/vmscan.c
> +++ linux-2.6/mm/vmscan.c
> @@ -775,9 +775,10 @@ int isolate_lru_page(struct page *page)
> if (PageLRU(page)) {
> struct zone *zone = page_zone(page);
> spin_lock_irq(&zone->lru_lock);
> - if (TestClearPageLRU(page)) {
> + if (PageLRU(page)) {
> ret = 1;
> get_page(page);
> + ClearPageLRU(page);

Why is that better? ClearPageLRU() is also an "atomic" operation (in the
sense of using LOCK_PREFIX on x86), so it seems this change strictly
adds cycles to the hot path when the page is on the LRU.

Nikita.

2006-01-19 18:10:22

by Nick Piggin

Subject: Re: [patch 2/4] mm: PageLRU no testset

On Thu, Jan 19, 2006 at 08:48:52PM +0300, Nikita Danilov wrote:
> Nick Piggin writes:
> > PG_lru is protected by zone->lru_lock. It does not need TestSet/TestClear
> > operations.
> >
> > Signed-off-by: Nick Piggin <[email protected]>
> >
> > Index: linux-2.6/mm/vmscan.c
> > ===================================================================
> > --- linux-2.6.orig/mm/vmscan.c
> > +++ linux-2.6/mm/vmscan.c
> > @@ -775,9 +775,10 @@ int isolate_lru_page(struct page *page)
> > if (PageLRU(page)) {
> > struct zone *zone = page_zone(page);
> > spin_lock_irq(&zone->lru_lock);
> > - if (TestClearPageLRU(page)) {
> > + if (PageLRU(page)) {
> > ret = 1;
> > get_page(page);
> > + ClearPageLRU(page);
>
> Why is that better? ClearPageLRU() is also an "atomic" operation (in the
> sense of using LOCK_PREFIX on x86), so it seems this change strictly
> adds cycles to the hot path when the page is on the LRU.
>

Less restrictive memory ordering requirements.
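
Roughly: the kernel's atomic-bitops rules require the value-returning
operations such as test_and_clear_bit() to behave as full memory barriers,
while plain set_bit()/clear_bit() only have to be atomic -- a sketch of the
difference:

/*
 *   test_and_clear_bit(nr, addr):
 *	smp_mb();			barrier before
 *	old = atomic RMW clearing nr	LOCK / ll-sc sequence
 *	smp_mb();			barrier after
 *	return old bit value;
 *
 *   clear_bit(nr, addr):
 *	atomic RMW clearing nr		no implied barriers
 *
 * On x86 both end up as a single LOCK'ed instruction, so the saving shows
 * up mainly on architectures with weaker memory ordering.
 */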

Nick

2006-01-19 18:53:29

by Marcelo Tosatti

Subject: Re: [patch 3/3] mm: PageActive no testset

On Thu, Jan 19, 2006 at 03:50:08PM +0100, Nick Piggin wrote:
> Hi Marcelo,
>
> On Wed, Jan 18, 2006 at 12:13:46PM -0200, Marcelo Tosatti wrote:
> > Hi Nick,
> >
> > On Wed, Jan 18, 2006 at 11:40:58AM +0100, Nick Piggin wrote:
> > > PG_active is protected by zone->lru_lock, it does not need TestSet/TestClear
> > > operations.
> >
> > page->flags bits (including PG_active and PG_lru bits) are touched by
> > several codepaths which do not hold zone->lru_lock.
> >
> > AFAICT zone->lru_lock guards access to the LRU list, and no more than
> > that.
> >
>
> Yep.
>
> > Moreover, what about consistency of the rest of page->flags bits?
> >
>
> That's OK, set_bit and clear_bit are atomic as well, they just don't
> imply memory barriers and can be implemented a bit more simply.

Indeed!

> The test-set / test-clear operations also kind of imply that it is
> being used for locking or without other synchronisation (usually).

Non-atomic versions such as __ClearPageLRU()/__ClearPageActive() are
not usable, though.

PPC:

static __inline__ void __clear_bit(unsigned long nr,
				   volatile unsigned long *addr)
{
	unsigned long mask = BITOP_MASK(nr);
	unsigned long *p = ((unsigned long *)addr) + BITOP_WORD(nr);

	*p &= ~mask;
}

2006-01-19 20:02:33

by Nick Piggin

Subject: Re: [patch 3/3] mm: PageActive no testset

On Thu, Jan 19, 2006 at 02:52:22PM -0200, Marcelo Tosatti wrote:
> On Thu, Jan 19, 2006 at 03:50:08PM +0100, Nick Piggin wrote:
>
> > The test-set / test-clear operations also kind of imply that it is
> > being used for locking or without other synchronisation (usually).
>
> Non-atomic versions such as __ClearPageLRU()/__ClearPageActive() are
> not usable, though.
>

Correct. Although I was able to use them in a couple of other places
in a subsequent patch in the series. I trust you don't see a problem
with those usages?

Thanks,
Nick

2006-01-19 23:42:00

by Marcelo Tosatti

Subject: Re: [patch 3/3] mm: PageActive no testset

On Thu, Jan 19, 2006 at 09:02:26PM +0100, Nick Piggin wrote:
> On Thu, Jan 19, 2006 at 02:52:22PM -0200, Marcelo Tosatti wrote:
> > On Thu, Jan 19, 2006 at 03:50:08PM +0100, Nick Piggin wrote:
> >
> > > The test-set / test-clear operations also kind of imply that it is
> > > being used for locking or without other synchronisation (usually).
> >
> > Non-atomic versions such as __ClearPageLRU()/__ClearPageActive() are
> > not usable, though.
> >
>
> Correct. Although I was able to use them in a couple of other places
> in a subsequent patch in the series. I trust you don't see a problem
> with those usages?

Indeed, sorry. Would you mind adding a comment that page->flags must
otherwise be accessed atomically, and that the __ versions are special in
that they are only safe once the page cannot be referenced anymore? (it's
really not obvious)

Also, these comments on top of page-flags.h could be updated:

* During disk I/O, PG_locked is used. This bit is set before I/O and
* reset when I/O completes. page_waitqueue(page) is a wait queue of all tasks
* waiting for the I/O on this page to complete.

s/PG_locked/PG_writeback/

* Note that the referenced bit, the page->lru list_head and the active,
* inactive_dirty and inactive_clean lists are protected by the
* zone->lru_lock, and *NOT* by the usual PG_locked bit!

inactive_dirty and inactive_clean do not exist anymore