According to the swapon documentation
Swap pages are allocated from areas in priority order,
highest priority first. For areas with different priorities, a
higher-priority area is exhausted before using a lower-priority area.
A user reported that the reality is different. When multiple swap files
are enabled and a memory consumer is started, the swap files are consumed in
pairs after the highest-priority file is exhausted. Early in the lifetime
of the test, swap file consumption looks like:
Filename Type Size Used Priority
/testswap1 file 100004 100004 8
/testswap2 file 100004 23764 7
/testswap3 file 100004 23764 6
/testswap4 file 100004 0 5
/testswap5 file 100004 0 4
/testswap6 file 100004 0 3
/testswap7 file 100004 0 2
/testswap8 file 100004 0 1
This patch fixes the swap_list search in get_swap_page() to use the swap files
in the correct order. When applied, swap file consumption looks like:
Filename Type Size Used Priority
/testswap1 file 100004 100004 8
/testswap2 file 100004 100004 7
/testswap3 file 100004 29372 6
/testswap4 file 100004 0 5
/testswap5 file 100004 0 4
/testswap6 file 100004 0 3
/testswap7 file 100004 0 2
/testswap8 file 100004 0 1
Signed-off-by: Mel Gorman <[email protected]>
---
mm/swapfile.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/mm/swapfile.c b/mm/swapfile.c
index 4a7f7e6..6d0ac2b 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -651,7 +651,7 @@ swp_entry_t get_swap_page(void)
goto noswap;
atomic_long_dec(&nr_swap_pages);
- for (type = swap_list.next; type >= 0 && wrapped < 2; type = next) {
+ for (type = swap_list.head; type >= 0 && wrapped < 2; type = next) {
hp_index = atomic_xchg(&highest_priority_index, -1);
/*
* highest_priority_index records current highest priority swap
On Thu, Feb 13, 2014 at 6:42 PM, Mel Gorman <[email protected]> wrote:
> [...]
> - for (type = swap_list.next; type >= 0 && wrapped < 2; type = next) {
> + for (type = swap_list.head; type >= 0 && wrapped < 2; type = next) {
Does this lead to a "Schlemiel the Painter's algorithm"?
(Please forgive my word choice; I can't find a more precise term to describe it
because English is not my native language. My apologies.)
How about modifying it like this?
diff --git a/mm/swapfile.c b/mm/swapfile.c
index 4a7f7e6..d64aa55 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -676,7 +676,7 @@ swp_entry_t get_swap_page(void)
next = si->next;
if (next < 0 ||
(!wrapped && si->prio != swap_info[next]->prio)) {
- next = swap_list.head;
+ next = type;
wrapped++;
}
On Thu, Feb 13, 2014 at 11:58:05PM +0800, Weijie Yang wrote:
> On Thu, Feb 13, 2014 at 6:42 PM, Mel Gorman <[email protected]> wrote:
> > [...]
> > - for (type = swap_list.next; type >= 0 && wrapped < 2; type = next) {
> > + for (type = swap_list.head; type >= 0 && wrapped < 2; type = next) {
>
> Does this lead to a "Schlemiel the Painter's algorithm"?
> (Please forgive my word choice; I can't find a more precise term to describe it
> because English is not my native language. My apologies.)
>
> How about modifying it like this?
>
I blindly applied your version without review to see how it behaved, and
found that it uses every second swap file:
Filename Type Size Used Priority
/testswap1 file 100004 100004 8
/testswap2 file 100004 16 7
/testswap3 file 100004 100004 6
/testswap4 file 100004 8 5
/testswap5 file 100004 100004 4
/testswap6 file 100004 8 3
/testswap7 file 100004 100004 2
/testswap8 file 100004 23504 1
I admit I did not review the swap priority search algorithm in detail
because the fix superficially looked straightforward, but this
alternative is not the answer either.
--
Mel Gorman
SUSE Labs
On Fri, Feb 14, 2014 at 6:17 PM, Mel Gorman <[email protected]> wrote:
> On Thu, Feb 13, 2014 at 11:58:05PM +0800, Weijie Yang wrote:
>> On Thu, Feb 13, 2014 at 6:42 PM, Mel Gorman <[email protected]> wrote:
>> > [...]
>>
>> Does this lead to a "Schlemiel the Painter's algorithm"?
>> (Please forgive my word choice; I can't find a more precise term to describe it
>> because English is not my native language. My apologies.)
>>
>> How about modifying it like this?
>>
>
> I blindly applied your version without review to see how it behaved and
> found it uses every second swapfile like this
I am sorry to have wasted your time; I should have tested it.
I will review the code more carefully, and send a tested patch if I find a
better way.
My apologies again.
> [...]
On Sun, 16 Feb 2014, Weijie Yang wrote:
> On Fri, Feb 14, 2014 at 9:10 PM, Christian Ehrhardt
> <[email protected]> wrote:
> > Weijie Yang <weijie.yang.kh <at> gmail.com> writes:
> >
> >>
> >> On Thu, Feb 13, 2014 at 6:42 PM, Mel Gorman <mgorman <at> suse.de> wrote:
> > [...]
> >> > - for (type = swap_list.next; type >= 0 && wrapped < 2; type = next) {
> >> > + for (type = swap_list.head; type >= 0 && wrapped < 2; type = next) {
> >>
> > [...]
> >> Does this lead to a "Schlemiel the Painter's algorithm"?
> >> (Please forgive my word choice; I can't find a more precise term.)
> >>
> >> How about modifying it like this?
> >>
> > [...]
> >> - next = swap_list.head;
> >> + next = type;
> > [...]
> >
> > Hi,
> > Unfortunately, without studying the code more thoroughly I'm not even sure
> > whether you meant your code to extend or replace Mel's patch.
> >
> > To be sure about your intention: you referred to algorithm scaling because
> > you were afraid the new code would scan the full list all the time, right?
> >
> > But by simply letting the machines try both options, I can now
> > qualify both.
> >
> > Your patch alone creates a behaviour of jumping over priorities (see the
> > following example), so I hope you meant to combine both patches.
> > With that in mind, the combined patch I eventually tested looks
> > like this:
>
> Hi Christian,
>
> My patch is not appropriate, so there is no need to combine it with Mel's patch.
>
> What worries me about Mel's patch is not only the search efficiency
> (actually that has a negligible impact on the system) but also the
> following scenario:
>
> If two swapfiles have the same priority, by the usual semantics they
> should be used in balance. But with Mel's patch, get_swap_page() will
> always take the free swap entry from swap_list.head in priority order,
> and I worry that could break the balance.
>
> I think you can test this scenario if you have test machines available.
>
> I appreciate your work on this.
Weijie, I agree with you on both points: Schlemiel effect of repeatedly
restarting from head (already an unintended defect before Mel's patch),
and more importantly the breakage of swapfiles at the same priority.
Sorry, it has to be a Nak to Mel's patch, which fixes one behavior
at the expense of another. And if we were to go that way, better
just to rip out all of swap_list.next and highest_priority_index.
I had hoped to respond today with a better patch; but I just haven't
got it right yet either. I think we don't need to rush to fix it,
but fix it we certainly should.
Christian, congratulations on discovering this wrong behavior: at
first I assumed it came from Shaohua's 3.9 highest_priority_index
changes, but no; then I assumed it came from my 2.6.14 swap_lock
changes; but now I think it goes back even before 2.4.0, probably
ever since there have been swap priorities.
Hugh
The logic controlling the singly-linked list of swap_info_struct entries
for all active, i.e. swapon'ed, swap targets is rather complex, because:
- it stores the entries in priority order
- there is a pointer to the highest priority entry
- there is a pointer to the highest priority not-full entry
- there is a highest_priority_index variable set outside the swap_lock
- swap entries of equal priority should be used equally
This complexity leads to bugs such as:
https://lkml.org/lkml/2014/2/13/181
where different priority swap targets are incorrectly used equally.
That bug probably could be solved with the existing singly-linked lists,
but I think it would only add more complexity to the already difficult
to understand get_swap_page() swap_list iteration logic.
The first patch changes from a singly-linked list to a doubly-linked
list using list_heads; the highest_priority_index and related code are
removed and get_swap_page() starts each iteration at the highest priority
swap_info entry, even if it's full. While this does introduce
unnecessary list iteration (i.e. Schlemiel the painter's algorithm)
in the case where one or more of the highest priority entries are full,
the iteration and manipulation code is much simpler and behaves
correctly re: the above bug; and the second patch removes the unnecessary
iteration.
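The rotation idea in the first patch can be sketched in plain userspace C. This is a simplified model with made-up names (`swap_target`, `get_swap_target`), not the kernel code: scan a priority-ordered doubly-linked list from the head, skip full entries, and rotate the chosen entry behind its equal-priority peers so targets of the same priority are used round-robin.

```c
#include <assert.h>
#include <stddef.h>

/* Minimal userspace stand-ins for the kernel's struct list_head. */
struct list_head { struct list_head *prev, *next; };

static void list_init(struct list_head *h) { h->prev = h->next = h; }
static void list_add_tail(struct list_head *n, struct list_head *h)
{
	n->prev = h->prev; n->next = h;
	h->prev->next = n; h->prev = n;
}
static void list_del(struct list_head *n)
{
	n->prev->next = n->next; n->next->prev = n->prev;
	list_init(n);
}

/* A toy model of swap_info_struct: a priority plus a free-page count. */
struct swap_target {
	struct list_head list;
	int prio;
	int free_pages;
	int used;
};
#define target_of(p) \
	((struct swap_target *)((char *)(p) - offsetof(struct swap_target, list)))

/*
 * Pick a target the way the patched get_swap_page() is described to:
 * scan the priority-ordered list from the head, fall through full
 * entries to lower priorities, and rotate the chosen entry behind any
 * equal-priority peers so equal priorities are used round-robin.
 */
static struct swap_target *get_swap_target(struct list_head *head)
{
	struct list_head *p, *q;

	for (p = head->next; p != head; p = p->next) {
		struct swap_target *si = target_of(p);

		if (si->free_pages == 0)
			continue;		/* exhausted: try lower prio */

		/* find the end of si's run of equal-priority peers */
		q = p->next;
		while (q != head && target_of(q)->prio == si->prio)
			q = q->next;
		if (q != p->next) {		/* si has equal-prio peers */
			list_del(p);
			list_add_tail(p, q);	/* re-insert just before q */
		}
		si->free_pages--;
		si->used++;
		return si;
	}
	return NULL;				/* everything is full */
}
```

With one priority-8 target of 2 pages and two priority-7 targets of 4 pages each, this model drains the priority-8 target first and then alternates between the two priority-7 targets, which is the behaviour the patch description advertises.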
The second patch adds a new list that contains only swap_info entries
that are both active and not full, and a new spinlock to protect it.
This allows swap_info entries to be added or removed from the new list
immediately as they become full or not full.
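The second patch's idea can likewise be modelled in userspace C (again with illustrative names, and a pthread mutex standing in for the new spinlock): a separate priority-ordered list holds only the non-full targets, an entry is dropped the moment it fills and re-inserted in priority position when pages are freed, and the allocator only ever has to look at the first entry.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

struct list_head { struct list_head *prev, *next; };

static void list_init(struct list_head *h) { h->prev = h->next = h; }
static int  list_empty(const struct list_head *h) { return h->next == h; }
static void list_add_tail(struct list_head *n, struct list_head *h)
{
	n->prev = h->prev; n->next = h;
	h->prev->next = n; h->prev = n;
}
static void list_del_init(struct list_head *n)
{
	n->prev->next = n->next; n->next->prev = n->prev;
	list_init(n);
}

struct swap_target {
	struct list_head avail;		/* entry in the available list */
	int prio;
};
#define avail_of(p) \
	((struct swap_target *)((char *)(p) - offsetof(struct swap_target, avail)))

static struct list_head avail_head = { &avail_head, &avail_head };
static pthread_mutex_t avail_lock = PTHREAD_MUTEX_INITIALIZER;

/* Priority-ordered insert: before the first entry of lower-or-equal prio. */
static void avail_add(struct swap_target *si)
{
	struct list_head *p;

	pthread_mutex_lock(&avail_lock);
	if (list_empty(&si->avail)) {
		for (p = avail_head.next; p != &avail_head; p = p->next)
			if (si->prio >= avail_of(p)->prio)
				break;
		list_add_tail(&si->avail, p);	/* p may be the head: append */
	}
	pthread_mutex_unlock(&avail_lock);
}

/* Drop a target the moment it becomes full. */
static void avail_remove(struct swap_target *si)
{
	pthread_mutex_lock(&avail_lock);
	if (!list_empty(&si->avail))
		list_del_init(&si->avail);
	pthread_mutex_unlock(&avail_lock);
}

/* Highest-priority target that still has free pages, or NULL. */
static struct swap_target *avail_first(void)
{
	struct swap_target *si;

	pthread_mutex_lock(&avail_lock);
	si = list_empty(&avail_head) ? NULL : avail_of(avail_head.next);
	pthread_mutex_unlock(&avail_lock);
	return si;
}
```

The point of the separate lock is the one the cover letter states: full/not-full transitions happen while holding a per-target lock, under which swap_lock cannot be taken, so the available list needs its own protection.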
Dan Streetman (2):
swap: change swap_info singly-linked list to list_head
swap: use separate priority list for available swap_infos
include/linux/swap.h | 8 +--
include/linux/swapfile.h | 2 +-
mm/frontswap.c | 13 ++---
mm/swapfile.c | 212 ++++++++++++++++++++++++++++++++++---------------------------------
4 files changed, 114 insertions(+), 121 deletions(-)
Replace the singly-linked list tracking active, i.e. swapon'ed,
swap_info_struct entries with a doubly-linked list using struct list_heads.
Simplify the logic iterating and manipulating the list of entries,
especially get_swap_page(), by using standard list_head functions,
and removing the highest priority iteration logic.
The change fixes the bug:
https://lkml.org/lkml/2014/2/13/181
in which different priority swap entries after the highest priority entry
are incorrectly used equally in pairs. The swap behavior is now as
advertised, i.e. different priority swap entries are used in order, and
equal priority swap targets are used concurrently.
Signed-off-by: Dan Streetman <[email protected]>
---
include/linux/swap.h | 7 +--
include/linux/swapfile.h | 2 +-
mm/frontswap.c | 13 ++--
mm/swapfile.c | 156 +++++++++++++++++------------------------------
4 files changed, 63 insertions(+), 115 deletions(-)
diff --git a/include/linux/swap.h b/include/linux/swap.h
index 3507115..96662d8 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -214,8 +214,8 @@ struct percpu_cluster {
struct swap_info_struct {
unsigned long flags; /* SWP_USED etc: see above */
signed short prio; /* swap priority of this type */
+ struct list_head list; /* entry in swap list */
signed char type; /* strange name for an index */
- signed char next; /* next type on the swap list */
unsigned int max; /* extent of the swap_map */
unsigned char *swap_map; /* vmalloc'ed array of usage counts */
struct swap_cluster_info *cluster_info; /* cluster info. Only for SSD */
@@ -255,11 +255,6 @@ struct swap_info_struct {
struct swap_cluster_info discard_cluster_tail; /* list tail of discard clusters */
};
-struct swap_list_t {
- int head; /* head of priority-ordered swapfile list */
- int next; /* swapfile to be used next */
-};
-
/* linux/mm/workingset.c */
void *workingset_eviction(struct address_space *mapping, struct page *page);
bool workingset_refault(void *shadow);
diff --git a/include/linux/swapfile.h b/include/linux/swapfile.h
index e282624..2eab382 100644
--- a/include/linux/swapfile.h
+++ b/include/linux/swapfile.h
@@ -6,7 +6,7 @@
* want to expose them to the dozens of source files that include swap.h
*/
extern spinlock_t swap_lock;
-extern struct swap_list_t swap_list;
+extern struct list_head swap_list_head;
extern struct swap_info_struct *swap_info[];
extern int try_to_unuse(unsigned int, bool, unsigned long);
diff --git a/mm/frontswap.c b/mm/frontswap.c
index 1b24bdc..fae1160 100644
--- a/mm/frontswap.c
+++ b/mm/frontswap.c
@@ -327,15 +327,12 @@ EXPORT_SYMBOL(__frontswap_invalidate_area);
static unsigned long __frontswap_curr_pages(void)
{
- int type;
unsigned long totalpages = 0;
struct swap_info_struct *si = NULL;
assert_spin_locked(&swap_lock);
- for (type = swap_list.head; type >= 0; type = si->next) {
- si = swap_info[type];
+ list_for_each_entry(si, &swap_list_head, list)
totalpages += atomic_read(&si->frontswap_pages);
- }
return totalpages;
}
@@ -347,11 +344,9 @@ static int __frontswap_unuse_pages(unsigned long total, unsigned long *unused,
int si_frontswap_pages;
unsigned long total_pages_to_unuse = total;
unsigned long pages = 0, pages_to_unuse = 0;
- int type;
assert_spin_locked(&swap_lock);
- for (type = swap_list.head; type >= 0; type = si->next) {
- si = swap_info[type];
+ list_for_each_entry(si, &swap_list_head, list) {
si_frontswap_pages = atomic_read(&si->frontswap_pages);
if (total_pages_to_unuse < si_frontswap_pages) {
pages = pages_to_unuse = total_pages_to_unuse;
@@ -366,7 +361,7 @@ static int __frontswap_unuse_pages(unsigned long total, unsigned long *unused,
}
vm_unacct_memory(pages);
*unused = pages_to_unuse;
- *swapid = type;
+ *swapid = si->type;
ret = 0;
break;
}
@@ -413,7 +408,7 @@ void frontswap_shrink(unsigned long target_pages)
/*
* we don't want to hold swap_lock while doing a very
* lengthy try_to_unuse, but swap_list may change
- * so restart scan from swap_list.head each time
+ * so restart scan from swap_list_head each time
*/
spin_lock(&swap_lock);
ret = __frontswap_shrink(target_pages, &pages_to_unuse, &type);
diff --git a/mm/swapfile.c b/mm/swapfile.c
index 4a7f7e6..b958645 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -51,14 +51,14 @@ atomic_long_t nr_swap_pages;
/* protected with swap_lock. reading in vm_swap_full() doesn't need lock */
long total_swap_pages;
static int least_priority;
-static atomic_t highest_priority_index = ATOMIC_INIT(-1);
static const char Bad_file[] = "Bad swap file entry ";
static const char Unused_file[] = "Unused swap file entry ";
static const char Bad_offset[] = "Bad swap offset entry ";
static const char Unused_offset[] = "Unused swap offset entry ";
-struct swap_list_t swap_list = {-1, -1};
+/* all active swap_info */
+LIST_HEAD(swap_list_head);
struct swap_info_struct *swap_info[MAX_SWAPFILES];
@@ -640,66 +640,50 @@ no_page:
swp_entry_t get_swap_page(void)
{
- struct swap_info_struct *si;
+ struct swap_info_struct *si, *next;
pgoff_t offset;
- int type, next;
- int wrapped = 0;
- int hp_index;
+ struct list_head *tmp;
spin_lock(&swap_lock);
if (atomic_long_read(&nr_swap_pages) <= 0)
goto noswap;
atomic_long_dec(&nr_swap_pages);
- for (type = swap_list.next; type >= 0 && wrapped < 2; type = next) {
- hp_index = atomic_xchg(&highest_priority_index, -1);
- /*
- * highest_priority_index records current highest priority swap
- * type which just frees swap entries. If its priority is
- * higher than that of swap_list.next swap type, we use it. It
- * isn't protected by swap_lock, so it can be an invalid value
- * if the corresponding swap type is swapoff. We double check
- * the flags here. It's even possible the swap type is swapoff
- * and swapon again and its priority is changed. In such rare
- * case, low prority swap type might be used, but eventually
- * high priority swap will be used after several rounds of
- * swap.
- */
- if (hp_index != -1 && hp_index != type &&
- swap_info[type]->prio < swap_info[hp_index]->prio &&
- (swap_info[hp_index]->flags & SWP_WRITEOK)) {
- type = hp_index;
- swap_list.next = type;
- }
-
- si = swap_info[type];
- next = si->next;
- if (next < 0 ||
- (!wrapped && si->prio != swap_info[next]->prio)) {
- next = swap_list.head;
- wrapped++;
- }
-
+ list_for_each(tmp, &swap_list_head) {
+ si = list_entry(tmp, typeof(*si), list);
spin_lock(&si->lock);
- if (!si->highest_bit) {
- spin_unlock(&si->lock);
- continue;
- }
- if (!(si->flags & SWP_WRITEOK)) {
+ if (!si->highest_bit || !(si->flags & SWP_WRITEOK)) {
spin_unlock(&si->lock);
continue;
}
- swap_list.next = next;
+ /*
+ * rotate the current swap_info that we're going to use
+ * to after any other swap_info that have the same prio,
+ * so that all equal-priority swap_info get used equally
+ */
+ next = si;
+ list_for_each_entry_continue(next, &swap_list_head, list) {
+ if (si->prio != next->prio)
+ break;
+ list_rotate_left(&si->list);
+ next = si;
+ }
spin_unlock(&swap_lock);
/* This is called for allocating swap entry for cache */
offset = scan_swap_map(si, SWAP_HAS_CACHE);
spin_unlock(&si->lock);
if (offset)
- return swp_entry(type, offset);
+ return swp_entry(si->type, offset);
spin_lock(&swap_lock);
- next = swap_list.next;
+ /*
+ * shouldn't really have got here, but for some reason the
+ * scan_swap_map came back empty for this swap_info.
+ * Since we dropped the swap_lock, there may now be
+ * non-full higher prio swap_infos; let's start over.
+ */
+ tmp = &swap_list_head;
}
atomic_long_inc(&nr_swap_pages);
@@ -766,27 +750,6 @@ out:
return NULL;
}
-/*
- * This swap type frees swap entry, check if it is the highest priority swap
- * type which just frees swap entry. get_swap_page() uses
- * highest_priority_index to search highest priority swap type. The
- * swap_info_struct.lock can't protect us if there are multiple swap types
- * active, so we use atomic_cmpxchg.
- */
-static void set_highest_priority_index(int type)
-{
- int old_hp_index, new_hp_index;
-
- do {
- old_hp_index = atomic_read(&highest_priority_index);
- if (old_hp_index != -1 &&
- swap_info[old_hp_index]->prio >= swap_info[type]->prio)
- break;
- new_hp_index = type;
- } while (atomic_cmpxchg(&highest_priority_index,
- old_hp_index, new_hp_index) != old_hp_index);
-}
-
static unsigned char swap_entry_free(struct swap_info_struct *p,
swp_entry_t entry, unsigned char usage)
{
@@ -830,7 +793,6 @@ static unsigned char swap_entry_free(struct swap_info_struct *p,
p->lowest_bit = offset;
if (offset > p->highest_bit)
p->highest_bit = offset;
- set_highest_priority_index(p->type);
atomic_long_inc(&nr_swap_pages);
p->inuse_pages--;
frontswap_invalidate_page(p->type, offset);
@@ -1765,7 +1727,7 @@ static void _enable_swap_info(struct swap_info_struct *p, int prio,
unsigned char *swap_map,
struct swap_cluster_info *cluster_info)
{
- int i, prev;
+ struct swap_info_struct *si;
if (prio >= 0)
p->prio = prio;
@@ -1777,18 +1739,21 @@ static void _enable_swap_info(struct swap_info_struct *p, int prio,
atomic_long_add(p->pages, &nr_swap_pages);
total_swap_pages += p->pages;
- /* insert swap space into swap_list: */
- prev = -1;
- for (i = swap_list.head; i >= 0; i = swap_info[i]->next) {
- if (p->prio >= swap_info[i]->prio)
- break;
- prev = i;
+ assert_spin_locked(&swap_lock);
+ BUG_ON(!list_empty(&p->list));
+ /* insert into swap list: */
+ list_for_each_entry(si, &swap_list_head, list) {
+ if (p->prio >= si->prio) {
+ list_add_tail(&p->list, &si->list);
+ return;
+ }
}
- p->next = i;
- if (prev < 0)
- swap_list.head = swap_list.next = p->type;
- else
- swap_info[prev]->next = p->type;
+ /*
+ * this covers two cases:
+ * 1) p->prio is less than all existing prio
+ * 2) the swap list is empty
+ */
+ list_add_tail(&p->list, &swap_list_head);
}
static void enable_swap_info(struct swap_info_struct *p, int prio,
@@ -1823,8 +1788,7 @@ SYSCALL_DEFINE1(swapoff, const char __user *, specialfile)
struct address_space *mapping;
struct inode *inode;
struct filename *pathname;
- int i, type, prev;
- int err;
+ int err, found = 0;
unsigned int old_block_size;
if (!capable(CAP_SYS_ADMIN))
@@ -1842,17 +1806,16 @@ SYSCALL_DEFINE1(swapoff, const char __user *, specialfile)
goto out;
mapping = victim->f_mapping;
- prev = -1;
spin_lock(&swap_lock);
- for (type = swap_list.head; type >= 0; type = swap_info[type]->next) {
- p = swap_info[type];
+ list_for_each_entry(p, &swap_list_head, list) {
if (p->flags & SWP_WRITEOK) {
- if (p->swap_file->f_mapping == mapping)
+ if (p->swap_file->f_mapping == mapping) {
+ found = 1;
break;
+ }
}
- prev = type;
}
- if (type < 0) {
+ if (!found) {
err = -EINVAL;
spin_unlock(&swap_lock);
goto out_dput;
@@ -1864,20 +1827,15 @@ SYSCALL_DEFINE1(swapoff, const char __user *, specialfile)
spin_unlock(&swap_lock);
goto out_dput;
}
- if (prev < 0)
- swap_list.head = p->next;
- else
- swap_info[prev]->next = p->next;
- if (type == swap_list.next) {
- /* just pick something that's safe... */
- swap_list.next = swap_list.head;
- }
spin_lock(&p->lock);
if (p->prio < 0) {
- for (i = p->next; i >= 0; i = swap_info[i]->next)
- swap_info[i]->prio = p->prio--;
+ struct swap_info_struct *si = p;
+ list_for_each_entry_continue(si, &swap_list_head, list) {
+ si->prio++;
+ }
least_priority++;
}
+ list_del_init(&p->list);
atomic_long_sub(p->pages, &nr_swap_pages);
total_swap_pages -= p->pages;
p->flags &= ~SWP_WRITEOK;
@@ -1885,7 +1843,7 @@ SYSCALL_DEFINE1(swapoff, const char __user *, specialfile)
spin_unlock(&swap_lock);
set_current_oom_origin();
- err = try_to_unuse(type, false, 0); /* force all pages to be unused */
+ err = try_to_unuse(p->type, false, 0); /* force unuse all pages */
clear_current_oom_origin();
if (err) {
@@ -1926,7 +1884,7 @@ SYSCALL_DEFINE1(swapoff, const char __user *, specialfile)
frontswap_map = frontswap_map_get(p);
spin_unlock(&p->lock);
spin_unlock(&swap_lock);
- frontswap_invalidate_area(type);
+ frontswap_invalidate_area(p->type);
frontswap_map_set(p, NULL);
mutex_unlock(&swapon_mutex);
free_percpu(p->percpu_cluster);
@@ -1935,7 +1893,7 @@ SYSCALL_DEFINE1(swapoff, const char __user *, specialfile)
vfree(cluster_info);
vfree(frontswap_map);
/* Destroy swap account information */
- swap_cgroup_swapoff(type);
+ swap_cgroup_swapoff(p->type);
inode = mapping->host;
if (S_ISBLK(inode->i_mode)) {
@@ -2142,8 +2100,8 @@ static struct swap_info_struct *alloc_swap_info(void)
*/
}
INIT_LIST_HEAD(&p->first_swap_extent.list);
+ INIT_LIST_HEAD(&p->list);
p->flags = SWP_USED;
- p->next = -1;
spin_unlock(&swap_lock);
spin_lock_init(&p->lock);
--
1.8.3.1
Originally get_swap_page() started iterating through the singly-linked
list of swap_info_structs using swap_list.next or highest_priority_index,
which both were intended to point to the highest priority active swap
target that was not full. The previous patch in this series changed the
singly-linked list to a doubly-linked list, and removed the logic to start
at the highest priority non-full entry; it starts scanning at the highest
priority entry each time, even if the entry is full.
Add a new list, also priority ordered, to track only swap_info_structs
that are available, i.e. active and not full. Use a new spinlock so that
entries can be added/removed outside of get_swap_page; that wasn't possible
previously because the main list is protected by swap_lock, which can't be
taken when holding a swap_info_struct->lock because of locking order.
The get_swap_page() logic now does not need to hold the swap_lock, and it
iterates only through swap_info_structs that are available.
Signed-off-by: Dan Streetman <[email protected]>
---
include/linux/swap.h | 1 +
mm/swapfile.c | 128 ++++++++++++++++++++++++++++++++++-----------------
2 files changed, 87 insertions(+), 42 deletions(-)
diff --git a/include/linux/swap.h b/include/linux/swap.h
index 96662d8..d9263db 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -214,6 +214,7 @@ struct percpu_cluster {
struct swap_info_struct {
unsigned long flags; /* SWP_USED etc: see above */
signed short prio; /* swap priority of this type */
+ struct list_head prio_list; /* entry in priority list */
struct list_head list; /* entry in swap list */
signed char type; /* strange name for an index */
unsigned int max; /* extent of the swap_map */
diff --git a/mm/swapfile.c b/mm/swapfile.c
index b958645..3c38461 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -57,9 +57,13 @@ static const char Unused_file[] = "Unused swap file entry ";
static const char Bad_offset[] = "Bad swap offset entry ";
static const char Unused_offset[] = "Unused swap offset entry ";
-/* all active swap_info */
+/* all active swap_info; protected with swap_lock */
LIST_HEAD(swap_list_head);
+/* all available (active, not full) swap_info, priority ordered */
+static LIST_HEAD(prio_head);
+static DEFINE_SPINLOCK(prio_lock);
+
struct swap_info_struct *swap_info[MAX_SWAPFILES];
static DEFINE_MUTEX(swapon_mutex);
@@ -73,6 +77,27 @@ static inline unsigned char swap_count(unsigned char ent)
return ent & ~SWAP_HAS_CACHE; /* may include SWAP_HAS_CONT flag */
}
+/*
+ * add, in priority order, swap_info (p)->(le) list_head to list (lh)
+ * this list-generic function is needed because both swap_list_head
+ * and prio_head need to be priority ordered:
+ * swap_list_head in swapoff to adjust lower negative prio swap_infos
+ * prio_list in get_swap_page to scan highest prio swap_info first
+ */
+#define swap_info_list_add(p, lh, le) do { \
+ struct swap_info_struct *_si; \
+ BUG_ON(!list_empty(&(p)->le)); \
+ list_for_each_entry(_si, (lh), le) { \
+ if ((p)->prio >= _si->prio) { \
+ list_add_tail(&(p)->le, &_si->le); \
+ break; \
+ } \
+ } \
+ /* lh empty, or p lowest prio */ \
+ if (list_empty(&(p)->le)) \
+ list_add_tail(&(p)->le, (lh)); \
+} while (0)
+
/* returns 1 if swap entry is freed */
static int
__try_to_reclaim_swap(struct swap_info_struct *si, unsigned long offset)
@@ -591,6 +616,9 @@ checks:
if (si->inuse_pages == si->pages) {
si->lowest_bit = si->max;
si->highest_bit = 0;
+ spin_lock(&prio_lock);
+ list_del_init(&si->prio_list);
+ spin_unlock(&prio_lock);
}
si->swap_map[offset] = usage;
inc_cluster_info_page(si, si->cluster_info, offset);
@@ -642,53 +670,68 @@ swp_entry_t get_swap_page(void)
{
struct swap_info_struct *si, *next;
pgoff_t offset;
- struct list_head *tmp;
- spin_lock(&swap_lock);
if (atomic_long_read(&nr_swap_pages) <= 0)
goto noswap;
atomic_long_dec(&nr_swap_pages);
- list_for_each(tmp, &swap_list_head) {
- si = list_entry(tmp, typeof(*si), list);
- spin_lock(&si->lock);
- if (!si->highest_bit || !(si->flags & SWP_WRITEOK)) {
- spin_unlock(&si->lock);
- continue;
- }
-
+ spin_lock(&prio_lock);
+start_over:
+ list_for_each_entry_safe(si, next, &prio_head, prio_list) {
/*
- * rotate the current swap_info that we're going to use
+ * rotate the current swap_info that we're checking
* to after any other swap_info that have the same prio,
* so that all equal-priority swap_info get used equally
*/
- next = si;
- list_for_each_entry_continue(next, &swap_list_head, list) {
- if (si->prio != next->prio)
+ struct swap_info_struct *eq_prio = si;
+ list_for_each_entry_continue(eq_prio, &prio_head, prio_list) {
+ if (si->prio != eq_prio->prio)
break;
- list_rotate_left(&si->list);
- next = si;
+ list_rotate_left(&si->prio_list);
+ eq_prio = si;
+ }
+ spin_unlock(&prio_lock);
+ spin_lock(&si->lock);
+ if (!si->highest_bit || !(si->flags & SWP_WRITEOK)) {
+ spin_lock(&prio_lock);
+ if (list_empty(&si->prio_list)) {
+ spin_unlock(&si->lock);
+ goto nextsi;
+ }
+ WARN(!si->highest_bit,
+ "swap_info %d in list but !highest_bit\n",
+ si->type);
+ WARN(!(si->flags & SWP_WRITEOK),
+ "swap_info %d in list but !SWP_WRITEOK\n",
+ si->type);
+ list_del_init(&si->prio_list);
+ spin_unlock(&si->lock);
+ goto nextsi;
}
- spin_unlock(&swap_lock);
/* This is called for allocating swap entry for cache */
offset = scan_swap_map(si, SWAP_HAS_CACHE);
spin_unlock(&si->lock);
if (offset)
return swp_entry(si->type, offset);
- spin_lock(&swap_lock);
+ printk(KERN_DEBUG "scan_swap_map of si %d failed to find offset\n",
+ si->type);
+ spin_lock(&prio_lock);
+nextsi:
/*
- * shouldn't really have got here, but for some reason the
- * scan_swap_map came back empty for this swap_info.
- * Since we dropped the swap_lock, there may now be
- * non-full higher prio swap_infos; let's start over.
+ * shouldn't really have got here. either si was
+ * in the prio_head list but was full or !writeok, or
+ * scan_swap_map came back empty. Since we dropped
+ * the prio_lock, the prio_head list may have been
+ * modified; so if next is still in the prio_head
+ * list then try it, otherwise start over.
*/
- tmp = &swap_list_head;
+ if (list_empty(&next->prio_list))
+ goto start_over;
}
atomic_long_inc(&nr_swap_pages);
noswap:
- spin_unlock(&swap_lock);
return (swp_entry_t) {0};
}
@@ -791,8 +834,17 @@ static unsigned char swap_entry_free(struct swap_info_struct *p,
dec_cluster_info_page(p, p->cluster_info, offset);
if (offset < p->lowest_bit)
p->lowest_bit = offset;
- if (offset > p->highest_bit)
+ if (offset > p->highest_bit) {
+ bool was_full = !p->highest_bit;
p->highest_bit = offset;
+ if (was_full && (p->flags & SWP_WRITEOK)) {
+ spin_lock(&prio_lock);
+ if (list_empty(&p->prio_list))
+ swap_info_list_add(p, &prio_head,
+ prio_list);
+ spin_unlock(&prio_lock);
+ }
+ }
atomic_long_inc(&nr_swap_pages);
p->inuse_pages--;
frontswap_invalidate_page(p->type, offset);
@@ -1727,8 +1779,6 @@ static void _enable_swap_info(struct swap_info_struct *p, int prio,
unsigned char *swap_map,
struct swap_cluster_info *cluster_info)
{
- struct swap_info_struct *si;
-
if (prio >= 0)
p->prio = prio;
else
@@ -1740,20 +1790,10 @@ static void _enable_swap_info(struct swap_info_struct *p, int prio,
total_swap_pages += p->pages;
assert_spin_locked(&swap_lock);
- BUG_ON(!list_empty(&p->list));
- /* insert into swap list: */
- list_for_each_entry(si, &swap_list_head, list) {
- if (p->prio >= si->prio) {
- list_add_tail(&p->list, &si->list);
- return;
- }
- }
- /*
- * this covers two cases:
- * 1) p->prio is less than all existing prio
- * 2) the swap list is empty
- */
- list_add_tail(&p->list, &swap_list_head);
+ swap_info_list_add(p, &swap_list_head, list);
+ spin_lock(&prio_lock);
+ swap_info_list_add(p, &prio_head, prio_list);
+ spin_unlock(&prio_lock);
}
static void enable_swap_info(struct swap_info_struct *p, int prio,
@@ -1827,6 +1867,9 @@ SYSCALL_DEFINE1(swapoff, const char __user *, specialfile)
spin_unlock(&swap_lock);
goto out_dput;
}
+ spin_lock(&prio_lock);
+ list_del_init(&p->prio_list);
+ spin_unlock(&prio_lock);
spin_lock(&p->lock);
if (p->prio < 0) {
struct swap_info_struct *si = p;
@@ -2101,6 +2144,7 @@ static struct swap_info_struct *alloc_swap_info(void)
}
INIT_LIST_HEAD(&p->first_swap_extent.list);
INIT_LIST_HEAD(&p->list);
+ INIT_LIST_HEAD(&p->prio_list);
p->flags = SWP_USED;
spin_unlock(&swap_lock);
spin_lock_init(&p->lock);
--
1.8.3.1
On Sat, Apr 12, 2014 at 05:00:53PM -0400, Dan Streetman wrote:
> Replace the singly-linked list tracking active, i.e. swapon'ed,
> swap_info_struct entries with a doubly-linked list using struct list_heads.
> Simplify the logic iterating and manipulating the list of entries,
> especially get_swap_page(), by using standard list_head functions,
> and removing the highest priority iteration logic.
>
> The change fixes the bug:
> https://lkml.org/lkml/2014/2/13/181
> in which different priority swap entries after the highest priority entry
> are incorrectly used equally in pairs. The swap behavior is now as
> advertised, i.e. different priority swap entries are used in order, and
> equal priority swap targets are used concurrently.
>
> Signed-off-by: Dan Streetman <[email protected]>
> ---
> include/linux/swap.h | 7 +--
> include/linux/swapfile.h | 2 +-
> mm/frontswap.c | 13 ++--
> mm/swapfile.c | 156 +++++++++++++++++------------------------------
> 4 files changed, 63 insertions(+), 115 deletions(-)
>
> diff --git a/include/linux/swap.h b/include/linux/swap.h
> index 3507115..96662d8 100644
> --- a/include/linux/swap.h
> +++ b/include/linux/swap.h
> @@ -214,8 +214,8 @@ struct percpu_cluster {
> struct swap_info_struct {
> unsigned long flags; /* SWP_USED etc: see above */
> signed short prio; /* swap priority of this type */
> + struct list_head list; /* entry in swap list */
> signed char type; /* strange name for an index */
> - signed char next; /* next type on the swap list */
> unsigned int max; /* extent of the swap_map */
> unsigned char *swap_map; /* vmalloc'ed array of usage counts */
> struct swap_cluster_info *cluster_info; /* cluster info. Only for SSD */
> @@ -255,11 +255,6 @@ struct swap_info_struct {
> struct swap_cluster_info discard_cluster_tail; /* list tail of discard clusters */
> };
>
> -struct swap_list_t {
> - int head; /* head of priority-ordered swapfile list */
> - int next; /* swapfile to be used next */
> -};
> -
> /* linux/mm/workingset.c */
> void *workingset_eviction(struct address_space *mapping, struct page *page);
> bool workingset_refault(void *shadow);
> diff --git a/include/linux/swapfile.h b/include/linux/swapfile.h
> index e282624..2eab382 100644
> --- a/include/linux/swapfile.h
> +++ b/include/linux/swapfile.h
> @@ -6,7 +6,7 @@
> * want to expose them to the dozens of source files that include swap.h
> */
> extern spinlock_t swap_lock;
> -extern struct swap_list_t swap_list;
> +extern struct list_head swap_list_head;
> extern struct swap_info_struct *swap_info[];
> extern int try_to_unuse(unsigned int, bool, unsigned long);
>
> diff --git a/mm/frontswap.c b/mm/frontswap.c
> index 1b24bdc..fae1160 100644
> --- a/mm/frontswap.c
> +++ b/mm/frontswap.c
> @@ -327,15 +327,12 @@ EXPORT_SYMBOL(__frontswap_invalidate_area);
>
> static unsigned long __frontswap_curr_pages(void)
> {
> - int type;
> unsigned long totalpages = 0;
> struct swap_info_struct *si = NULL;
>
> assert_spin_locked(&swap_lock);
> - for (type = swap_list.head; type >= 0; type = si->next) {
> - si = swap_info[type];
> + list_for_each_entry(si, &swap_list_head, list)
> totalpages += atomic_read(&si->frontswap_pages);
> - }
> return totalpages;
> }
>
> @@ -347,11 +344,9 @@ static int __frontswap_unuse_pages(unsigned long total, unsigned long *unused,
> int si_frontswap_pages;
> unsigned long total_pages_to_unuse = total;
> unsigned long pages = 0, pages_to_unuse = 0;
> - int type;
>
> assert_spin_locked(&swap_lock);
> - for (type = swap_list.head; type >= 0; type = si->next) {
> - si = swap_info[type];
> + list_for_each_entry(si, &swap_list_head, list) {
> si_frontswap_pages = atomic_read(&si->frontswap_pages);
> if (total_pages_to_unuse < si_frontswap_pages) {
> pages = pages_to_unuse = total_pages_to_unuse;
The frontswap shrink code looks suspicious. If the target is smaller than
the total number of frontswap pages then it does nothing. The callers
appear to get this right at least. Similarly, if the first swapfile has
fewer frontswap pages than the target then it does not unuse the target
number of pages because it only handles one swap file. It's outside the
scope of your patch to address this, or to wonder whether the Xen balloon
driver is really using it the way it's expected to.
The old code scanned the files in priority order. Superficially this one
does not appear to, but in fact it does, because you add the swap files to
the list in priority order during swapon. If you do another revision, it's
worth adding a comment above swap_list_head noting that the list is ordered
by priority and protected by the swap_lock.
> @@ -366,7 +361,7 @@ static int __frontswap_unuse_pages(unsigned long total, unsigned long *unused,
> }
> vm_unacct_memory(pages);
> *unused = pages_to_unuse;
> - *swapid = type;
> + *swapid = si->type;
> ret = 0;
> break;
> }
> @@ -413,7 +408,7 @@ void frontswap_shrink(unsigned long target_pages)
> /*
> * we don't want to hold swap_lock while doing a very
> * lengthy try_to_unuse, but swap_list may change
> - * so restart scan from swap_list.head each time
> + * so restart scan from swap_list_head each time
> */
> spin_lock(&swap_lock);
> ret = __frontswap_shrink(target_pages, &pages_to_unuse, &type);
> diff --git a/mm/swapfile.c b/mm/swapfile.c
> index 4a7f7e6..b958645 100644
> --- a/mm/swapfile.c
> +++ b/mm/swapfile.c
> @@ -51,14 +51,14 @@ atomic_long_t nr_swap_pages;
> /* protected with swap_lock. reading in vm_swap_full() doesn't need lock */
> long total_swap_pages;
> static int least_priority;
> -static atomic_t highest_priority_index = ATOMIC_INIT(-1);
>
> static const char Bad_file[] = "Bad swap file entry ";
> static const char Unused_file[] = "Unused swap file entry ";
> static const char Bad_offset[] = "Bad swap offset entry ";
> static const char Unused_offset[] = "Unused swap offset entry ";
>
> -struct swap_list_t swap_list = {-1, -1};
> +/* all active swap_info */
> +LIST_HEAD(swap_list_head);
>
> struct swap_info_struct *swap_info[MAX_SWAPFILES];
>
> @@ -640,66 +640,50 @@ no_page:
>
> swp_entry_t get_swap_page(void)
> {
> - struct swap_info_struct *si;
> + struct swap_info_struct *si, *next;
> pgoff_t offset;
> - int type, next;
> - int wrapped = 0;
> - int hp_index;
> + struct list_head *tmp;
>
> spin_lock(&swap_lock);
> if (atomic_long_read(&nr_swap_pages) <= 0)
> goto noswap;
> atomic_long_dec(&nr_swap_pages);
>
> - for (type = swap_list.next; type >= 0 && wrapped < 2; type = next) {
> - hp_index = atomic_xchg(&highest_priority_index, -1);
> - /*
> - * highest_priority_index records current highest priority swap
> - * type which just frees swap entries. If its priority is
> - * higher than that of swap_list.next swap type, we use it. It
> - * isn't protected by swap_lock, so it can be an invalid value
> - * if the corresponding swap type is swapoff. We double check
> - * the flags here. It's even possible the swap type is swapoff
> - * and swapon again and its priority is changed. In such rare
> - * case, low prority swap type might be used, but eventually
> - * high priority swap will be used after several rounds of
> - * swap.
> - */
> - if (hp_index != -1 && hp_index != type &&
> - swap_info[type]->prio < swap_info[hp_index]->prio &&
> - (swap_info[hp_index]->flags & SWP_WRITEOK)) {
> - type = hp_index;
> - swap_list.next = type;
> - }
> -
> - si = swap_info[type];
> - next = si->next;
> - if (next < 0 ||
> - (!wrapped && si->prio != swap_info[next]->prio)) {
> - next = swap_list.head;
> - wrapped++;
> - }
> -
> + list_for_each(tmp, &swap_list_head) {
> + si = list_entry(tmp, typeof(*si), list);
> spin_lock(&si->lock);
> - if (!si->highest_bit) {
> - spin_unlock(&si->lock);
> - continue;
> - }
> - if (!(si->flags & SWP_WRITEOK)) {
> + if (!si->highest_bit || !(si->flags & SWP_WRITEOK)) {
> spin_unlock(&si->lock);
> continue;
> }
>
> - swap_list.next = next;
> + /*
> + * rotate the current swap_info that we're going to use
> + * to after any other swap_info that have the same prio,
> + * so that all equal-priority swap_info get used equally
> + */
> + next = si;
> + list_for_each_entry_continue(next, &swap_list_head, list) {
> + if (si->prio != next->prio)
> + break;
> + list_rotate_left(&si->list);
> + next = si;
> + }
>
The list manipulations will be a lot of cache writes as the list is shuffled
around. On slow storage I do not think this will be noticeable, but it may
be noticeable on faster swap devices that are SSD-based. I've added Shaohua
Li to the cc as he has been concerned with the performance of swap in the
past. Shaohua, can you run this patchset through any of your test cases,
with the addition that multiple swap files are used, to see if the cache
writes are noticeable? You'll need multiple swap files, some of which are
at equal priority, so that the list shuffling logic is triggered.
> spin_unlock(&swap_lock);
> /* This is called for allocating swap entry for cache */
> offset = scan_swap_map(si, SWAP_HAS_CACHE);
> spin_unlock(&si->lock);
> if (offset)
> - return swp_entry(type, offset);
> + return swp_entry(si->type, offset);
> spin_lock(&swap_lock);
> - next = swap_list.next;
> + /*
> + * shouldn't really have got here, but for some reason the
> + * scan_swap_map came back empty for this swap_info.
> + * Since we dropped the swap_lock, there may now be
> + * non-full higher prio swap_infos; let's start over.
> + */
> + tmp = &swap_list_head;
> }
Has this ever triggered? The number of swap pages was examined under the
swap lock, so no other process should have been iterating through the
swap files. Once a candidate was found, the si lock was acquired for the
swap map scan, so nothing else should have raced with it.
>
> atomic_long_inc(&nr_swap_pages);
> @@ -766,27 +750,6 @@ out:
> return NULL;
> }
>
> -/*
> - * This swap type frees swap entry, check if it is the highest priority swap
> - * type which just frees swap entry. get_swap_page() uses
> - * highest_priority_index to search highest priority swap type. The
> - * swap_info_struct.lock can't protect us if there are multiple swap types
> - * active, so we use atomic_cmpxchg.
> - */
> -static void set_highest_priority_index(int type)
> -{
> - int old_hp_index, new_hp_index;
> -
> - do {
> - old_hp_index = atomic_read(&highest_priority_index);
> - if (old_hp_index != -1 &&
> - swap_info[old_hp_index]->prio >= swap_info[type]->prio)
> - break;
> - new_hp_index = type;
> - } while (atomic_cmpxchg(&highest_priority_index,
> - old_hp_index, new_hp_index) != old_hp_index);
> -}
> -
> static unsigned char swap_entry_free(struct swap_info_struct *p,
> swp_entry_t entry, unsigned char usage)
> {
> @@ -830,7 +793,6 @@ static unsigned char swap_entry_free(struct swap_info_struct *p,
> p->lowest_bit = offset;
> if (offset > p->highest_bit)
> p->highest_bit = offset;
> - set_highest_priority_index(p->type);
> atomic_long_inc(&nr_swap_pages);
> p->inuse_pages--;
> frontswap_invalidate_page(p->type, offset);
> @@ -1765,7 +1727,7 @@ static void _enable_swap_info(struct swap_info_struct *p, int prio,
> unsigned char *swap_map,
> struct swap_cluster_info *cluster_info)
> {
> - int i, prev;
> + struct swap_info_struct *si;
>
> if (prio >= 0)
> p->prio = prio;
> @@ -1777,18 +1739,21 @@ static void _enable_swap_info(struct swap_info_struct *p, int prio,
> atomic_long_add(p->pages, &nr_swap_pages);
> total_swap_pages += p->pages;
>
> - /* insert swap space into swap_list: */
> - prev = -1;
> - for (i = swap_list.head; i >= 0; i = swap_info[i]->next) {
> - if (p->prio >= swap_info[i]->prio)
> - break;
> - prev = i;
> + assert_spin_locked(&swap_lock);
> + BUG_ON(!list_empty(&p->list));
> + /* insert into swap list: */
> + list_for_each_entry(si, &swap_list_head, list) {
> + if (p->prio >= si->prio) {
> + list_add_tail(&p->list, &si->list);
> + return;
> + }
An additional comment saying that the list must be priority ordered for
get_swap_page wouldn't hurt.
> }
> - p->next = i;
> - if (prev < 0)
> - swap_list.head = swap_list.next = p->type;
> - else
> - swap_info[prev]->next = p->type;
> + /*
> + * this covers two cases:
> + * 1) p->prio is less than all existing prio
> + * 2) the swap list is empty
> + */
> + list_add_tail(&p->list, &swap_list_head);
> }
>
> static void enable_swap_info(struct swap_info_struct *p, int prio,
> @@ -1823,8 +1788,7 @@ SYSCALL_DEFINE1(swapoff, const char __user *, specialfile)
> struct address_space *mapping;
> struct inode *inode;
> struct filename *pathname;
> - int i, type, prev;
> - int err;
> + int err, found = 0;
> unsigned int old_block_size;
>
> if (!capable(CAP_SYS_ADMIN))
> @@ -1842,17 +1806,16 @@ SYSCALL_DEFINE1(swapoff, const char __user *, specialfile)
> goto out;
>
> mapping = victim->f_mapping;
> - prev = -1;
> spin_lock(&swap_lock);
> - for (type = swap_list.head; type >= 0; type = swap_info[type]->next) {
> - p = swap_info[type];
> + list_for_each_entry(p, &swap_list_head, list) {
> if (p->flags & SWP_WRITEOK) {
> - if (p->swap_file->f_mapping == mapping)
> + if (p->swap_file->f_mapping == mapping) {
> + found = 1;
> break;
> + }
> }
> - prev = type;
> }
> - if (type < 0) {
> + if (!found) {
> err = -EINVAL;
> spin_unlock(&swap_lock);
> goto out_dput;
> @@ -1864,20 +1827,15 @@ SYSCALL_DEFINE1(swapoff, const char __user *, specialfile)
> spin_unlock(&swap_lock);
> goto out_dput;
> }
> - if (prev < 0)
> - swap_list.head = p->next;
> - else
> - swap_info[prev]->next = p->next;
> - if (type == swap_list.next) {
> - /* just pick something that's safe... */
> - swap_list.next = swap_list.head;
> - }
> spin_lock(&p->lock);
> if (p->prio < 0) {
> - for (i = p->next; i >= 0; i = swap_info[i]->next)
> - swap_info[i]->prio = p->prio--;
> + struct swap_info_struct *si = p;
> + list_for_each_entry_continue(si, &swap_list_head, list) {
> + si->prio++;
> + }
> least_priority++;
> }
> + list_del_init(&p->list);
> atomic_long_sub(p->pages, &nr_swap_pages);
> total_swap_pages -= p->pages;
> p->flags &= ~SWP_WRITEOK;
> @@ -1885,7 +1843,7 @@ SYSCALL_DEFINE1(swapoff, const char __user *, specialfile)
> spin_unlock(&swap_lock);
>
> set_current_oom_origin();
> - err = try_to_unuse(type, false, 0); /* force all pages to be unused */
> + err = try_to_unuse(p->type, false, 0); /* force unuse all pages */
> clear_current_oom_origin();
>
> if (err) {
> @@ -1926,7 +1884,7 @@ SYSCALL_DEFINE1(swapoff, const char __user *, specialfile)
> frontswap_map = frontswap_map_get(p);
> spin_unlock(&p->lock);
> spin_unlock(&swap_lock);
> - frontswap_invalidate_area(type);
> + frontswap_invalidate_area(p->type);
> frontswap_map_set(p, NULL);
> mutex_unlock(&swapon_mutex);
> free_percpu(p->percpu_cluster);
> @@ -1935,7 +1893,7 @@ SYSCALL_DEFINE1(swapoff, const char __user *, specialfile)
> vfree(cluster_info);
> vfree(frontswap_map);
> /* Destroy swap account information */
> - swap_cgroup_swapoff(type);
> + swap_cgroup_swapoff(p->type);
>
> inode = mapping->host;
> if (S_ISBLK(inode->i_mode)) {
> @@ -2142,8 +2100,8 @@ static struct swap_info_struct *alloc_swap_info(void)
> */
> }
> INIT_LIST_HEAD(&p->first_swap_extent.list);
> + INIT_LIST_HEAD(&p->list);
> p->flags = SWP_USED;
> - p->next = -1;
> spin_unlock(&swap_lock);
> spin_lock_init(&p->lock);
>
> --
> 1.8.3.1
>
--
Mel Gorman
SUSE Labs
On Sat, Apr 12, 2014 at 05:00:54PM -0400, Dan Streetman wrote:
> Originally get_swap_page() started iterating through the singly-linked
> list of swap_info_structs using swap_list.next or highest_priority_index,
> which both were intended to point to the highest priority active swap
> target that was not full. The previous patch in this series changed the
> singly-linked list to a doubly-linked list, and removed the logic to start
> at the highest priority non-full entry; it starts scanning at the highest
> priority entry each time, even if the entry is full.
>
> Add a new list, also priority ordered, to track only swap_info_structs
> that are available, i.e. active and not full. Use a new spinlock so that
> entries can be added/removed outside of get_swap_page; that wasn't possible
> previously because the main list is protected by swap_lock, which can't be
> taken when holding a swap_info_struct->lock because of locking order.
> The get_swap_page() logic now does not need to hold the swap_lock, and it
> iterates only through swap_info_structs that are available.
>
> Signed-off-by: Dan Streetman <[email protected]>
> ---
> include/linux/swap.h | 1 +
> mm/swapfile.c | 128 ++++++++++++++++++++++++++++++++++-----------------
> 2 files changed, 87 insertions(+), 42 deletions(-)
>
> diff --git a/include/linux/swap.h b/include/linux/swap.h
> index 96662d8..d9263db 100644
> --- a/include/linux/swap.h
> +++ b/include/linux/swap.h
> @@ -214,6 +214,7 @@ struct percpu_cluster {
> struct swap_info_struct {
> unsigned long flags; /* SWP_USED etc: see above */
> signed short prio; /* swap priority of this type */
> + struct list_head prio_list; /* entry in priority list */
> struct list_head list; /* entry in swap list */
> signed char type; /* strange name for an index */
> unsigned int max; /* extent of the swap_map */
> diff --git a/mm/swapfile.c b/mm/swapfile.c
> index b958645..3c38461 100644
> --- a/mm/swapfile.c
> +++ b/mm/swapfile.c
> @@ -57,9 +57,13 @@ static const char Unused_file[] = "Unused swap file entry ";
> static const char Bad_offset[] = "Bad swap offset entry ";
> static const char Unused_offset[] = "Unused swap offset entry ";
>
> -/* all active swap_info */
> +/* all active swap_info; protected with swap_lock */
> LIST_HEAD(swap_list_head);
>
> +/* all available (active, not full) swap_info, priority ordered */
> +static LIST_HEAD(prio_head);
> +static DEFINE_SPINLOCK(prio_lock);
> +
I get why you maintain two lists with separate locking, but it's code that
is specific to swap and, in many respects, very similar to a plist. Is
there a reason why plist was not used, at least for prio_head? They're used
for futexes, so presumably the performance is reasonable. It might reduce
the size of swapfile.c further.
Admittedly, plist does not have an equivalent of the rotate operation,
which you need to recycle the entries of equal priority, but you could add
a plist_shuffle helper that "rotates the list left if the next entry is of
equal priority".
I was going to suggest that you could then get rid of swap_list_head but
it's a relatively big change. swapoff wouldn't care but frontswap would
suffer if it had to walk all of swap_info[] to find all active swap
files.
> struct swap_info_struct *swap_info[MAX_SWAPFILES];
>
> static DEFINE_MUTEX(swapon_mutex);
> @@ -73,6 +77,27 @@ static inline unsigned char swap_count(unsigned char ent)
> return ent & ~SWAP_HAS_CACHE; /* may include SWAP_HAS_CONT flag */
> }
>
> +/*
> + * add, in priority order, swap_info (p)->(le) list_head to list (lh)
> + * this list-generic function is needed because both swap_list_head
> + * and prio_head need to be priority ordered:
> + * swap_list_head in swapoff to adjust lower negative prio swap_infos
> + * prio_list in get_swap_page to scan highest prio swap_info first
> + */
> +#define swap_info_list_add(p, lh, le) do { \
> + struct swap_info_struct *_si; \
> + BUG_ON(!list_empty(&(p)->le)); \
> + list_for_each_entry(_si, (lh), le) { \
> + if ((p)->prio >= _si->prio) { \
> + list_add_tail(&(p)->le, &_si->le); \
> + break; \
> + } \
> + } \
> + /* lh empty, or p lowest prio */ \
> + if (list_empty(&(p)->le)) \
> + list_add_tail(&(p)->le, (lh)); \
> +} while (0)
> +
Why is this a #define instead of a static uninlined function?
That aside, it's again very similar to what a plist does with some
minor structure modifications.
--
Mel Gorman
SUSE Labs
On Wed, Apr 23, 2014 at 11:34:00AM +0100, Mel Gorman wrote:
> > @@ -366,7 +361,7 @@ static int __frontswap_unuse_pages(unsigned long total, unsigned long *unused,
> > }
> > vm_unacct_memory(pages);
> > *unused = pages_to_unuse;
> > - *swapid = type;
> > + *swapid = si->type;
> > ret = 0;
> > break;
> > }
> > @@ -413,7 +408,7 @@ void frontswap_shrink(unsigned long target_pages)
> > /*
> > * we don't want to hold swap_lock while doing a very
> > * lengthy try_to_unuse, but swap_list may change
> > - * so restart scan from swap_list.head each time
> > + * so restart scan from swap_list_head each time
> > */
> > spin_lock(&swap_lock);
> > ret = __frontswap_shrink(target_pages, &pages_to_unuse, &type);
> > diff --git a/mm/swapfile.c b/mm/swapfile.c
> > index 4a7f7e6..b958645 100644
> > --- a/mm/swapfile.c
> > +++ b/mm/swapfile.c
> > @@ -51,14 +51,14 @@ atomic_long_t nr_swap_pages;
> > /* protected with swap_lock. reading in vm_swap_full() doesn't need lock */
> > long total_swap_pages;
> > static int least_priority;
> > -static atomic_t highest_priority_index = ATOMIC_INIT(-1);
> >
> > static const char Bad_file[] = "Bad swap file entry ";
> > static const char Unused_file[] = "Unused swap file entry ";
> > static const char Bad_offset[] = "Bad swap offset entry ";
> > static const char Unused_offset[] = "Unused swap offset entry ";
> >
> > -struct swap_list_t swap_list = {-1, -1};
> > +/* all active swap_info */
> > +LIST_HEAD(swap_list_head);
> >
> > struct swap_info_struct *swap_info[MAX_SWAPFILES];
> >
> > @@ -640,66 +640,50 @@ no_page:
> >
> > swp_entry_t get_swap_page(void)
> > {
> > - struct swap_info_struct *si;
> > + struct swap_info_struct *si, *next;
> > pgoff_t offset;
> > - int type, next;
> > - int wrapped = 0;
> > - int hp_index;
> > + struct list_head *tmp;
> >
> > spin_lock(&swap_lock);
> > if (atomic_long_read(&nr_swap_pages) <= 0)
> > goto noswap;
> > atomic_long_dec(&nr_swap_pages);
> >
> > - for (type = swap_list.next; type >= 0 && wrapped < 2; type = next) {
> > - hp_index = atomic_xchg(&highest_priority_index, -1);
> > - /*
> > - * highest_priority_index records current highest priority swap
> > - * type which just frees swap entries. If its priority is
> > - * higher than that of swap_list.next swap type, we use it. It
> > - * isn't protected by swap_lock, so it can be an invalid value
> > - * if the corresponding swap type is swapoff. We double check
> > - * the flags here. It's even possible the swap type is swapoff
> > - * and swapon again and its priority is changed. In such rare
> > - * case, low prority swap type might be used, but eventually
> > - * high priority swap will be used after several rounds of
> > - * swap.
> > - */
> > - if (hp_index != -1 && hp_index != type &&
> > - swap_info[type]->prio < swap_info[hp_index]->prio &&
> > - (swap_info[hp_index]->flags & SWP_WRITEOK)) {
> > - type = hp_index;
> > - swap_list.next = type;
> > - }
> > -
> > - si = swap_info[type];
> > - next = si->next;
> > - if (next < 0 ||
> > - (!wrapped && si->prio != swap_info[next]->prio)) {
> > - next = swap_list.head;
> > - wrapped++;
> > - }
> > -
> > + list_for_each(tmp, &swap_list_head) {
> > + si = list_entry(tmp, typeof(*si), list);
> > spin_lock(&si->lock);
> > - if (!si->highest_bit) {
> > - spin_unlock(&si->lock);
> > - continue;
> > - }
> > - if (!(si->flags & SWP_WRITEOK)) {
> > + if (!si->highest_bit || !(si->flags & SWP_WRITEOK)) {
> > spin_unlock(&si->lock);
> > continue;
> > }
> >
> > - swap_list.next = next;
> > + /*
> > + * rotate the current swap_info that we're going to use
> > + * to after any other swap_info that have the same prio,
> > + * so that all equal-priority swap_info get used equally
> > + */
> > + next = si;
> > + list_for_each_entry_continue(next, &swap_list_head, list) {
> > + if (si->prio != next->prio)
> > + break;
> > + list_rotate_left(&si->list);
> > + next = si;
> > + }
> >
>
> The list manipulations will be a lot of cache writes as the list is shuffled
> around. On slow storage I do not think this will be noticable but it may
> be noticable on faster swap devices that are SSD based. I've added Shaohua
> Li to the cc as he has been concerned with the performance of swap in the
> past. Shaohua, can you run this patchset through any of your test cases
> with the addition that multiple swap files are used to see if the cache
> writes are noticable? You'll need multiple swap files, some of which are
> at equal priority so the list shuffling logic is triggered.
get_swap_page isn't hot so far (and we hold the swap_lock, which isn't
contended); I'd guess that's because other costs hide it, for example
TLB flush overhead.
Thanks,
Shaohua
On Thu, Apr 24, 2014 at 08:17:05AM +0800, Shaohua Li wrote:
> On Wed, Apr 23, 2014 at 11:34:00AM +0100, Mel Gorman wrote:
> > > @@ -366,7 +361,7 @@ static int __frontswap_unuse_pages(unsigned long total, unsigned long *unused,
> > > }
> > > vm_unacct_memory(pages);
> > > *unused = pages_to_unuse;
> > > - *swapid = type;
> > > + *swapid = si->type;
> > > ret = 0;
> > > break;
> > > }
> > > @@ -413,7 +408,7 @@ void frontswap_shrink(unsigned long target_pages)
> > > /*
> > > * we don't want to hold swap_lock while doing a very
> > > * lengthy try_to_unuse, but swap_list may change
> > > - * so restart scan from swap_list.head each time
> > > + * so restart scan from swap_list_head each time
> > > */
> > > spin_lock(&swap_lock);
> > > ret = __frontswap_shrink(target_pages, &pages_to_unuse, &type);
> > > diff --git a/mm/swapfile.c b/mm/swapfile.c
> > > index 4a7f7e6..b958645 100644
> > > --- a/mm/swapfile.c
> > > +++ b/mm/swapfile.c
> > > @@ -51,14 +51,14 @@ atomic_long_t nr_swap_pages;
> > > /* protected with swap_lock. reading in vm_swap_full() doesn't need lock */
> > > long total_swap_pages;
> > > static int least_priority;
> > > -static atomic_t highest_priority_index = ATOMIC_INIT(-1);
> > >
> > > static const char Bad_file[] = "Bad swap file entry ";
> > > static const char Unused_file[] = "Unused swap file entry ";
> > > static const char Bad_offset[] = "Bad swap offset entry ";
> > > static const char Unused_offset[] = "Unused swap offset entry ";
> > >
> > > -struct swap_list_t swap_list = {-1, -1};
> > > +/* all active swap_info */
> > > +LIST_HEAD(swap_list_head);
> > >
> > > struct swap_info_struct *swap_info[MAX_SWAPFILES];
> > >
> > > @@ -640,66 +640,50 @@ no_page:
> > >
> > > swp_entry_t get_swap_page(void)
> > > {
> > > - struct swap_info_struct *si;
> > > + struct swap_info_struct *si, *next;
> > > pgoff_t offset;
> > > - int type, next;
> > > - int wrapped = 0;
> > > - int hp_index;
> > > + struct list_head *tmp;
> > >
> > > spin_lock(&swap_lock);
> > > if (atomic_long_read(&nr_swap_pages) <= 0)
> > > goto noswap;
> > > atomic_long_dec(&nr_swap_pages);
> > >
> > > - for (type = swap_list.next; type >= 0 && wrapped < 2; type = next) {
> > > - hp_index = atomic_xchg(&highest_priority_index, -1);
> > > - /*
> > > - * highest_priority_index records current highest priority swap
> > > - * type which just frees swap entries. If its priority is
> > > - * higher than that of swap_list.next swap type, we use it. It
> > > - * isn't protected by swap_lock, so it can be an invalid value
> > > - * if the corresponding swap type is swapoff. We double check
> > > - * the flags here. It's even possible the swap type is swapoff
> > > - * and swapon again and its priority is changed. In such rare
> > > - * case, low prority swap type might be used, but eventually
> > > - * high priority swap will be used after several rounds of
> > > - * swap.
> > > - */
> > > - if (hp_index != -1 && hp_index != type &&
> > > - swap_info[type]->prio < swap_info[hp_index]->prio &&
> > > - (swap_info[hp_index]->flags & SWP_WRITEOK)) {
> > > - type = hp_index;
> > > - swap_list.next = type;
> > > - }
> > > -
> > > - si = swap_info[type];
> > > - next = si->next;
> > > - if (next < 0 ||
> > > - (!wrapped && si->prio != swap_info[next]->prio)) {
> > > - next = swap_list.head;
> > > - wrapped++;
> > > - }
> > > -
> > > + list_for_each(tmp, &swap_list_head) {
> > > + si = list_entry(tmp, typeof(*si), list);
> > > spin_lock(&si->lock);
> > > - if (!si->highest_bit) {
> > > - spin_unlock(&si->lock);
> > > - continue;
> > > - }
> > > - if (!(si->flags & SWP_WRITEOK)) {
> > > + if (!si->highest_bit || !(si->flags & SWP_WRITEOK)) {
> > > spin_unlock(&si->lock);
> > > continue;
> > > }
> > >
> > > - swap_list.next = next;
> > > + /*
> > > + * rotate the current swap_info that we're going to use
> > > + * to after any other swap_info that have the same prio,
> > > + * so that all equal-priority swap_info get used equally
> > > + */
> > > + next = si;
> > > + list_for_each_entry_continue(next, &swap_list_head, list) {
> > > + if (si->prio != next->prio)
> > > + break;
> > > + list_rotate_left(&si->list);
> > > + next = si;
> > > + }
> > >
> >
> > The list manipulations will be a lot of cache writes as the list is shuffled
> > around. On slow storage I do not think this will be noticable but it may
> > be noticable on faster swap devices that are SSD based. I've added Shaohua
> > Li to the cc as he has been concerned with the performance of swap in the
> > past. Shaohua, can you run this patchset through any of your test cases
> > with the addition that multiple swap files are used to see if the cache
> > writes are noticable? You'll need multiple swap files, some of which are
> > at equal priority so the list shuffling logic is triggered.
>
> get_swap_page isn't hot so far (and we hold the swap_lock, which isn't
> contended); I guess that's because other problems hide it, for example
> TLB flush overhead.
>
The old method was not free either but I wanted to be sure you were
aware of the series just in case. Thanks.
--
Mel Gorman
SUSE Labs
On Wed, Apr 23, 2014 at 9:14 AM, Mel Gorman <[email protected]> wrote:
> On Sat, Apr 12, 2014 at 05:00:54PM -0400, Dan Streetman wrote:
>> Originally get_swap_page() started iterating through the singly-linked
>> list of swap_info_structs using swap_list.next or highest_priority_index,
>> which both were intended to point to the highest priority active swap
>> target that was not full. The previous patch in this series changed the
>> singly-linked list to a doubly-linked list, and removed the logic to start
>> at the highest priority non-full entry; it starts scanning at the highest
>> priority entry each time, even if the entry is full.
>>
>> Add a new list, also priority ordered, to track only swap_info_structs
>> that are available, i.e. active and not full. Use a new spinlock so that
>> entries can be added/removed outside of get_swap_page; that wasn't possible
>> previously because the main list is protected by swap_lock, which can't be
>> taken when holding a swap_info_struct->lock because of locking order.
>> The get_swap_page() logic now does not need to hold the swap_lock, and it
>> iterates only through swap_info_structs that are available.
>>
>> Signed-off-by: Dan Streetman <[email protected]>
>> ---
>> include/linux/swap.h | 1 +
>> mm/swapfile.c | 128 ++++++++++++++++++++++++++++++++++-----------------
>> 2 files changed, 87 insertions(+), 42 deletions(-)
>>
>> diff --git a/include/linux/swap.h b/include/linux/swap.h
>> index 96662d8..d9263db 100644
>> --- a/include/linux/swap.h
>> +++ b/include/linux/swap.h
>> @@ -214,6 +214,7 @@ struct percpu_cluster {
>> struct swap_info_struct {
>> unsigned long flags; /* SWP_USED etc: see above */
>> signed short prio; /* swap priority of this type */
>> + struct list_head prio_list; /* entry in priority list */
>> struct list_head list; /* entry in swap list */
>> signed char type; /* strange name for an index */
>> unsigned int max; /* extent of the swap_map */
>> diff --git a/mm/swapfile.c b/mm/swapfile.c
>> index b958645..3c38461 100644
>> --- a/mm/swapfile.c
>> +++ b/mm/swapfile.c
>> @@ -57,9 +57,13 @@ static const char Unused_file[] = "Unused swap file entry ";
>> static const char Bad_offset[] = "Bad swap offset entry ";
>> static const char Unused_offset[] = "Unused swap offset entry ";
>>
>> -/* all active swap_info */
>> +/* all active swap_info; protected with swap_lock */
>> LIST_HEAD(swap_list_head);
>>
>> +/* all available (active, not full) swap_info, priority ordered */
>> +static LIST_HEAD(prio_head);
>> +static DEFINE_SPINLOCK(prio_lock);
>> +
>
> I get why you maintain two lists with separate locking but it's code that
> is specific to swap and in many respects, it's very similar to a plist. Is
> there a reason why plist was not used at least for prio_head? They're used
> for futex's so presumably the performance is reasonable. It might reduce
> the size of swapfile.c further.
>
> It is the case that plist does not have the equivalent of rotate which
> you need to recycle the entries of equal priority but you could add a
> plist_shuffle helper that "rotates the list left if the next entry is of
> equal priority".
I did look at plist, but as you said there's no plist_rotate_left(), so
that would either need to be implemented in plist or added as a
swap-specific helper.
The plist sort order is also the reverse of the swap sort order; plist
orders low->high while swap orders high->low. So either the prio
would need to be negated when stored in the plist to correct the
order, or plist_for_each_entry_reverse() would need to be added (or
another swap-specific helper).
I think the main (only?) benefit of plist is during adds; insertion
for lists is O(N) while insertion for plists is O(K) where K is the
number of different priorities. And for swap, the add only happens in
two places: on swapon (or if swapoff fails and reinserts it) and in
swap_entry_free() when the swap_info_struct changes from full to
not-full. The swapon and swapoff failure cases don't matter, and a
swap_info_struct changing from full to not-full is (probably?) going
to be a relatively rare occurrence. And even then, unless there are a
significant number of not-full same-priority swap_info_structs that
are higher prio than the one being added, there should (?) be no
difference between plist and normal list add speed.
Finally, using a plist further increases the size of each swap_info_struct.
So to me, it seemed like plists were best used in situations where
list entries were constantly being added/removed, and the add speed
needed to be as fast as possible. That isn't typically the case with
the swap_info_struct priority list, where adding/removing is a
relatively rare occurrence. However, in the case where there are
multiple groups of many same-priority swap devices/files, using plists
may reduce the add insertion time when one of the lower-priority
swap_info_struct changes from full to not-full after some of the
higher prio ones also have changed from full to not-full. If you
think it would be good to change to using a plist, I can update the
patch.
Otherwise, I can update the patch to add more code comments about why
a plist wasn't used, and/or update the commit log.
> I was going to suggest that you could then get rid of swap_list_head but
> it's a relatively big change. swapoff wouldn't care but frontswap would
> suffer if it had to walk all of swap_info[] to find all active swap
> files.
>
>> struct swap_info_struct *swap_info[MAX_SWAPFILES];
>>
>> static DEFINE_MUTEX(swapon_mutex);
>> @@ -73,6 +77,27 @@ static inline unsigned char swap_count(unsigned char ent)
>> return ent & ~SWAP_HAS_CACHE; /* may include SWAP_HAS_CONT flag */
>> }
>>
>> +/*
>> + * add, in priority order, swap_info (p)->(le) list_head to list (lh)
>> + * this list-generic function is needed because both swap_list_head
>> + * and prio_head need to be priority ordered:
>> + * swap_list_head in swapoff to adjust lower negative prio swap_infos
>> + * prio_list in get_swap_page to scan highest prio swap_info first
>> + */
>> +#define swap_info_list_add(p, lh, le) do { \
>> + struct swap_info_struct *_si; \
>> + BUG_ON(!list_empty(&(p)->le)); \
>> + list_for_each_entry(_si, (lh), le) { \
>> + if ((p)->prio >= _si->prio) { \
>> + list_add_tail(&(p)->le, &_si->le); \
>> + break; \
>> + } \
>> + } \
>> + /* lh empty, or p lowest prio */ \
>> + if (list_empty(&(p)->le)) \
>> + list_add_tail(&(p)->le, (lh)); \
>> +} while (0)
>> +
>
> Why is this a #define instead of a static uninlined function?
>
> That aside, it's again very similar to what a plist does with some
> minor structure modifications.
It's a #define because the ->member name is different: for the
swap_list_head the entry member name is list, while for the prio_head
the entry name is prio_list.
If the patch was accepted, I also planned to follow up with a patch to
add list_add_ordered(new, head, bool compare(a, b)) that would replace
this swap-specific #define. I do agree having an ordered list is
similar to what plist provides, but I don't think plist is a perfect
fit for what swap needs in this case.
>
> --
> Mel Gorman
> SUSE Labs
On Wed, Apr 23, 2014 at 6:34 AM, Mel Gorman <[email protected]> wrote:
> On Sat, Apr 12, 2014 at 05:00:53PM -0400, Dan Streetman wrote:
>> Replace the singly-linked list tracking active, i.e. swapon'ed,
>> swap_info_struct entries with a doubly-linked list using struct list_heads.
>> Simplify the logic iterating and manipulating the list of entries,
>> especially get_swap_page(), by using standard list_head functions,
>> and removing the highest priority iteration logic.
>>
>> The change fixes the bug:
>> https://lkml.org/lkml/2014/2/13/181
>> in which different priority swap entries after the highest priority entry
>> are incorrectly used equally in pairs. The swap behavior is now as
>> advertised, i.e. different priority swap entries are used in order, and
>> equal priority swap targets are used concurrently.
>>
>> Signed-off-by: Dan Streetman <[email protected]>
>> ---
>> include/linux/swap.h | 7 +--
>> include/linux/swapfile.h | 2 +-
>> mm/frontswap.c | 13 ++--
>> mm/swapfile.c | 156 +++++++++++++++++------------------------------
>> 4 files changed, 63 insertions(+), 115 deletions(-)
>>
>> diff --git a/include/linux/swap.h b/include/linux/swap.h
>> index 3507115..96662d8 100644
>> --- a/include/linux/swap.h
>> +++ b/include/linux/swap.h
>> @@ -214,8 +214,8 @@ struct percpu_cluster {
>> struct swap_info_struct {
>> unsigned long flags; /* SWP_USED etc: see above */
>> signed short prio; /* swap priority of this type */
>> + struct list_head list; /* entry in swap list */
>> signed char type; /* strange name for an index */
>> - signed char next; /* next type on the swap list */
>> unsigned int max; /* extent of the swap_map */
>> unsigned char *swap_map; /* vmalloc'ed array of usage counts */
>> struct swap_cluster_info *cluster_info; /* cluster info. Only for SSD */
>> @@ -255,11 +255,6 @@ struct swap_info_struct {
>> struct swap_cluster_info discard_cluster_tail; /* list tail of discard clusters */
>> };
>>
>> -struct swap_list_t {
>> - int head; /* head of priority-ordered swapfile list */
>> - int next; /* swapfile to be used next */
>> -};
>> -
>> /* linux/mm/workingset.c */
>> void *workingset_eviction(struct address_space *mapping, struct page *page);
>> bool workingset_refault(void *shadow);
>> diff --git a/include/linux/swapfile.h b/include/linux/swapfile.h
>> index e282624..2eab382 100644
>> --- a/include/linux/swapfile.h
>> +++ b/include/linux/swapfile.h
>> @@ -6,7 +6,7 @@
>> * want to expose them to the dozens of source files that include swap.h
>> */
>> extern spinlock_t swap_lock;
>> -extern struct swap_list_t swap_list;
>> +extern struct list_head swap_list_head;
>> extern struct swap_info_struct *swap_info[];
>> extern int try_to_unuse(unsigned int, bool, unsigned long);
>>
>> diff --git a/mm/frontswap.c b/mm/frontswap.c
>> index 1b24bdc..fae1160 100644
>> --- a/mm/frontswap.c
>> +++ b/mm/frontswap.c
>> @@ -327,15 +327,12 @@ EXPORT_SYMBOL(__frontswap_invalidate_area);
>>
>> static unsigned long __frontswap_curr_pages(void)
>> {
>> - int type;
>> unsigned long totalpages = 0;
>> struct swap_info_struct *si = NULL;
>>
>> assert_spin_locked(&swap_lock);
>> - for (type = swap_list.head; type >= 0; type = si->next) {
>> - si = swap_info[type];
>> + list_for_each_entry(si, &swap_list_head, list)
>> totalpages += atomic_read(&si->frontswap_pages);
>> - }
>> return totalpages;
>> }
>>
>> @@ -347,11 +344,9 @@ static int __frontswap_unuse_pages(unsigned long total, unsigned long *unused,
>> int si_frontswap_pages;
>> unsigned long total_pages_to_unuse = total;
>> unsigned long pages = 0, pages_to_unuse = 0;
>> - int type;
>>
>> assert_spin_locked(&swap_lock);
>> - for (type = swap_list.head; type >= 0; type = si->next) {
>> - si = swap_info[type];
>> + list_for_each_entry(si, &swap_list_head, list) {
>> si_frontswap_pages = atomic_read(&si->frontswap_pages);
>> if (total_pages_to_unuse < si_frontswap_pages) {
>> pages = pages_to_unuse = total_pages_to_unuse;
>
> The frontswap shrink code looks suspicious. If the target is smaller than
> the total number of frontswap pages then it does nothing. The callers
> appear to get this right at least. Similarly, if the first swapfile has
> fewer frontswap pages than the target then it does not unuse the target
> number of pages because it only handles one swap file. It's outside the
> scope of your patch to address this or wonder if xen balloon driver is
> really using it the way it's expected.
I didn't look into the frontswap shrinking code, but I agree the
existing logic there doesn't look right. I'll review frontswap in
more detail to see if it needs changing here, unless anyone else gets
to it first :-)
And as you said, it's outside the scope of this particular patch.
>
> The old code scanned the files in priority order. Superficially this does
> not appear to but it actually does because you add the swap files to the
> list in priority order during swapon. If you do another revision it's
> worth adding a comment above swap_list_head that the list is ordered by
> priority and protected by the swap_lock.
Yep, you're right, I will add a comment to make clear it's also
priority ordered, and why. The only reason it needs to be priority
ordered is that swapoff has to adjust the auto-priority of any
following swap_info_structs when one with an automatically assigned
(i.e. negative) priority is removed.
>
>> @@ -366,7 +361,7 @@ static int __frontswap_unuse_pages(unsigned long total, unsigned long *unused,
>> }
>> vm_unacct_memory(pages);
>> *unused = pages_to_unuse;
>> - *swapid = type;
>> + *swapid = si->type;
>> ret = 0;
>> break;
>> }
>> @@ -413,7 +408,7 @@ void frontswap_shrink(unsigned long target_pages)
>> /*
>> * we don't want to hold swap_lock while doing a very
>> * lengthy try_to_unuse, but swap_list may change
>> - * so restart scan from swap_list.head each time
>> + * so restart scan from swap_list_head each time
>> */
>> spin_lock(&swap_lock);
>> ret = __frontswap_shrink(target_pages, &pages_to_unuse, &type);
>> diff --git a/mm/swapfile.c b/mm/swapfile.c
>> index 4a7f7e6..b958645 100644
>> --- a/mm/swapfile.c
>> +++ b/mm/swapfile.c
>> @@ -51,14 +51,14 @@ atomic_long_t nr_swap_pages;
>> /* protected with swap_lock. reading in vm_swap_full() doesn't need lock */
>> long total_swap_pages;
>> static int least_priority;
>> -static atomic_t highest_priority_index = ATOMIC_INIT(-1);
>>
>> static const char Bad_file[] = "Bad swap file entry ";
>> static const char Unused_file[] = "Unused swap file entry ";
>> static const char Bad_offset[] = "Bad swap offset entry ";
>> static const char Unused_offset[] = "Unused swap offset entry ";
>>
>> -struct swap_list_t swap_list = {-1, -1};
>> +/* all active swap_info */
>> +LIST_HEAD(swap_list_head);
>>
>> struct swap_info_struct *swap_info[MAX_SWAPFILES];
>>
>> @@ -640,66 +640,50 @@ no_page:
>>
>> swp_entry_t get_swap_page(void)
>> {
>> - struct swap_info_struct *si;
>> + struct swap_info_struct *si, *next;
>> pgoff_t offset;
>> - int type, next;
>> - int wrapped = 0;
>> - int hp_index;
>> + struct list_head *tmp;
>>
>> spin_lock(&swap_lock);
>> if (atomic_long_read(&nr_swap_pages) <= 0)
>> goto noswap;
>> atomic_long_dec(&nr_swap_pages);
>>
>> - for (type = swap_list.next; type >= 0 && wrapped < 2; type = next) {
>> - hp_index = atomic_xchg(&highest_priority_index, -1);
>> - /*
>> - * highest_priority_index records current highest priority swap
>> - * type which just frees swap entries. If its priority is
>> - * higher than that of swap_list.next swap type, we use it. It
>> - * isn't protected by swap_lock, so it can be an invalid value
>> - * if the corresponding swap type is swapoff. We double check
>> - * the flags here. It's even possible the swap type is swapoff
>> - * and swapon again and its priority is changed. In such rare
>> - * case, low prority swap type might be used, but eventually
>> - * high priority swap will be used after several rounds of
>> - * swap.
>> - */
>> - if (hp_index != -1 && hp_index != type &&
>> - swap_info[type]->prio < swap_info[hp_index]->prio &&
>> - (swap_info[hp_index]->flags & SWP_WRITEOK)) {
>> - type = hp_index;
>> - swap_list.next = type;
>> - }
>> -
>> - si = swap_info[type];
>> - next = si->next;
>> - if (next < 0 ||
>> - (!wrapped && si->prio != swap_info[next]->prio)) {
>> - next = swap_list.head;
>> - wrapped++;
>> - }
>> -
>> + list_for_each(tmp, &swap_list_head) {
>> + si = list_entry(tmp, typeof(*si), list);
>> spin_lock(&si->lock);
>> - if (!si->highest_bit) {
>> - spin_unlock(&si->lock);
>> - continue;
>> - }
>> - if (!(si->flags & SWP_WRITEOK)) {
>> + if (!si->highest_bit || !(si->flags & SWP_WRITEOK)) {
>> spin_unlock(&si->lock);
>> continue;
>> }
>>
>> - swap_list.next = next;
>> + /*
>> + * rotate the current swap_info that we're going to use
>> + * to after any other swap_info that have the same prio,
>> + * so that all equal-priority swap_info get used equally
>> + */
>> + next = si;
>> + list_for_each_entry_continue(next, &swap_list_head, list) {
>> + if (si->prio != next->prio)
>> + break;
>> + list_rotate_left(&si->list);
>> + next = si;
>> + }
>>
>
> The list manipulations will be a lot of cache writes as the list is shuffled
> around. On slow storage I do not think this will be noticable but it may
> be noticable on faster swap devices that are SSD based. I've added Shaohua
> Li to the cc as he has been concerned with the performance of swap in the
> past. Shaohua, can you run this patchset through any of your test cases
> with the addition that multiple swap files are used to see if the cache
> writes are noticable? You'll need multiple swap files, some of which are
> at equal priority so the list shuffling logic is triggered.
One performance improvement could be, instead of rotating the current
entry past each following same-prio entry, to just scan to the end of
the run of same-prio entries and move the current entry there; that
would skip the extra writes. Especially since this code runs for
every get_swap_page(), there's no need for any unnecessary writes.
>
>> spin_unlock(&swap_lock);
>> /* This is called for allocating swap entry for cache */
>> offset = scan_swap_map(si, SWAP_HAS_CACHE);
>> spin_unlock(&si->lock);
>> if (offset)
>> - return swp_entry(type, offset);
>> + return swp_entry(si->type, offset);
>> spin_lock(&swap_lock);
>> - next = swap_list.next;
>> + /*
>> + * shouldn't really have got here, but for some reason the
>> + * scan_swap_map came back empty for this swap_info.
>> + * Since we dropped the swap_lock, there may now be
>> + * non-full higher prio swap_infos; let's start over.
>> + */
>> + tmp = &swap_list_head;
>> }
>
> Has this ever triggered? The number of swap pages was examined under the
> swap lock so no other process should have been iterating through the
> swap files. Once a candidate was found, the si lock was acquired for the
> swap scan map so nothing else should have raced with it.
Well, scan_swap_map() does drop the si->lock if it has any trouble at
all finding an offset to use, so I think it's possible that, for a
nearly-full si, multiple concurrent get_swap_page() calls could enter
scan_swap_map() with the same si; only some of them actually get pages
from the si before it becomes full, and the other threads in
scan_swap_map() then see it's full and exit in failure. I can update
the code comment there to better indicate why it was reached, instead
of just saying "we shouldn't have got here" :-)
It may also be worth trying to get a better indicator of "available"
swap_info_structs for use in get_swap_page(), either by looking at
something other than si->highest_bit and/or by keeping the si out of
the prio_list until it's actually available for use, not just until it
has a single entry free. However, that probably won't be simple and
might be better as a separate patch from the rest of these changes.
>
>>
>> atomic_long_inc(&nr_swap_pages);
>> @@ -766,27 +750,6 @@ out:
>> return NULL;
>> }
>>
>> -/*
>> - * This swap type frees swap entry, check if it is the highest priority swap
>> - * type which just frees swap entry. get_swap_page() uses
>> - * highest_priority_index to search highest priority swap type. The
>> - * swap_info_struct.lock can't protect us if there are multiple swap types
>> - * active, so we use atomic_cmpxchg.
>> - */
>> -static void set_highest_priority_index(int type)
>> -{
>> - int old_hp_index, new_hp_index;
>> -
>> - do {
>> - old_hp_index = atomic_read(&highest_priority_index);
>> - if (old_hp_index != -1 &&
>> - swap_info[old_hp_index]->prio >= swap_info[type]->prio)
>> - break;
>> - new_hp_index = type;
>> - } while (atomic_cmpxchg(&highest_priority_index,
>> - old_hp_index, new_hp_index) != old_hp_index);
>> -}
>> -
>> static unsigned char swap_entry_free(struct swap_info_struct *p,
>> swp_entry_t entry, unsigned char usage)
>> {
>> @@ -830,7 +793,6 @@ static unsigned char swap_entry_free(struct swap_info_struct *p,
>> p->lowest_bit = offset;
>> if (offset > p->highest_bit)
>> p->highest_bit = offset;
>> - set_highest_priority_index(p->type);
>> atomic_long_inc(&nr_swap_pages);
>> p->inuse_pages--;
>> frontswap_invalidate_page(p->type, offset);
>> @@ -1765,7 +1727,7 @@ static void _enable_swap_info(struct swap_info_struct *p, int prio,
>> unsigned char *swap_map,
>> struct swap_cluster_info *cluster_info)
>> {
>> - int i, prev;
>> + struct swap_info_struct *si;
>>
>> if (prio >= 0)
>> p->prio = prio;
>> @@ -1777,18 +1739,21 @@ static void _enable_swap_info(struct swap_info_struct *p, int prio,
>> atomic_long_add(p->pages, &nr_swap_pages);
>> total_swap_pages += p->pages;
>>
>> - /* insert swap space into swap_list: */
>> - prev = -1;
>> - for (i = swap_list.head; i >= 0; i = swap_info[i]->next) {
>> - if (p->prio >= swap_info[i]->prio)
>> - break;
>> - prev = i;
>> + assert_spin_locked(&swap_lock);
>> + BUG_ON(!list_empty(&p->list));
>> + /* insert into swap list: */
>> + list_for_each_entry(si, &swap_list_head, list) {
>> + if (p->prio >= si->prio) {
>> + list_add_tail(&p->list, &si->list);
>> + return;
>> + }
>
> An additional comment saying that it must be priority ordered for
> get_swap_page wouldn't kill.
will do.
>
>> }
>> - p->next = i;
>> - if (prev < 0)
>> - swap_list.head = swap_list.next = p->type;
>> - else
>> - swap_info[prev]->next = p->type;
>> + /*
>> + * this covers two cases:
>> + * 1) p->prio is less than all existing prio
>> + * 2) the swap list is empty
>> + */
>> + list_add_tail(&p->list, &swap_list_head);
>> }
>>
>> static void enable_swap_info(struct swap_info_struct *p, int prio,
>> @@ -1823,8 +1788,7 @@ SYSCALL_DEFINE1(swapoff, const char __user *, specialfile)
>> struct address_space *mapping;
>> struct inode *inode;
>> struct filename *pathname;
>> - int i, type, prev;
>> - int err;
>> + int err, found = 0;
>> unsigned int old_block_size;
>>
>> if (!capable(CAP_SYS_ADMIN))
>> @@ -1842,17 +1806,16 @@ SYSCALL_DEFINE1(swapoff, const char __user *, specialfile)
>> goto out;
>>
>> mapping = victim->f_mapping;
>> - prev = -1;
>> spin_lock(&swap_lock);
>> - for (type = swap_list.head; type >= 0; type = swap_info[type]->next) {
>> - p = swap_info[type];
>> + list_for_each_entry(p, &swap_list_head, list) {
>> if (p->flags & SWP_WRITEOK) {
>> - if (p->swap_file->f_mapping == mapping)
>> + if (p->swap_file->f_mapping == mapping) {
>> + found = 1;
>> break;
>> + }
>> }
>> - prev = type;
>> }
>> - if (type < 0) {
>> + if (!found) {
>> err = -EINVAL;
>> spin_unlock(&swap_lock);
>> goto out_dput;
>> @@ -1864,20 +1827,15 @@ SYSCALL_DEFINE1(swapoff, const char __user *, specialfile)
>> spin_unlock(&swap_lock);
>> goto out_dput;
>> }
>> - if (prev < 0)
>> - swap_list.head = p->next;
>> - else
>> - swap_info[prev]->next = p->next;
>> - if (type == swap_list.next) {
>> - /* just pick something that's safe... */
>> - swap_list.next = swap_list.head;
>> - }
>> spin_lock(&p->lock);
>> if (p->prio < 0) {
>> - for (i = p->next; i >= 0; i = swap_info[i]->next)
>> - swap_info[i]->prio = p->prio--;
>> + struct swap_info_struct *si = p;
>> + list_for_each_entry_continue(si, &swap_list_head, list) {
>> + si->prio++;
>> + }
>> least_priority++;
>> }
>> + list_del_init(&p->list);
>> atomic_long_sub(p->pages, &nr_swap_pages);
>> total_swap_pages -= p->pages;
>> p->flags &= ~SWP_WRITEOK;
>> @@ -1885,7 +1843,7 @@ SYSCALL_DEFINE1(swapoff, const char __user *, specialfile)
>> spin_unlock(&swap_lock);
>>
>> set_current_oom_origin();
>> - err = try_to_unuse(type, false, 0); /* force all pages to be unused */
>> + err = try_to_unuse(p->type, false, 0); /* force unuse all pages */
>> clear_current_oom_origin();
>>
>> if (err) {
>> @@ -1926,7 +1884,7 @@ SYSCALL_DEFINE1(swapoff, const char __user *, specialfile)
>> frontswap_map = frontswap_map_get(p);
>> spin_unlock(&p->lock);
>> spin_unlock(&swap_lock);
>> - frontswap_invalidate_area(type);
>> + frontswap_invalidate_area(p->type);
>> frontswap_map_set(p, NULL);
>> mutex_unlock(&swapon_mutex);
>> free_percpu(p->percpu_cluster);
>> @@ -1935,7 +1893,7 @@ SYSCALL_DEFINE1(swapoff, const char __user *, specialfile)
>> vfree(cluster_info);
>> vfree(frontswap_map);
>> /* Destroy swap account information */
>> - swap_cgroup_swapoff(type);
>> + swap_cgroup_swapoff(p->type);
>>
>> inode = mapping->host;
>> if (S_ISBLK(inode->i_mode)) {
>> @@ -2142,8 +2100,8 @@ static struct swap_info_struct *alloc_swap_info(void)
>> */
>> }
>> INIT_LIST_HEAD(&p->first_swap_extent.list);
>> + INIT_LIST_HEAD(&p->list);
>> p->flags = SWP_USED;
>> - p->next = -1;
>> spin_unlock(&swap_lock);
>> spin_lock_init(&p->lock);
>>
>> --
>> 1.8.3.1
>>
>
> --
> Mel Gorman
> SUSE Labs
On Fri, Apr 25, 2014 at 2:48 AM, Dan Streetman <[email protected]> wrote:
> On Wed, Apr 23, 2014 at 6:34 AM, Mel Gorman <[email protected]> wrote:
>> On Sat, Apr 12, 2014 at 05:00:53PM -0400, Dan Streetman wrote:
>>> Replace the singly-linked list tracking active, i.e. swapon'ed,
>>> swap_info_struct entries with a doubly-linked list using struct list_heads.
>>> Simplify the logic iterating and manipulating the list of entries,
>>> especially get_swap_page(), by using standard list_head functions,
>>> and removing the highest priority iteration logic.
>>>
>>> The change fixes the bug:
>>> https://lkml.org/lkml/2014/2/13/181
>>> in which different priority swap entries after the highest priority entry
>>> are incorrectly used equally in pairs. The swap behavior is now as
>>> advertised, i.e. different priority swap entries are used in order, and
>>> equal priority swap targets are used concurrently.
>>>
>>> Signed-off-by: Dan Streetman <[email protected]>
>>> [...]
>>> @@ -347,11 +344,9 @@ static int __frontswap_unuse_pages(unsigned long total, unsigned long *unused,
>>> int si_frontswap_pages;
>>> unsigned long total_pages_to_unuse = total;
>>> unsigned long pages = 0, pages_to_unuse = 0;
>>> - int type;
>>>
>>> assert_spin_locked(&swap_lock);
>>> - for (type = swap_list.head; type >= 0; type = si->next) {
>>> - si = swap_info[type];
>>> + list_for_each_entry(si, &swap_list_head, list) {
>>> si_frontswap_pages = atomic_read(&si->frontswap_pages);
>>> if (total_pages_to_unuse < si_frontswap_pages) {
>>> pages = pages_to_unuse = total_pages_to_unuse;
>>
>> The frontswap shrink code looks suspicious. If the target is smaller than
>> the total number of frontswap pages then it does nothing. The callers
>> appear to get this right at least. Similarly, if the first swapfile has
>> fewer frontswap pages than the target then it does not unuse the target
>> number of pages because it only handles one swap file. It's outside the
>> scope of your patch to address this or wonder if xen balloon driver is
>> really using it the way it's expected.
>
> I didn't look into the frontswap shrinking code, but I agree the
> existing logic there doesn't look right. I'll review frontswap in
> more detail to see if it needs changing here, unless anyone else gets
> to it first :-)
>
FYI, I dropped the frontswap_shrink code in a patch;
see: https://lkml.org/lkml/2014/1/27/98
> And as you said, it's outside the scope of this particular patch.
>
>>
>> The old code scanned the files in priority order. Superficially this does
>> not appear to but it actually does because you add the swap files to the
>> list in priority order during swapon. If you do another revision it's
>> worth adding a comment above swap_list_head that the list is ordered by
>> priority and protected by the swap_lock.
>
> Yep, you're right, I will add a comment to make clear it's also
> priority ordered, and why. The only reason it needs to be priority
> ordered is that swapoff has to adjust the auto-priority of any
> following swap_info_structs when one with an automatically assigned
> (i.e. negative) priority is removed.
>
>>
>>> @@ -366,7 +361,7 @@ static int __frontswap_unuse_pages(unsigned long total, unsigned long *unused,
>>> }
>>> vm_unacct_memory(pages);
>>> *unused = pages_to_unuse;
>>> - *swapid = type;
>>> + *swapid = si->type;
>>> ret = 0;
>>> break;
>>> }
>>> @@ -413,7 +408,7 @@ void frontswap_shrink(unsigned long target_pages)
>>> /*
>>> * we don't want to hold swap_lock while doing a very
>>> * lengthy try_to_unuse, but swap_list may change
>>> - * so restart scan from swap_list.head each time
>>> + * so restart scan from swap_list_head each time
>>> */
>>> spin_lock(&swap_lock);
>>> ret = __frontswap_shrink(target_pages, &pages_to_unuse, &type);
>>> diff --git a/mm/swapfile.c b/mm/swapfile.c
>>> index 4a7f7e6..b958645 100644
>>> --- a/mm/swapfile.c
>>> +++ b/mm/swapfile.c
>>> @@ -51,14 +51,14 @@ atomic_long_t nr_swap_pages;
>>> /* protected with swap_lock. reading in vm_swap_full() doesn't need lock */
>>> long total_swap_pages;
>>> static int least_priority;
>>> -static atomic_t highest_priority_index = ATOMIC_INIT(-1);
>>>
>>> static const char Bad_file[] = "Bad swap file entry ";
>>> static const char Unused_file[] = "Unused swap file entry ";
>>> static const char Bad_offset[] = "Bad swap offset entry ";
>>> static const char Unused_offset[] = "Unused swap offset entry ";
>>>
>>> -struct swap_list_t swap_list = {-1, -1};
>>> +/* all active swap_info */
>>> +LIST_HEAD(swap_list_head);
>>>
>>> struct swap_info_struct *swap_info[MAX_SWAPFILES];
>>>
>>> @@ -640,66 +640,50 @@ no_page:
>>>
>>> swp_entry_t get_swap_page(void)
>>> {
>>> - struct swap_info_struct *si;
>>> + struct swap_info_struct *si, *next;
>>> pgoff_t offset;
>>> - int type, next;
>>> - int wrapped = 0;
>>> - int hp_index;
>>> + struct list_head *tmp;
>>>
>>> spin_lock(&swap_lock);
>>> if (atomic_long_read(&nr_swap_pages) <= 0)
>>> goto noswap;
>>> atomic_long_dec(&nr_swap_pages);
>>>
>>> - for (type = swap_list.next; type >= 0 && wrapped < 2; type = next) {
>>> - hp_index = atomic_xchg(&highest_priority_index, -1);
>>> - /*
>>> - * highest_priority_index records current highest priority swap
>>> - * type which just frees swap entries. If its priority is
>>> - * higher than that of swap_list.next swap type, we use it. It
>>> - * isn't protected by swap_lock, so it can be an invalid value
>>> - * if the corresponding swap type is swapoff. We double check
>>> - * the flags here. It's even possible the swap type is swapoff
>>> - * and swapon again and its priority is changed. In such rare
>>> - * case, low prority swap type might be used, but eventually
>>> - * high priority swap will be used after several rounds of
>>> - * swap.
>>> - */
>>> - if (hp_index != -1 && hp_index != type &&
>>> - swap_info[type]->prio < swap_info[hp_index]->prio &&
>>> - (swap_info[hp_index]->flags & SWP_WRITEOK)) {
>>> - type = hp_index;
>>> - swap_list.next = type;
>>> - }
>>> -
>>> - si = swap_info[type];
>>> - next = si->next;
>>> - if (next < 0 ||
>>> - (!wrapped && si->prio != swap_info[next]->prio)) {
>>> - next = swap_list.head;
>>> - wrapped++;
>>> - }
>>> -
>>> + list_for_each(tmp, &swap_list_head) {
>>> + si = list_entry(tmp, typeof(*si), list);
>>> spin_lock(&si->lock);
>>> - if (!si->highest_bit) {
>>> - spin_unlock(&si->lock);
>>> - continue;
>>> - }
>>> - if (!(si->flags & SWP_WRITEOK)) {
>>> + if (!si->highest_bit || !(si->flags & SWP_WRITEOK)) {
>>> spin_unlock(&si->lock);
>>> continue;
>>> }
>>>
>>> - swap_list.next = next;
>>> + /*
>>> + * rotate the current swap_info that we're going to use
>>> + * to after any other swap_info that have the same prio,
>>> + * so that all equal-priority swap_info get used equally
>>> + */
>>> + next = si;
>>> + list_for_each_entry_continue(next, &swap_list_head, list) {
>>> + if (si->prio != next->prio)
>>> + break;
>>> + list_rotate_left(&si->list);
>>> + next = si;
>>> + }
>>>
>>
>> The list manipulations will be a lot of cache writes as the list is shuffled
>> around. On slow storage I do not think this will be noticeable but it may
>> be noticeable on faster swap devices that are SSD based. I've added Shaohua
>> Li to the cc as he has been concerned with the performance of swap in the
>> past. Shaohua, can you run this patchset through any of your test cases
>> with the addition that multiple swap files are used to see if the cache
>> writes are noticable? You'll need multiple swap files, some of which are
>> at equal priority so the list shuffling logic is triggered.
>
> One performance improvement could be instead of rotating the current
> entry past each following same-prio entry, just scan to the end of the
> same-prio entries and move the current entry there; that would skip
> the extra writes. Especially since this code will run for each
> get_swap_page(), no need for any unnecessary writes.
>
>>
>>> spin_unlock(&swap_lock);
>>> /* This is called for allocating swap entry for cache */
>>> offset = scan_swap_map(si, SWAP_HAS_CACHE);
>>> spin_unlock(&si->lock);
>>> if (offset)
>>> - return swp_entry(type, offset);
>>> + return swp_entry(si->type, offset);
>>> spin_lock(&swap_lock);
>>> - next = swap_list.next;
>>> + /*
>>> + * shouldn't really have got here, but for some reason the
>>> + * scan_swap_map came back empty for this swap_info.
>>> + * Since we dropped the swap_lock, there may now be
>>> + * non-full higher prio swap_infos; let's start over.
>>> + */
>>> + tmp = &swap_list_head;
>>> }
>>
>> Has this ever triggered? The number of swap pages was examined under the
>> swap lock so no other process should have been iterating through the
>> swap files. Once a candidate was found, the si lock was acquired for the
>> swap scan map so nothing else should have raced with it.
>
> Well scan_swap_map() does drop the si->lock if it has any trouble at
> all finding an offset to use, so I think it's possible that for a
> nearly-full si multiple concurrent get_swap_page() calls could enter
> scan_swap_map() with the same si, only some of them actually get pages
> from the si and then the si becomes full, and the other threads in
> scan_swap_map() see it's full and exit in failure. I can update the
> code comment there to better indicate why it was reached, instead of
> just saying "we shouldn't have got here" :-)
>
> It may also be worth trying to get a better indicator of "available"
> swap_info_structs for use in get_swap_page(), either by looking at
> something other than si->highest_bit and/or keeping the si out of the
> prio_list until it's actually available for use, not just has a single
> entry free. However, that probably won't be simple and might be
> better as a separate patch to the rest of these changes.
>
>>
>>>
>>> atomic_long_inc(&nr_swap_pages);
>>> @@ -766,27 +750,6 @@ out:
>>> return NULL;
>>> }
>>>
>>> -/*
>>> - * This swap type frees swap entry, check if it is the highest priority swap
>>> - * type which just frees swap entry. get_swap_page() uses
>>> - * highest_priority_index to search highest priority swap type. The
>>> - * swap_info_struct.lock can't protect us if there are multiple swap types
>>> - * active, so we use atomic_cmpxchg.
>>> - */
>>> -static void set_highest_priority_index(int type)
>>> -{
>>> - int old_hp_index, new_hp_index;
>>> -
>>> - do {
>>> - old_hp_index = atomic_read(&highest_priority_index);
>>> - if (old_hp_index != -1 &&
>>> - swap_info[old_hp_index]->prio >= swap_info[type]->prio)
>>> - break;
>>> - new_hp_index = type;
>>> - } while (atomic_cmpxchg(&highest_priority_index,
>>> - old_hp_index, new_hp_index) != old_hp_index);
>>> -}
>>> -
>>> static unsigned char swap_entry_free(struct swap_info_struct *p,
>>> swp_entry_t entry, unsigned char usage)
>>> {
>>> @@ -830,7 +793,6 @@ static unsigned char swap_entry_free(struct swap_info_struct *p,
>>> p->lowest_bit = offset;
>>> if (offset > p->highest_bit)
>>> p->highest_bit = offset;
>>> - set_highest_priority_index(p->type);
>>> atomic_long_inc(&nr_swap_pages);
>>> p->inuse_pages--;
>>> frontswap_invalidate_page(p->type, offset);
>>> @@ -1765,7 +1727,7 @@ static void _enable_swap_info(struct swap_info_struct *p, int prio,
>>> unsigned char *swap_map,
>>> struct swap_cluster_info *cluster_info)
>>> {
>>> - int i, prev;
>>> + struct swap_info_struct *si;
>>>
>>> if (prio >= 0)
>>> p->prio = prio;
>>> @@ -1777,18 +1739,21 @@ static void _enable_swap_info(struct swap_info_struct *p, int prio,
>>> atomic_long_add(p->pages, &nr_swap_pages);
>>> total_swap_pages += p->pages;
>>>
>>> - /* insert swap space into swap_list: */
>>> - prev = -1;
>>> - for (i = swap_list.head; i >= 0; i = swap_info[i]->next) {
>>> - if (p->prio >= swap_info[i]->prio)
>>> - break;
>>> - prev = i;
>>> + assert_spin_locked(&swap_lock);
>>> + BUG_ON(!list_empty(&p->list));
>>> + /* insert into swap list: */
>>> + list_for_each_entry(si, &swap_list_head, list) {
>>> + if (p->prio >= si->prio) {
>>> + list_add_tail(&p->list, &si->list);
>>> + return;
>>> + }
>>
>> An additional comment saying that it must be priority ordered for
>> get_swap_page wouldn't kill.
>
> will do.
>
>>
>>> }
>>> - p->next = i;
>>> - if (prev < 0)
>>> - swap_list.head = swap_list.next = p->type;
>>> - else
>>> - swap_info[prev]->next = p->type;
>>> + /*
>>> + * this covers two cases:
>>> + * 1) p->prio is less than all existing prio
>>> + * 2) the swap list is empty
>>> + */
>>> + list_add_tail(&p->list, &swap_list_head);
>>> }
>>>
>>> static void enable_swap_info(struct swap_info_struct *p, int prio,
>>> @@ -1823,8 +1788,7 @@ SYSCALL_DEFINE1(swapoff, const char __user *, specialfile)
>>> struct address_space *mapping;
>>> struct inode *inode;
>>> struct filename *pathname;
>>> - int i, type, prev;
>>> - int err;
>>> + int err, found = 0;
>>> unsigned int old_block_size;
>>>
>>> if (!capable(CAP_SYS_ADMIN))
>>> @@ -1842,17 +1806,16 @@ SYSCALL_DEFINE1(swapoff, const char __user *, specialfile)
>>> goto out;
>>>
>>> mapping = victim->f_mapping;
>>> - prev = -1;
>>> spin_lock(&swap_lock);
>>> - for (type = swap_list.head; type >= 0; type = swap_info[type]->next) {
>>> - p = swap_info[type];
>>> + list_for_each_entry(p, &swap_list_head, list) {
>>> if (p->flags & SWP_WRITEOK) {
>>> - if (p->swap_file->f_mapping == mapping)
>>> + if (p->swap_file->f_mapping == mapping) {
>>> + found = 1;
>>> break;
>>> + }
>>> }
>>> - prev = type;
>>> }
>>> - if (type < 0) {
>>> + if (!found) {
>>> err = -EINVAL;
>>> spin_unlock(&swap_lock);
>>> goto out_dput;
>>> @@ -1864,20 +1827,15 @@ SYSCALL_DEFINE1(swapoff, const char __user *, specialfile)
>>> spin_unlock(&swap_lock);
>>> goto out_dput;
>>> }
>>> - if (prev < 0)
>>> - swap_list.head = p->next;
>>> - else
>>> - swap_info[prev]->next = p->next;
>>> - if (type == swap_list.next) {
>>> - /* just pick something that's safe... */
>>> - swap_list.next = swap_list.head;
>>> - }
>>> spin_lock(&p->lock);
>>> if (p->prio < 0) {
>>> - for (i = p->next; i >= 0; i = swap_info[i]->next)
>>> - swap_info[i]->prio = p->prio--;
>>> + struct swap_info_struct *si = p;
>>> + list_for_each_entry_continue(si, &swap_list_head, list) {
>>> + si->prio++;
>>> + }
>>> least_priority++;
>>> }
>>> + list_del_init(&p->list);
>>> atomic_long_sub(p->pages, &nr_swap_pages);
>>> total_swap_pages -= p->pages;
>>> p->flags &= ~SWP_WRITEOK;
>>> @@ -1885,7 +1843,7 @@ SYSCALL_DEFINE1(swapoff, const char __user *, specialfile)
>>> spin_unlock(&swap_lock);
>>>
>>> set_current_oom_origin();
>>> - err = try_to_unuse(type, false, 0); /* force all pages to be unused */
>>> + err = try_to_unuse(p->type, false, 0); /* force unuse all pages */
>>> clear_current_oom_origin();
>>>
>>> if (err) {
>>> @@ -1926,7 +1884,7 @@ SYSCALL_DEFINE1(swapoff, const char __user *, specialfile)
>>> frontswap_map = frontswap_map_get(p);
>>> spin_unlock(&p->lock);
>>> spin_unlock(&swap_lock);
>>> - frontswap_invalidate_area(type);
>>> + frontswap_invalidate_area(p->type);
>>> frontswap_map_set(p, NULL);
>>> mutex_unlock(&swapon_mutex);
>>> free_percpu(p->percpu_cluster);
>>> @@ -1935,7 +1893,7 @@ SYSCALL_DEFINE1(swapoff, const char __user *, specialfile)
>>> vfree(cluster_info);
>>> vfree(frontswap_map);
>>> /* Destroy swap account information */
>>> - swap_cgroup_swapoff(type);
>>> + swap_cgroup_swapoff(p->type);
>>>
>>> inode = mapping->host;
>>> if (S_ISBLK(inode->i_mode)) {
>>> @@ -2142,8 +2100,8 @@ static struct swap_info_struct *alloc_swap_info(void)
>>> */
>>> }
>>> INIT_LIST_HEAD(&p->first_swap_extent.list);
>>> + INIT_LIST_HEAD(&p->list);
>>> p->flags = SWP_USED;
>>> - p->next = -1;
>>> spin_unlock(&swap_lock);
>>> spin_lock_init(&p->lock);
>>>
>>> --
>>> 1.8.3.1
>>>
>>
>> --
>> Mel Gorman
>> SUSE Labs
> --
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to [email protected]
> More majordomo info at http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at http://www.tux.org/lkml/
On Thu, Apr 24, 2014 at 02:48:43PM -0400, Dan Streetman wrote:
> >> <SNIP>
> >> - }
> >> -
> >> + list_for_each(tmp, &swap_list_head) {
> >> + si = list_entry(tmp, typeof(*si), list);
> >> spin_lock(&si->lock);
> >> - if (!si->highest_bit) {
> >> - spin_unlock(&si->lock);
> >> - continue;
> >> - }
> >> - if (!(si->flags & SWP_WRITEOK)) {
> >> + if (!si->highest_bit || !(si->flags & SWP_WRITEOK)) {
> >> spin_unlock(&si->lock);
> >> continue;
> >> }
> >>
> >> - swap_list.next = next;
> >> + /*
> >> + * rotate the current swap_info that we're going to use
> >> + * to after any other swap_info that have the same prio,
> >> + * so that all equal-priority swap_info get used equally
> >> + */
> >> + next = si;
> >> + list_for_each_entry_continue(next, &swap_list_head, list) {
> >> + if (si->prio != next->prio)
> >> + break;
> >> + list_rotate_left(&si->list);
> >> + next = si;
> >> + }
> >>
> >
> > The list manipulations will be a lot of cache writes as the list is shuffled
> > around. On slow storage I do not think this will be noticeable but it may
> > be noticeable on faster swap devices that are SSD based. I've added Shaohua
> > Li to the cc as he has been concerned with the performance of swap in the
> > past. Shaohua, can you run this patchset through any of your test cases
> > with the addition that multiple swap files are used to see if the cache
> > writes are noticeable? You'll need multiple swap files, some of which are
> > at equal priority so the list shuffling logic is triggered.
>
> One performance improvement could be instead of rotating the current
> entry past each following same-prio entry, just scan to the end of the
> same-prio entries and move the current entry there; that would skip
> the extra writes. Especially since this code will run for each
> get_swap_page(), no need for any unnecessary writes.
>
Shaohua is the person who would be most sensitive to performance problems
in this area, and his tests are in the clear. If he's happy then I don't
think there is justification for changing the patch.
> >
> >> spin_unlock(&swap_lock);
> >> /* This is called for allocating swap entry for cache */
> >> offset = scan_swap_map(si, SWAP_HAS_CACHE);
> >> spin_unlock(&si->lock);
> >> if (offset)
> >> - return swp_entry(type, offset);
> >> + return swp_entry(si->type, offset);
> >> spin_lock(&swap_lock);
> >> - next = swap_list.next;
> >> + /*
> >> + * shouldn't really have got here, but for some reason the
> >> + * scan_swap_map came back empty for this swap_info.
> >> + * Since we dropped the swap_lock, there may now be
> >> + * non-full higher prio swap_infos; let's start over.
> >> + */
> >> + tmp = &swap_list_head;
> >> }
> >
> > Has this ever triggered? The number of swap pages was examined under the
> > swap lock so no other process should have been iterating through the
> > swap files. Once a candidate was found, the si lock was acquired for the
> > swap scan map so nothing else should have raced with it.
>
> Well scan_swap_map() does drop the si->lock if it has any trouble at
> all finding an offset to use, so I think it's possible that for a
> nearly-full si multiple concurrent get_swap_page() calls could enter
> scan_swap_map() with the same si, only some of them actually get pages
> from the si and then the si becomes full, and the other threads in
> scan_swap_map() see it's full and exit in failure. I can update the
> code comment there to better indicate why it was reached, instead of
> just saying "we shouldn't have got here" :-)
>
With the updates to some comments then feel free to add
Acked-by: Mel Gorman <[email protected]>
> It may also be worth trying to get a better indicator of "available"
> swap_info_structs for use in get_swap_page(), either by looking at
> something other than si->highest_bit and/or keeping the si out of the
> prio_list until it's actually available for use, not just has a single
> entry free. However, that probably won't be simple and might be
> better as a separate patch to the rest of these changes.
>
I agree that it is likely outside the scope of what this series is meant
to accomplish.
--
Mel Gorman
SUSE Labs
On Thu, Apr 24, 2014 at 01:52:00PM -0400, Dan Streetman wrote:
> On Wed, Apr 23, 2014 at 9:14 AM, Mel Gorman <[email protected]> wrote:
> > On Sat, Apr 12, 2014 at 05:00:54PM -0400, Dan Streetman wrote:
> >> Originally get_swap_page() started iterating through the singly-linked
> >> list of swap_info_structs using swap_list.next or highest_priority_index,
> >> which both were intended to point to the highest priority active swap
> >> target that was not full. The previous patch in this series changed the
> >> singly-linked list to a doubly-linked list, and removed the logic to start
> >> at the highest priority non-full entry; it starts scanning at the highest
> >> priority entry each time, even if the entry is full.
> >>
> >> Add a new list, also priority ordered, to track only swap_info_structs
> >> that are available, i.e. active and not full. Use a new spinlock so that
> >> entries can be added/removed outside of get_swap_page; that wasn't possible
> >> previously because the main list is protected by swap_lock, which can't be
> >> taken when holding a swap_info_struct->lock because of locking order.
> >> The get_swap_page() logic now does not need to hold the swap_lock, and it
> >> iterates only through swap_info_structs that are available.
> >>
> >> Signed-off-by: Dan Streetman <[email protected]>
> >> ---
> >> include/linux/swap.h | 1 +
> >> mm/swapfile.c | 128 ++++++++++++++++++++++++++++++++++-----------------
> >> 2 files changed, 87 insertions(+), 42 deletions(-)
> >>
> >> diff --git a/include/linux/swap.h b/include/linux/swap.h
> >> index 96662d8..d9263db 100644
> >> --- a/include/linux/swap.h
> >> +++ b/include/linux/swap.h
> >> @@ -214,6 +214,7 @@ struct percpu_cluster {
> >> struct swap_info_struct {
> >> unsigned long flags; /* SWP_USED etc: see above */
> >> signed short prio; /* swap priority of this type */
> >> + struct list_head prio_list; /* entry in priority list */
> >> struct list_head list; /* entry in swap list */
> >> signed char type; /* strange name for an index */
> >> unsigned int max; /* extent of the swap_map */
> >> diff --git a/mm/swapfile.c b/mm/swapfile.c
> >> index b958645..3c38461 100644
> >> --- a/mm/swapfile.c
> >> +++ b/mm/swapfile.c
> >> @@ -57,9 +57,13 @@ static const char Unused_file[] = "Unused swap file entry ";
> >> static const char Bad_offset[] = "Bad swap offset entry ";
> >> static const char Unused_offset[] = "Unused swap offset entry ";
> >>
> >> -/* all active swap_info */
> >> +/* all active swap_info; protected with swap_lock */
> >> LIST_HEAD(swap_list_head);
> >>
> >> +/* all available (active, not full) swap_info, priority ordered */
> >> +static LIST_HEAD(prio_head);
> >> +static DEFINE_SPINLOCK(prio_lock);
> >> +
> >
> > I get why you maintain two lists with separate locking but it's code that
> > is specific to swap and in many respects, it's very similar to a plist. Is
> > there a reason why plist was not used at least for prio_head? They're used
> > for futex's so presumably the performance is reasonable. It might reduce
> > the size of swapfile.c further.
> >
> > It is the case that plist does not have the equivalent of rotate which
> > you need to recycle the entries of equal priority but you could add a
> > plist_shuffle helper that "rotates the list left if the next entry is of
> > equal priority".
>
> I did look at plist, but as you said there's no plist_rotate_left() so
> that would either need to be implemented in plist or a helper specific
> to swap added.
>
That in itself should not be impossible, and it would improve an existing
structure that might be reusable elsewhere.
> The plist sort order is also reverse from swap sort order; plist
> orders low->high while swap orders high->low. So either the prio
> would need to be negated when storing in the plist to correct the
> order, or plist_for_each_entry_reverse() would need to be added (or
> another helper specific for swap).
>
Or add a new plist iterator that wraps around list_for_each_entry_reverse
instead of list_for_each_entry? I admit I didn't actually check if this
would work.
> I think the main (only?) benefit of plist is during adds; insertion
> for lists is O(N) while insertion for plists is O(K) where K is the
> number of different priorities. And for swap, the add only happens in
> two places: on swapon (or if swapoff fails and reinserts it) and in
> swap_entry_free() when the swap_info_struct changes from full to
> not-full. The swapon and swapoff failure cases don't matter, and a
> swap_info_struct changing from full to not-full is (probably?) going
> to be a relatively rare occurrence. And even then, unless there are a
> significant number of not-full same-priority swap_info_structs that
> are higher prio than the one being added, there should (?) be no
> difference between plist and normal list add speed.
>
I'm not angling towards plist for performance reasons but for maintenance
reasons. It would just be preferable to use existing structures and
iterators instead of adding new swap-specific code.
> Finally, using a plist further increases the size of each swap_info_struct.
>
Considering how many of them there are in the system I would not worry
too much about the memory footprint in this case. It's not like we are
increasing the size of struct page :)
> So to me, it seemed plists were best used in situations where
> list entries were constantly being added/removed, and the add speed
> needed to be as fast as possible. That isn't typically the case with
> the swap_info_struct priority list, where adding/removing is a
> relatively rare occurrence. However, in the case where there are
> multiple groups of many same-priority swap devices/files, using plists
> may reduce the add insertion time when one of the lower-priority
> swap_info_struct changes from full to not-full after some of the
> higher prio ones also have changed from full to not-full. If you
> think it would be good to change to using a plist, I can update the
> patch.
>
The full to not-full case is a concern but I also think it's a corner case
that's only going to be commonly hit on stress tests and rarely on "normal"
workloads. For maintenance reasons I would prefer that plist be used to
reduce the amount of swap-specific code, moving to swap-specific code only
if there is a measurable gain from doing so. If that is not possible
or the performance would suck in the normal case then just update the
changelog accordingly so the next reviewer does not bring up the same topic.
Thanks!
--
Mel Gorman
SUSE Labs