2014-10-07 23:36:08

by Greg Kroah-Hartman

[permalink] [raw]
Subject: [PATCH 3.14 00/37] 3.14.21-stable review

This is the start of the stable review cycle for the 3.14.21 release.
There are 37 patches in this series, all will be posted as a response
to this one. If anyone has any issues with these being applied, please
let me know.

Responses should be made by Thu Oct 9 23:18:16 UTC 2014.
Anything received after that time might be too late.

The whole patch series can be found in one patch at:
kernel.org/pub/linux/kernel/v3.0/stable-review/patch-3.14.21-rc1.gz
and the diffstat can be found below.

thanks,

greg k-h

-------------
Pseudo-Shortlog of commits:

Greg Kroah-Hartman <[email protected]>
Linux 3.14.21-rc1

Linus Torvalds <[email protected]>
mm: don't pointlessly use BUG_ON() for sanity check

Davidlohr Bueso <[email protected]>
mm: per-thread vma caching

Christoph Lameter <[email protected]>
vmscan: reclaim_clean_pages_from_list() must use mod_zone_page_state()

Vladimir Davydov <[email protected]>
mm: vmscan: shrink_slab: rename max_pass -> freeable

Vladimir Davydov <[email protected]>
mm: vmscan: respect NUMA policy mask when shrinking slab on direct reclaim

Jens Axboe <[email protected]>
mm/filemap.c: avoid always dirtying mapping->flags on O_DIRECT

Mel Gorman <[email protected]>
mm: optimize put_mems_allowed() usage

Raghavendra K T <[email protected]>
mm/readahead.c: fix readahead failure for memoryless NUMA nodes and limit readahead pages

David Rientjes <[email protected]>
mm, compaction: ignore pageblock skip when manually invoking compaction

David Rientjes <[email protected]>
mm, compaction: determine isolation mode only once

Joonsoo Kim <[email protected]>
mm/compaction: clean-up code on success of ballon isolation

Joonsoo Kim <[email protected]>
mm/compaction: check pageblock suitability once per pageblock

Joonsoo Kim <[email protected]>
mm/compaction: change the timing to check to drop the spinlock

Lars Ellenberg <[email protected]>
drbd: fix regression 'out of mem, failed to invoke fence-peer helper'

Joonsoo Kim <[email protected]>
mm/compaction: do not call suitable_migration_target() on every page

Joonsoo Kim <[email protected]>
mm/compaction: disallow high-order page for migration target

David Rientjes <[email protected]>
mm, compaction: avoid isolating pinned pages

Dan Streetman <[email protected]>
swap: change swap_list_head to plist, add swap_avail_head

Dan Streetman <[email protected]>
lib/plist: add plist_requeue

Dan Streetman <[email protected]>
lib/plist: add helper functions

Dan Streetman <[email protected]>
swap: change swap_info singly-linked list to list_head

Michal Hocko <[email protected]>
mm: exclude memoryless nodes from zone_reclaim

Andrew Hunter <[email protected]>
jiffies: Fix timeval conversion to jiffies

Hans Verkuil <[email protected]>
media: vb2: fix VBI/poll regression

Mel Gorman <[email protected]>
mm: numa: Do not mark PTEs pte_numa when splitting huge pages

Waiman Long <[email protected]>
mm, thp: move invariant bug check out of loop in __split_huge_page_map

Nishanth Aravamudan <[email protected]>
hugetlb: ensure hugepage access is denied if hugepages are not supported

Pavel Shilovsky <[email protected]>
CIFS: Fix SMB2 readdir error handling

Steven Rostedt (Red Hat) <[email protected]>
ring-buffer: Fix infinite spin in reading buffer

Josh Triplett <[email protected]>
init/Kconfig: Fix HAVE_FUTEX_CMPXCHG to not break up the EXPERT menu

Steve French <[email protected]>
Fix problem recognizing symlinks

Chris Wilson <[email protected]>
drm/i915: Flush the PTEs after updating them before suspend

NeilBrown <[email protected]>
md/raid5: disable 'DISCARD' by default due to safety concerns.

Arnd Bergmann <[email protected]>
cpufreq: integrator: fix integrator_cpufreq_remove return type

Mel Gorman <[email protected]>
mm: migrate: Close race between migration completion and mprotect

Peter Zijlstra <[email protected]>
perf: fix perf bug in fork()

Jan Kara <[email protected]>
udf: Avoid infinite loop when processing indirect ICBs


-------------

Diffstat:

Makefile | 4 +-
arch/unicore32/include/asm/mmu_context.h | 4 +-
drivers/block/drbd/drbd_nl.c | 6 +
drivers/cpufreq/integrator-cpufreq.c | 4 +-
drivers/gpu/drm/i915/i915_gem_gtt.c | 14 +-
drivers/md/raid5.c | 18 ++-
drivers/media/v4l2-core/videobuf2-core.c | 15 ++-
fs/cifs/cifsglob.h | 2 +
fs/cifs/file.c | 2 +-
fs/cifs/readdir.c | 2 +-
fs/cifs/smb1ops.c | 9 +-
fs/cifs/smb2maperror.c | 4 +-
fs/cifs/smb2ops.c | 9 ++
fs/cifs/smb2pdu.c | 9 +-
fs/exec.c | 5 +-
fs/hugetlbfs/inode.c | 5 +
fs/proc/task_mmu.c | 3 +-
fs/udf/inode.c | 35 +++--
include/linux/cpuset.h | 27 ++--
include/linux/hugetlb.h | 10 ++
include/linux/jiffies.h | 12 --
include/linux/mm_types.h | 4 +-
include/linux/plist.h | 45 +++++++
include/linux/sched.h | 7 +
include/linux/swap.h | 8 +-
include/linux/swapfile.h | 2 +-
include/linux/vmacache.h | 38 ++++++
include/media/videobuf2-core.h | 4 +
init/Kconfig | 1 +
kernel/cpuset.c | 2 +-
kernel/debug/debug_core.c | 14 +-
kernel/events/core.c | 4 +-
kernel/fork.c | 12 +-
kernel/time.c | 54 ++++----
kernel/trace/ring_buffer.c | 2 +-
lib/plist.c | 52 +++++++
mm/Makefile | 2 +-
mm/compaction.c | 94 +++++++------
mm/filemap.c | 10 +-
mm/frontswap.c | 13 +-
mm/huge_memory.c | 11 +-
mm/hugetlb.c | 17 ++-
mm/mempolicy.c | 12 +-
mm/migrate.c | 5 +-
mm/mmap.c | 55 ++++----
mm/nommu.c | 24 ++--
mm/page_alloc.c | 13 +-
mm/readahead.c | 4 +-
mm/slab.c | 4 +-
mm/slub.c | 16 +--
mm/swapfile.c | 224 ++++++++++++++++---------------
mm/vmacache.c | 114 ++++++++++++++++
mm/vmscan.c | 32 ++---
53 files changed, 750 insertions(+), 348 deletions(-)


2014-10-07 23:20:07

by Greg Kroah-Hartman

[permalink] [raw]
Subject: [PATCH 3.14 22/37] mm/compaction: disallow high-order page for migration target

3.14-stable review patch. If anyone has any objections, please let me know.

------------------

From: Joonsoo Kim <[email protected]>

commit 7d348b9ea64db0a315d777ce7d4b06697f946503 upstream.

Purpose of compaction is to get a high order page. Currently, if we
find high-order page while searching migration target page, we break it
to order-0 pages and use them as migration target. It is contrary to
purpose of compaction, so disallow high-order page to be used for
migration target.

Additionally, clean-up logic in suitable_migration_target() to simplify
the code. There is no functional changes from this clean-up.

Signed-off-by: Joonsoo Kim <[email protected]>
Acked-by: Vlastimil Babka <[email protected]>
Cc: Mel Gorman <[email protected]>
Cc: Rik van Riel <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
Signed-off-by: Mel Gorman <[email protected]>
Signed-off-by: Greg Kroah-Hartman <[email protected]>

---
mm/compaction.c | 15 +++------------
1 file changed, 3 insertions(+), 12 deletions(-)

--- a/mm/compaction.c
+++ b/mm/compaction.c
@@ -217,21 +217,12 @@ static inline bool compact_trylock_irqsa
/* Returns true if the page is within a block suitable for migration to */
static bool suitable_migration_target(struct page *page)
{
- int migratetype = get_pageblock_migratetype(page);
-
- /* Don't interfere with memory hot-remove or the min_free_kbytes blocks */
- if (migratetype == MIGRATE_RESERVE)
- return false;
-
- if (is_migrate_isolate(migratetype))
- return false;
-
- /* If the page is a large free page, then allow migration */
+ /* If the page is a large free page, then disallow migration */
if (PageBuddy(page) && page_order(page) >= pageblock_order)
- return true;
+ return false;

/* If the block is MIGRATE_MOVABLE or MIGRATE_CMA, allow migration */
- if (migrate_async_suitable(migratetype))
+ if (migrate_async_suitable(get_pageblock_migratetype(page)))
return true;

/* Otherwise skip the block */

2014-10-07 23:20:20

by Greg Kroah-Hartman

[permalink] [raw]
Subject: [PATCH 3.14 24/37] drbd: fix regression out of mem, failed to invoke fence-peer helper

3.14-stable review patch. If anyone has any objections, please let me know.

------------------

From: Lars Ellenberg <[email protected]>

commit bbc1c5e8ad6dfebf9d13b8a4ccdf66c92913eac9 upstream.

Since linux kernel 3.13, kthread_run() internally uses
wait_for_completion_killable(). We sometimes may use kthread_run()
while we still have a signal pending, which we used to kick our threads
out of potentially blocking network functions, causing kthread_run() to
mistake that as a new fatal signal and fail.

Fix: flush_signals() before kthread_run().

Signed-off-by: Philipp Reisner <[email protected]>
Signed-off-by: Lars Ellenberg <[email protected]>
Signed-off-by: Jens Axboe <[email protected]>
Signed-off-by: Greg Kroah-Hartman <[email protected]>

---
drivers/block/drbd/drbd_nl.c | 6 ++++++
1 file changed, 6 insertions(+)

--- a/drivers/block/drbd/drbd_nl.c
+++ b/drivers/block/drbd/drbd_nl.c
@@ -525,6 +525,12 @@ void conn_try_outdate_peer_async(struct
struct task_struct *opa;

kref_get(&tconn->kref);
+ /* We may just have force_sig()'ed this thread
+ * to get it out of some blocking network function.
+ * Clear signals; otherwise kthread_run(), which internally uses
+ * wait_on_completion_killable(), will mistake our pending signal
+ * for a new fatal signal and fail. */
+ flush_signals(current);
opa = kthread_run(_try_outdate_peer_async, tconn, "drbd_async_h");
if (IS_ERR(opa)) {
conn_err(tconn, "out of mem, failed to invoke fence-peer helper\n");

2014-10-07 23:20:22

by Greg Kroah-Hartman

[permalink] [raw]
Subject: [PATCH 3.14 34/37] mm: vmscan: shrink_slab: rename max_pass -> freeable

3.14-stable review patch. If anyone has any objections, please let me know.

------------------

From: Vladimir Davydov <[email protected]>

commit d5bc5fd3fcb7b8dfb431694a8c8052466504c10c upstream.

The name `max_pass' is misleading, because this variable actually keeps
the estimate number of freeable objects, not the maximal number of
objects we can scan in this pass, which can be twice that. Rename it to
reflect its actual meaning.

Signed-off-by: Vladimir Davydov <[email protected]>
Acked-by: David Rientjes <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
Signed-off-by: Mel Gorman <[email protected]>
Signed-off-by: Greg Kroah-Hartman <[email protected]>

---
mm/vmscan.c | 26 +++++++++++++-------------
1 file changed, 13 insertions(+), 13 deletions(-)

--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -224,15 +224,15 @@ shrink_slab_node(struct shrink_control *
unsigned long freed = 0;
unsigned long long delta;
long total_scan;
- long max_pass;
+ long freeable;
long nr;
long new_nr;
int nid = shrinkctl->nid;
long batch_size = shrinker->batch ? shrinker->batch
: SHRINK_BATCH;

- max_pass = shrinker->count_objects(shrinker, shrinkctl);
- if (max_pass == 0)
+ freeable = shrinker->count_objects(shrinker, shrinkctl);
+ if (freeable == 0)
return 0;

/*
@@ -244,14 +244,14 @@ shrink_slab_node(struct shrink_control *

total_scan = nr;
delta = (4 * nr_pages_scanned) / shrinker->seeks;
- delta *= max_pass;
+ delta *= freeable;
do_div(delta, lru_pages + 1);
total_scan += delta;
if (total_scan < 0) {
printk(KERN_ERR
"shrink_slab: %pF negative objects to delete nr=%ld\n",
shrinker->scan_objects, total_scan);
- total_scan = max_pass;
+ total_scan = freeable;
}

/*
@@ -260,26 +260,26 @@ shrink_slab_node(struct shrink_control *
* shrinkers to return -1 all the time. This results in a large
* nr being built up so when a shrink that can do some work
* comes along it empties the entire cache due to nr >>>
- * max_pass. This is bad for sustaining a working set in
+ * freeable. This is bad for sustaining a working set in
* memory.
*
* Hence only allow the shrinker to scan the entire cache when
* a large delta change is calculated directly.
*/
- if (delta < max_pass / 4)
- total_scan = min(total_scan, max_pass / 2);
+ if (delta < freeable / 4)
+ total_scan = min(total_scan, freeable / 2);

/*
* Avoid risking looping forever due to too large nr value:
* never try to free more than twice the estimate number of
* freeable entries.
*/
- if (total_scan > max_pass * 2)
- total_scan = max_pass * 2;
+ if (total_scan > freeable * 2)
+ total_scan = freeable * 2;

trace_mm_shrink_slab_start(shrinker, shrinkctl, nr,
nr_pages_scanned, lru_pages,
- max_pass, delta, total_scan);
+ freeable, delta, total_scan);

/*
* Normally, we should not scan less than batch_size objects in one
@@ -292,12 +292,12 @@ shrink_slab_node(struct shrink_control *
*
* We detect the "tight on memory" situations by looking at the total
* number of objects we want to scan (total_scan). If it is greater
- * than the total number of objects on slab (max_pass), we must be
+ * than the total number of objects on slab (freeable), we must be
* scanning at high prio and therefore should try to reclaim as much as
* possible.
*/
while (total_scan >= batch_size ||
- total_scan >= max_pass) {
+ total_scan >= freeable) {
unsigned long ret;
unsigned long nr_to_scan = min(batch_size, total_scan);


2014-10-07 23:20:18

by Greg Kroah-Hartman

[permalink] [raw]
Subject: [PATCH 3.14 19/37] lib/plist: add plist_requeue

3.14-stable review patch. If anyone has any objections, please let me know.

------------------

From: Dan Streetman <[email protected]>

commit a75f232ce0fe38bd01301899ecd97ffd0254316a upstream.

Add plist_requeue(), which moves the specified plist_node after all other
same-priority plist_nodes in the list. This is essentially an optimized
plist_del() followed by plist_add().

This is needed by swap, which (with the next patch in this set) uses a
plist of available swap devices. When a swap device (either a swap
partition or swap file) are added to the system with swapon(), the device
is added to a plist, ordered by the swap device's priority. When swap
needs to allocate a page from one of the swap devices, it takes the page
from the first swap device on the plist, which is the highest priority
swap device. The swap device is left in the plist until all its pages are
used, and then removed from the plist when it becomes full.

However, as described in man 2 swapon, swap must allocate pages from swap
devices with the same priority in round-robin order; to do this, on each
swap page allocation, swap uses a page from the first swap device in the
plist, and then calls plist_requeue() to move that swap device entry to
after any other same-priority swap devices. The next swap page allocation
will again use a page from the first swap device in the plist and requeue
it, and so on, resulting in round-robin usage of equal-priority swap
devices.

Also add plist_test_requeue() test function, for use by plist_test() to
test plist_requeue() function.

Signed-off-by: Dan Streetman <[email protected]>
Cc: Steven Rostedt <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Acked-by: Mel Gorman <[email protected]>
Cc: Paul Gortmaker <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Shaohua Li <[email protected]>
Cc: Hugh Dickins <[email protected]>
Cc: Dan Streetman <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: Christian Ehrhardt <[email protected]>
Cc: Weijie Yang <[email protected]>
Cc: Rik van Riel <[email protected]>
Cc: Johannes Weiner <[email protected]>
Cc: Bob Liu <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
Signed-off-by: Mel Gorman <[email protected]>
Signed-off-by: Greg Kroah-Hartman <[email protected]>

---
include/linux/plist.h | 2 +
lib/plist.c | 52 ++++++++++++++++++++++++++++++++++++++++++++++++++
2 files changed, 54 insertions(+)

--- a/include/linux/plist.h
+++ b/include/linux/plist.h
@@ -141,6 +141,8 @@ static inline void plist_node_init(struc
extern void plist_add(struct plist_node *node, struct plist_head *head);
extern void plist_del(struct plist_node *node, struct plist_head *head);

+extern void plist_requeue(struct plist_node *node, struct plist_head *head);
+
/**
* plist_for_each - iterate over the plist
* @pos: the type * to use as a loop counter
--- a/lib/plist.c
+++ b/lib/plist.c
@@ -134,6 +134,46 @@ void plist_del(struct plist_node *node,
plist_check_head(head);
}

+/**
+ * plist_requeue - Requeue @node at end of same-prio entries.
+ *
+ * This is essentially an optimized plist_del() followed by
+ * plist_add(). It moves an entry already in the plist to
+ * after any other same-priority entries.
+ *
+ * @node: &struct plist_node pointer - entry to be moved
+ * @head: &struct plist_head pointer - list head
+ */
+void plist_requeue(struct plist_node *node, struct plist_head *head)
+{
+ struct plist_node *iter;
+ struct list_head *node_next = &head->node_list;
+
+ plist_check_head(head);
+ BUG_ON(plist_head_empty(head));
+ BUG_ON(plist_node_empty(node));
+
+ if (node == plist_last(head))
+ return;
+
+ iter = plist_next(node);
+
+ if (node->prio != iter->prio)
+ return;
+
+ plist_del(node, head);
+
+ plist_for_each_continue(iter, head) {
+ if (node->prio != iter->prio) {
+ node_next = &iter->node_list;
+ break;
+ }
+ }
+ list_add_tail(&node->node_list, node_next);
+
+ plist_check_head(head);
+}
+
#ifdef CONFIG_DEBUG_PI_LIST
#include <linux/sched.h>
#include <linux/module.h>
@@ -170,6 +210,14 @@ static void __init plist_test_check(int
BUG_ON(prio_pos->prio_list.next != &first->prio_list);
}

+static void __init plist_test_requeue(struct plist_node *node)
+{
+ plist_requeue(node, &test_head);
+
+ if (node != plist_last(&test_head))
+ BUG_ON(node->prio == plist_next(node)->prio);
+}
+
static int __init plist_test(void)
{
int nr_expect = 0, i, loop;
@@ -193,6 +241,10 @@ static int __init plist_test(void)
nr_expect--;
}
plist_test_check(nr_expect);
+ if (!plist_node_empty(test_node + i)) {
+ plist_test_requeue(test_node + i);
+ plist_test_check(nr_expect);
+ }
}

for (i = 0; i < ARRAY_SIZE(test_node); i++) {

2014-10-07 23:20:16

by Greg Kroah-Hartman

[permalink] [raw]
Subject: [PATCH 3.14 20/37] swap: change swap_list_head to plist, add swap_avail_head

3.14-stable review patch. If anyone has any objections, please let me know.

------------------

From: Dan Streetman <[email protected]>

commit 18ab4d4ced0817421e6db6940374cc39d28d65da upstream.

Originally get_swap_page() started iterating through the singly-linked
list of swap_info_structs using swap_list.next or highest_priority_index,
which both were intended to point to the highest priority active swap
target that was not full. The first patch in this series changed the
singly-linked list to a doubly-linked list, and removed the logic to start
at the highest priority non-full entry; it starts scanning at the highest
priority entry each time, even if the entry is full.

Replace the manually ordered swap_list_head with a plist, swap_active_head.
Add a new plist, swap_avail_head. The original swap_active_head plist
contains all active swap_info_structs, as before, while the new
swap_avail_head plist contains only swap_info_structs that are active and
available, i.e. not full. Add a new spinlock, swap_avail_lock, to protect
the swap_avail_head list.

Mel Gorman suggested using plists since they internally handle ordering
the list entries based on priority, which is exactly what swap was doing
manually. All the ordering code is now removed, and swap_info_struct
entries and simply added to their corresponding plist and automatically
ordered correctly.

Using a new plist for available swap_info_structs simplifies and
optimizes get_swap_page(), which no longer has to iterate over full
swap_info_structs. Using a new spinlock for swap_avail_head plist
allows each swap_info_struct to add or remove themselves from the
plist when they become full or not-full; previously they could not
do so because the swap_info_struct->lock is held when they change
from full<->not-full, and the swap_lock protecting the main
swap_active_head must be ordered before any swap_info_struct->lock.

Signed-off-by: Dan Streetman <[email protected]>
Acked-by: Mel Gorman <[email protected]>
Cc: Shaohua Li <[email protected]>
Cc: Steven Rostedt <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Hugh Dickins <[email protected]>
Cc: Dan Streetman <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: Christian Ehrhardt <[email protected]>
Cc: Weijie Yang <[email protected]>
Cc: Rik van Riel <[email protected]>
Cc: Johannes Weiner <[email protected]>
Cc: Bob Liu <[email protected]>
Cc: Paul Gortmaker <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
Signed-off-by: Mel Gorman <[email protected]>
Signed-off-by: Greg Kroah-Hartman <[email protected]>

---
include/linux/swap.h | 3
include/linux/swapfile.h | 2
mm/frontswap.c | 6 -
mm/swapfile.c | 145 +++++++++++++++++++++++++++++------------------
4 files changed, 97 insertions(+), 59 deletions(-)

--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -214,7 +214,8 @@ struct percpu_cluster {
struct swap_info_struct {
unsigned long flags; /* SWP_USED etc: see above */
signed short prio; /* swap priority of this type */
- struct list_head list; /* entry in swap list */
+ struct plist_node list; /* entry in swap_active_head */
+ struct plist_node avail_list; /* entry in swap_avail_head */
signed char type; /* strange name for an index */
unsigned int max; /* extent of the swap_map */
unsigned char *swap_map; /* vmalloc'ed array of usage counts */
--- a/include/linux/swapfile.h
+++ b/include/linux/swapfile.h
@@ -6,7 +6,7 @@
* want to expose them to the dozens of source files that include swap.h
*/
extern spinlock_t swap_lock;
-extern struct list_head swap_list_head;
+extern struct plist_head swap_active_head;
extern struct swap_info_struct *swap_info[];
extern int try_to_unuse(unsigned int, bool, unsigned long);

--- a/mm/frontswap.c
+++ b/mm/frontswap.c
@@ -331,7 +331,7 @@ static unsigned long __frontswap_curr_pa
struct swap_info_struct *si = NULL;

assert_spin_locked(&swap_lock);
- list_for_each_entry(si, &swap_list_head, list)
+ plist_for_each_entry(si, &swap_active_head, list)
totalpages += atomic_read(&si->frontswap_pages);
return totalpages;
}
@@ -346,7 +346,7 @@ static int __frontswap_unuse_pages(unsig
unsigned long pages = 0, pages_to_unuse = 0;

assert_spin_locked(&swap_lock);
- list_for_each_entry(si, &swap_list_head, list) {
+ plist_for_each_entry(si, &swap_active_head, list) {
si_frontswap_pages = atomic_read(&si->frontswap_pages);
if (total_pages_to_unuse < si_frontswap_pages) {
pages = pages_to_unuse = total_pages_to_unuse;
@@ -408,7 +408,7 @@ void frontswap_shrink(unsigned long targ
/*
* we don't want to hold swap_lock while doing a very
* lengthy try_to_unuse, but swap_list may change
- * so restart scan from swap_list_head each time
+ * so restart scan from swap_active_head each time
*/
spin_lock(&swap_lock);
ret = __frontswap_shrink(target_pages, &pages_to_unuse, &type);
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -61,7 +61,22 @@ static const char Unused_offset[] = "Unu
* all active swap_info_structs
* protected with swap_lock, and ordered by priority.
*/
-LIST_HEAD(swap_list_head);
+PLIST_HEAD(swap_active_head);
+
+/*
+ * all available (active, not full) swap_info_structs
+ * protected with swap_avail_lock, ordered by priority.
+ * This is used by get_swap_page() instead of swap_active_head
+ * because swap_active_head includes all swap_info_structs,
+ * but get_swap_page() doesn't need to look at full ones.
+ * This uses its own lock instead of swap_lock because when a
+ * swap_info_struct changes between not-full/full, it needs to
+ * add/remove itself to/from this list, but the swap_info_struct->lock
+ * is held and the locking order requires swap_lock to be taken
+ * before any swap_info_struct->lock.
+ */
+static PLIST_HEAD(swap_avail_head);
+static DEFINE_SPINLOCK(swap_avail_lock);

struct swap_info_struct *swap_info[MAX_SWAPFILES];

@@ -594,6 +609,9 @@ checks:
if (si->inuse_pages == si->pages) {
si->lowest_bit = si->max;
si->highest_bit = 0;
+ spin_lock(&swap_avail_lock);
+ plist_del(&si->avail_list, &swap_avail_head);
+ spin_unlock(&swap_avail_lock);
}
si->swap_map[offset] = usage;
inc_cluster_info_page(si, si->cluster_info, offset);
@@ -645,57 +663,63 @@ swp_entry_t get_swap_page(void)
{
struct swap_info_struct *si, *next;
pgoff_t offset;
- struct list_head *tmp;

- spin_lock(&swap_lock);
if (atomic_long_read(&nr_swap_pages) <= 0)
goto noswap;
atomic_long_dec(&nr_swap_pages);

- list_for_each(tmp, &swap_list_head) {
- si = list_entry(tmp, typeof(*si), list);
+ spin_lock(&swap_avail_lock);
+
+start_over:
+ plist_for_each_entry_safe(si, next, &swap_avail_head, avail_list) {
+ /* requeue si to after same-priority siblings */
+ plist_requeue(&si->avail_list, &swap_avail_head);
+ spin_unlock(&swap_avail_lock);
spin_lock(&si->lock);
if (!si->highest_bit || !(si->flags & SWP_WRITEOK)) {
+ spin_lock(&swap_avail_lock);
+ if (plist_node_empty(&si->avail_list)) {
+ spin_unlock(&si->lock);
+ goto nextsi;
+ }
+ WARN(!si->highest_bit,
+ "swap_info %d in list but !highest_bit\n",
+ si->type);
+ WARN(!(si->flags & SWP_WRITEOK),
+ "swap_info %d in list but !SWP_WRITEOK\n",
+ si->type);
+ plist_del(&si->avail_list, &swap_avail_head);
spin_unlock(&si->lock);
- continue;
+ goto nextsi;
}

- /*
- * rotate the current swap_info that we're going to use
- * to after any other swap_info that have the same prio,
- * so that all equal-priority swap_info get used equally
- */
- next = si;
- list_for_each_entry_continue(next, &swap_list_head, list) {
- if (si->prio != next->prio)
- break;
- list_rotate_left(&si->list);
- next = si;
- }
-
- spin_unlock(&swap_lock);
/* This is called for allocating swap entry for cache */
offset = scan_swap_map(si, SWAP_HAS_CACHE);
spin_unlock(&si->lock);
if (offset)
return swp_entry(si->type, offset);
- spin_lock(&swap_lock);
+ pr_debug("scan_swap_map of si %d failed to find offset\n",
+ si->type);
+ spin_lock(&swap_avail_lock);
+nextsi:
/*
* if we got here, it's likely that si was almost full before,
* and since scan_swap_map() can drop the si->lock, multiple
* callers probably all tried to get a page from the same si
- * and it filled up before we could get one. So we need to
- * try again. Since we dropped the swap_lock, there may now
- * be non-full higher priority swap_infos, and this si may have
- * even been removed from the list (although very unlikely).
- * Let's start over.
+ * and it filled up before we could get one; or, the si filled
+ * up between us dropping swap_avail_lock and taking si->lock.
+ * Since we dropped the swap_avail_lock, the swap_avail_head
+ * list may have been modified; so if next is still in the
+ * swap_avail_head list then try it, otherwise start over.
*/
- tmp = &swap_list_head;
+ if (plist_node_empty(&next->avail_list))
+ goto start_over;
}

+ spin_unlock(&swap_avail_lock);
+
atomic_long_inc(&nr_swap_pages);
noswap:
- spin_unlock(&swap_lock);
return (swp_entry_t) {0};
}

@@ -798,8 +822,18 @@ static unsigned char swap_entry_free(str
dec_cluster_info_page(p, p->cluster_info, offset);
if (offset < p->lowest_bit)
p->lowest_bit = offset;
- if (offset > p->highest_bit)
+ if (offset > p->highest_bit) {
+ bool was_full = !p->highest_bit;
p->highest_bit = offset;
+ if (was_full && (p->flags & SWP_WRITEOK)) {
+ spin_lock(&swap_avail_lock);
+ WARN_ON(!plist_node_empty(&p->avail_list));
+ if (plist_node_empty(&p->avail_list))
+ plist_add(&p->avail_list,
+ &swap_avail_head);
+ spin_unlock(&swap_avail_lock);
+ }
+ }
atomic_long_inc(&nr_swap_pages);
p->inuse_pages--;
frontswap_invalidate_page(p->type, offset);
@@ -1734,12 +1768,16 @@ static void _enable_swap_info(struct swa
unsigned char *swap_map,
struct swap_cluster_info *cluster_info)
{
- struct swap_info_struct *si;
-
if (prio >= 0)
p->prio = prio;
else
p->prio = --least_priority;
+ /*
+ * the plist prio is negated because plist ordering is
+ * low-to-high, while swap ordering is high-to-low
+ */
+ p->list.prio = -p->prio;
+ p->avail_list.prio = -p->prio;
p->swap_map = swap_map;
p->cluster_info = cluster_info;
p->flags |= SWP_WRITEOK;
@@ -1747,27 +1785,20 @@ static void _enable_swap_info(struct swa
total_swap_pages += p->pages;

assert_spin_locked(&swap_lock);
- BUG_ON(!list_empty(&p->list));
- /*
- * insert into swap list; the list is in priority order,
- * so that get_swap_page() can get a page from the highest
- * priority swap_info_struct with available page(s), and
- * swapoff can adjust the auto-assigned (i.e. negative) prio
- * values for any lower-priority swap_info_structs when
- * removing a negative-prio swap_info_struct
- */
- list_for_each_entry(si, &swap_list_head, list) {
- if (p->prio >= si->prio) {
- list_add_tail(&p->list, &si->list);
- return;
- }
- }
/*
- * this covers two cases:
- * 1) p->prio is less than all existing prio
- * 2) the swap list is empty
+ * both lists are plists, and thus priority ordered.
+ * swap_active_head needs to be priority ordered for swapoff(),
+ * which on removal of any swap_info_struct with an auto-assigned
+ * (i.e. negative) priority increments the auto-assigned priority
+ * of any lower-priority swap_info_structs.
+ * swap_avail_head needs to be priority ordered for get_swap_page(),
+ * which allocates swap pages from the highest available priority
+ * swap_info_struct.
*/
- list_add_tail(&p->list, &swap_list_head);
+ plist_add(&p->list, &swap_active_head);
+ spin_lock(&swap_avail_lock);
+ plist_add(&p->avail_list, &swap_avail_head);
+ spin_unlock(&swap_avail_lock);
}

static void enable_swap_info(struct swap_info_struct *p, int prio,
@@ -1821,7 +1852,7 @@ SYSCALL_DEFINE1(swapoff, const char __us

mapping = victim->f_mapping;
spin_lock(&swap_lock);
- list_for_each_entry(p, &swap_list_head, list) {
+ plist_for_each_entry(p, &swap_active_head, list) {
if (p->flags & SWP_WRITEOK) {
if (p->swap_file->f_mapping == mapping) {
found = 1;
@@ -1841,16 +1872,21 @@ SYSCALL_DEFINE1(swapoff, const char __us
spin_unlock(&swap_lock);
goto out_dput;
}
+ spin_lock(&swap_avail_lock);
+ plist_del(&p->avail_list, &swap_avail_head);
+ spin_unlock(&swap_avail_lock);
spin_lock(&p->lock);
if (p->prio < 0) {
struct swap_info_struct *si = p;

- list_for_each_entry_continue(si, &swap_list_head, list) {
+ plist_for_each_entry_continue(si, &swap_active_head, list) {
si->prio++;
+ si->list.prio--;
+ si->avail_list.prio--;
}
least_priority++;
}
- list_del_init(&p->list);
+ plist_del(&p->list, &swap_active_head);
atomic_long_sub(p->pages, &nr_swap_pages);
total_swap_pages -= p->pages;
p->flags &= ~SWP_WRITEOK;
@@ -2115,7 +2151,8 @@ static struct swap_info_struct *alloc_sw
*/
}
INIT_LIST_HEAD(&p->first_swap_extent.list);
- INIT_LIST_HEAD(&p->list);
+ plist_node_init(&p->list, 0);
+ plist_node_init(&p->avail_list, 0);
p->flags = SWP_USED;
spin_unlock(&swap_lock);
spin_lock_init(&p->lock);

2014-10-07 23:20:13

by Greg Kroah-Hartman

[permalink] [raw]
Subject: [PATCH 3.14 23/37] mm/compaction: do not call suitable_migration_target() on every page

3.14-stable review patch. If anyone has any objections, please let me know.

------------------

From: Joonsoo Kim <[email protected]>

commit 01ead5340bcf5f3a1cd2452c75516d0ef4d908d7 upstream.

suitable_migration_target() checks that pageblock is suitable for
migration target. In isolate_freepages_block(), it is called on every
page and this is inefficient. So make it called once per pageblock.

suitable_migration_target() also checks if page is highorder or not, but
it's criteria for highorder is pageblock order. So calling it once
within pageblock range has no problem.

Signed-off-by: Joonsoo Kim <[email protected]>
Acked-by: Vlastimil Babka <[email protected]>
Cc: Mel Gorman <[email protected]>
Cc: Rik van Riel <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
Signed-off-by: Mel Gorman <[email protected]>
Signed-off-by: Greg Kroah-Hartman <[email protected]>

---
mm/compaction.c | 13 +++++++++++--
1 file changed, 11 insertions(+), 2 deletions(-)

--- a/mm/compaction.c
+++ b/mm/compaction.c
@@ -244,6 +244,7 @@ static unsigned long isolate_freepages_b
struct page *cursor, *valid_page = NULL;
unsigned long flags;
bool locked = false;
+ bool checked_pageblock = false;

cursor = pfn_to_page(blockpfn);

@@ -275,8 +276,16 @@ static unsigned long isolate_freepages_b
break;

/* Recheck this is a suitable migration target under lock */
- if (!strict && !suitable_migration_target(page))
- break;
+ if (!strict && !checked_pageblock) {
+ /*
+ * We need to check suitability of pageblock only once
+ * and this isolate_freepages_block() is called with
+ * pageblock range, so just check once is sufficient.
+ */
+ checked_pageblock = true;
+ if (!suitable_migration_target(page))
+ break;
+ }

/* Recheck this is a buddy page under lock */
if (!PageBuddy(page))

2014-10-07 23:20:11

by Greg Kroah-Hartman

[permalink] [raw]
Subject: [PATCH 3.14 21/37] mm, compaction: avoid isolating pinned pages

3.14-stable review patch. If anyone has any objections, please let me know.

------------------

From: David Rientjes <[email protected]>

commit 119d6d59dcc0980dcd581fdadb6b2033b512a473 upstream.

Page migration will fail for memory that is pinned in memory with, for
example, get_user_pages(). In this case, it is unnecessary to take
zone->lru_lock or isolating the page and passing it to page migration
which will ultimately fail.

This is a racy check, the page can still change from under us, but in
that case we'll just fail later when attempting to move the page.

This avoids very expensive memory compaction when faulting transparent
hugepages after pinning a lot of memory with a Mellanox driver.

On a 128GB machine and pinning ~120GB of memory, before this patch we
see the enormous disparity in the number of page migration failures
because of the pinning (from /proc/vmstat):

compact_pages_moved 8450
compact_pagemigrate_failed 15614415

0.05% of pages isolated are successfully migrated and explicitly
triggering memory compaction takes 102 seconds. After the patch:

compact_pages_moved 9197
compact_pagemigrate_failed 7

99.9% of pages isolated are now successfully migrated in this
configuration and memory compaction takes less than one second.

Signed-off-by: David Rientjes <[email protected]>
Acked-by: Hugh Dickins <[email protected]>
Acked-by: Mel Gorman <[email protected]>
Cc: Joonsoo Kim <[email protected]>
Cc: Rik van Riel <[email protected]>
Cc: Greg Thelen <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
Signed-off-by: Mel Gorman <[email protected]>
Signed-off-by: Greg Kroah-Hartman <[email protected]>

---
mm/compaction.c | 9 +++++++++
1 file changed, 9 insertions(+)

--- a/mm/compaction.c
+++ b/mm/compaction.c
@@ -584,6 +584,15 @@ isolate_migratepages_range(struct zone *
continue;
}

+ /*
+ * Migration will fail if an anonymous page is pinned in memory,
+ * so avoid taking lru_lock and isolating it unnecessarily in an
+ * admittedly racy check.
+ */
+ if (!page_mapping(page) &&
+ page_count(page) > page_mapcount(page))
+ continue;
+
/* Check if it is ok to still hold the lock */
locked = compact_checklock_irqsave(&zone->lru_lock, &flags,
locked, cc);

2014-10-07 23:26:41

by Greg Kroah-Hartman

[permalink] [raw]
Subject: [PATCH 3.14 25/37] mm/compaction: change the timing to check to drop the spinlock

3.14-stable review patch. If anyone has any objections, please let me know.

------------------

From: Joonsoo Kim <[email protected]>

commit be1aa03b973c7dcdc576f3503f7a60429825c35d upstream.

It is odd to drop the spinlock when we scan (SWAP_CLUSTER_MAX - 1) th
pfn page. This may results in below situation while isolating
migratepage.

1. try isolate 0x0 ~ 0x200 pfn pages.
2. When low_pfn is 0x1ff, ((low_pfn+1) % SWAP_CLUSTER_MAX) == 0, so drop
the spinlock.
3. Then, to complete isolating, retry to aquire the lock.

I think that it is better to use SWAP_CLUSTER_MAX th pfn for checking the
criteria about dropping the lock. This has no harm 0x0 pfn, because, at
this time, locked variable would be false.

Signed-off-by: Joonsoo Kim <[email protected]>
Acked-by: Vlastimil Babka <[email protected]>
Cc: Mel Gorman <[email protected]>
Cc: Rik van Riel <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
Signed-off-by: Mel Gorman <[email protected]>
Signed-off-by: Greg Kroah-Hartman <[email protected]>

---
mm/compaction.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)

--- a/mm/compaction.c
+++ b/mm/compaction.c
@@ -487,7 +487,7 @@ isolate_migratepages_range(struct zone *
cond_resched();
for (; low_pfn < end_pfn; low_pfn++) {
/* give a chance to irqs before checking need_resched() */
- if (locked && !((low_pfn+1) % SWAP_CLUSTER_MAX)) {
+ if (locked && !(low_pfn % SWAP_CLUSTER_MAX)) {
if (should_release_lock(&zone->lru_lock)) {
spin_unlock_irqrestore(&zone->lru_lock, flags);
locked = false;

2014-10-07 23:26:56

by Greg Kroah-Hartman

[permalink] [raw]
Subject: [PATCH 3.14 33/37] mm: vmscan: respect NUMA policy mask when shrinking slab on direct reclaim

3.14-stable review patch. If anyone has any objections, please let me know.

------------------

From: Vladimir Davydov <[email protected]>

commit 99120b772b52853f9a2b829a21dd44d9b20558f1 upstream.

When direct reclaim is executed by a process bound to a set of NUMA
nodes, we should scan only those nodes when possible, but currently we
will scan kmem from all online nodes even if the kmem shrinker is NUMA
aware. That said, binding a process to a particular NUMA node won't
prevent it from shrinking inode/dentry caches from other nodes, which is
not good. Fix this.

Signed-off-by: Vladimir Davydov <[email protected]>
Cc: Mel Gorman <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: Johannes Weiner <[email protected]>
Cc: Rik van Riel <[email protected]>
Cc: Dave Chinner <[email protected]>
Cc: Glauber Costa <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
Signed-off-by: Mel Gorman <[email protected]>
Signed-off-by: Greg Kroah-Hartman <[email protected]>

---
mm/vmscan.c | 4 ++--
1 file changed, 2 insertions(+), 2 deletions(-)

--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -2424,8 +2424,8 @@ static unsigned long do_try_to_free_page
unsigned long lru_pages = 0;

nodes_clear(shrink->nodes_to_scan);
- for_each_zone_zonelist(zone, z, zonelist,
- gfp_zone(sc->gfp_mask)) {
+ for_each_zone_zonelist_nodemask(zone, z, zonelist,
+ gfp_zone(sc->gfp_mask), sc->nodemask) {
if (!cpuset_zone_allowed_hardwall(zone, GFP_KERNEL))
continue;


2014-10-07 23:27:18

by Greg Kroah-Hartman

[permalink] [raw]
Subject: [PATCH 3.14 31/37] mm: optimize put_mems_allowed() usage

3.14-stable review patch. If anyone has any objections, please let me know.

------------------

From: Mel Gorman <[email protected]>

commit d26914d11751b23ca2e8747725f2cae10c2f2c1b upstream.

Since put_mems_allowed() is strictly optional, its a seqcount retry, we
don't need to evaluate the function if the allocation was in fact
successful, saving a smp_rmb some loads and comparisons on some relative
fast-paths.

Since the naming, get/put_mems_allowed() does suggest a mandatory
pairing, rename the interface, as suggested by Mel, to resemble the
seqcount interface.

This gives us: read_mems_allowed_begin() and read_mems_allowed_retry(),
where it is important to note that the return value of the latter call
is inverted from its previous incarnation.

Signed-off-by: Peter Zijlstra <[email protected]>
Signed-off-by: Mel Gorman <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
Signed-off-by: Mel Gorman <[email protected]>
Signed-off-by: Greg Kroah-Hartman <[email protected]>

---
include/linux/cpuset.h | 27 ++++++++++++++-------------
kernel/cpuset.c | 2 +-
mm/filemap.c | 4 ++--
mm/hugetlb.c | 4 ++--
mm/mempolicy.c | 12 ++++++------
mm/page_alloc.c | 8 ++++----
mm/slab.c | 4 ++--
mm/slub.c | 16 +++++++---------
8 files changed, 38 insertions(+), 39 deletions(-)

--- a/include/linux/cpuset.h
+++ b/include/linux/cpuset.h
@@ -87,25 +87,26 @@ extern void rebuild_sched_domains(void);
extern void cpuset_print_task_mems_allowed(struct task_struct *p);

/*
- * get_mems_allowed is required when making decisions involving mems_allowed
- * such as during page allocation. mems_allowed can be updated in parallel
- * and depending on the new value an operation can fail potentially causing
- * process failure. A retry loop with get_mems_allowed and put_mems_allowed
- * prevents these artificial failures.
+ * read_mems_allowed_begin is required when making decisions involving
+ * mems_allowed such as during page allocation. mems_allowed can be updated in
+ * parallel and depending on the new value an operation can fail potentially
+ * causing process failure. A retry loop with read_mems_allowed_begin and
+ * read_mems_allowed_retry prevents these artificial failures.
*/
-static inline unsigned int get_mems_allowed(void)
+static inline unsigned int read_mems_allowed_begin(void)
{
return read_seqcount_begin(&current->mems_allowed_seq);
}

/*
- * If this returns false, the operation that took place after get_mems_allowed
- * may have failed. It is up to the caller to retry the operation if
+ * If this returns true, the operation that took place after
+ * read_mems_allowed_begin may have failed artificially due to a concurrent
+ * update of mems_allowed. It is up to the caller to retry the operation if
* appropriate.
*/
-static inline bool put_mems_allowed(unsigned int seq)
+static inline bool read_mems_allowed_retry(unsigned int seq)
{
- return !read_seqcount_retry(&current->mems_allowed_seq, seq);
+ return read_seqcount_retry(&current->mems_allowed_seq, seq);
}

static inline void set_mems_allowed(nodemask_t nodemask)
@@ -225,14 +226,14 @@ static inline void set_mems_allowed(node
{
}

-static inline unsigned int get_mems_allowed(void)
+static inline unsigned int read_mems_allowed_begin(void)
{
return 0;
}

-static inline bool put_mems_allowed(unsigned int seq)
+static inline bool read_mems_allowed_retry(unsigned int seq)
{
- return true;
+ return false;
}

#endif /* !CONFIG_CPUSETS */
--- a/kernel/cpuset.c
+++ b/kernel/cpuset.c
@@ -1022,7 +1022,7 @@ static void cpuset_change_task_nodemask(
task_lock(tsk);
/*
* Determine if a loop is necessary if another thread is doing
- * get_mems_allowed(). If at least one node remains unchanged and
+ * read_mems_allowed_begin(). If at least one node remains unchanged and
* tsk does not have a mempolicy, then an empty nodemask will not be
* possible when mems_allowed is larger than a word.
*/
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -520,10 +520,10 @@ struct page *__page_cache_alloc(gfp_t gf
if (cpuset_do_page_mem_spread()) {
unsigned int cpuset_mems_cookie;
do {
- cpuset_mems_cookie = get_mems_allowed();
+ cpuset_mems_cookie = read_mems_allowed_begin();
n = cpuset_mem_spread_node();
page = alloc_pages_exact_node(n, gfp, 0);
- } while (!put_mems_allowed(cpuset_mems_cookie) && !page);
+ } while (!page && read_mems_allowed_retry(cpuset_mems_cookie));

return page;
}
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -540,7 +540,7 @@ static struct page *dequeue_huge_page_vm
goto err;

retry_cpuset:
- cpuset_mems_cookie = get_mems_allowed();
+ cpuset_mems_cookie = read_mems_allowed_begin();
zonelist = huge_zonelist(vma, address,
htlb_alloc_mask(h), &mpol, &nodemask);

@@ -562,7 +562,7 @@ retry_cpuset:
}

mpol_cond_put(mpol);
- if (unlikely(!put_mems_allowed(cpuset_mems_cookie) && !page))
+ if (unlikely(!page && read_mems_allowed_retry(cpuset_mems_cookie)))
goto retry_cpuset;
return page;

--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -1897,7 +1897,7 @@ int node_random(const nodemask_t *maskp)
* If the effective policy is 'BIND, returns a pointer to the mempolicy's
* @nodemask for filtering the zonelist.
*
- * Must be protected by get_mems_allowed()
+ * Must be protected by read_mems_allowed_begin()
*/
struct zonelist *huge_zonelist(struct vm_area_struct *vma, unsigned long addr,
gfp_t gfp_flags, struct mempolicy **mpol,
@@ -2061,7 +2061,7 @@ alloc_pages_vma(gfp_t gfp, int order, st

retry_cpuset:
pol = get_vma_policy(current, vma, addr);
- cpuset_mems_cookie = get_mems_allowed();
+ cpuset_mems_cookie = read_mems_allowed_begin();

if (unlikely(pol->mode == MPOL_INTERLEAVE)) {
unsigned nid;
@@ -2069,7 +2069,7 @@ retry_cpuset:
nid = interleave_nid(pol, vma, addr, PAGE_SHIFT + order);
mpol_cond_put(pol);
page = alloc_page_interleave(gfp, order, nid);
- if (unlikely(!put_mems_allowed(cpuset_mems_cookie) && !page))
+ if (unlikely(!page && read_mems_allowed_retry(cpuset_mems_cookie)))
goto retry_cpuset;

return page;
@@ -2079,7 +2079,7 @@ retry_cpuset:
policy_nodemask(gfp, pol));
if (unlikely(mpol_needs_cond_ref(pol)))
__mpol_put(pol);
- if (unlikely(!put_mems_allowed(cpuset_mems_cookie) && !page))
+ if (unlikely(!page && read_mems_allowed_retry(cpuset_mems_cookie)))
goto retry_cpuset;
return page;
}
@@ -2113,7 +2113,7 @@ struct page *alloc_pages_current(gfp_t g
pol = &default_policy;

retry_cpuset:
- cpuset_mems_cookie = get_mems_allowed();
+ cpuset_mems_cookie = read_mems_allowed_begin();

/*
* No reference counting needed for current->mempolicy
@@ -2126,7 +2126,7 @@ retry_cpuset:
policy_zonelist(gfp, pol, numa_node_id()),
policy_nodemask(gfp, pol));

- if (unlikely(!put_mems_allowed(cpuset_mems_cookie) && !page))
+ if (unlikely(!page && read_mems_allowed_retry(cpuset_mems_cookie)))
goto retry_cpuset;

return page;
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -2736,7 +2736,7 @@ __alloc_pages_nodemask(gfp_t gfp_mask, u
return NULL;

retry_cpuset:
- cpuset_mems_cookie = get_mems_allowed();
+ cpuset_mems_cookie = read_mems_allowed_begin();

/* The preferred zone is used for statistics later */
first_zones_zonelist(zonelist, high_zoneidx,
@@ -2791,7 +2791,7 @@ out:
* the mask is being updated. If a page allocation is about to fail,
* check if the cpuset changed during allocation and if so, retry.
*/
- if (unlikely(!put_mems_allowed(cpuset_mems_cookie) && !page))
+ if (unlikely(!page && read_mems_allowed_retry(cpuset_mems_cookie)))
goto retry_cpuset;

memcg_kmem_commit_charge(page, memcg, order);
@@ -3059,9 +3059,9 @@ bool skip_free_areas_node(unsigned int f
goto out;

do {
- cpuset_mems_cookie = get_mems_allowed();
+ cpuset_mems_cookie = read_mems_allowed_begin();
ret = !node_isset(nid, cpuset_current_mems_allowed);
- } while (!put_mems_allowed(cpuset_mems_cookie));
+ } while (read_mems_allowed_retry(cpuset_mems_cookie));
out:
return ret;
}
--- a/mm/slab.c
+++ b/mm/slab.c
@@ -3122,7 +3122,7 @@ static void *fallback_alloc(struct kmem_
local_flags = flags & (GFP_CONSTRAINT_MASK|GFP_RECLAIM_MASK);

retry_cpuset:
- cpuset_mems_cookie = get_mems_allowed();
+ cpuset_mems_cookie = read_mems_allowed_begin();
zonelist = node_zonelist(slab_node(), flags);

retry:
@@ -3180,7 +3180,7 @@ retry:
}
}

- if (unlikely(!put_mems_allowed(cpuset_mems_cookie) && !obj))
+ if (unlikely(!obj && read_mems_allowed_retry(cpuset_mems_cookie)))
goto retry_cpuset;
return obj;
}
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -1684,7 +1684,7 @@ static void *get_any_partial(struct kmem
return NULL;

do {
- cpuset_mems_cookie = get_mems_allowed();
+ cpuset_mems_cookie = read_mems_allowed_begin();
zonelist = node_zonelist(slab_node(), flags);
for_each_zone_zonelist(zone, z, zonelist, high_zoneidx) {
struct kmem_cache_node *n;
@@ -1696,19 +1696,17 @@ static void *get_any_partial(struct kmem
object = get_partial_node(s, n, c, flags);
if (object) {
/*
- * Return the object even if
- * put_mems_allowed indicated that
- * the cpuset mems_allowed was
- * updated in parallel. It's a
- * harmless race between the alloc
- * and the cpuset update.
+ * Don't check read_mems_allowed_retry()
+ * here - if mems_allowed was updated in
+ * parallel, that was a harmless race
+ * between allocation and the cpuset
+ * update
*/
- put_mems_allowed(cpuset_mems_cookie);
return object;
}
}
}
- } while (!put_mems_allowed(cpuset_mems_cookie));
+ } while (read_mems_allowed_retry(cpuset_mems_cookie));
#endif
return NULL;
}

2014-10-07 23:27:16

by Greg Kroah-Hartman

[permalink] [raw]
Subject: [PATCH 3.14 30/37] mm/readahead.c: fix readahead failure for memoryless NUMA nodes and limit readahead pages

3.14-stable review patch. If anyone has any objections, please let me know.

------------------

From: Raghavendra K T <[email protected]>

commit 6d2be915e589b58cb11418cbe1f22ff90732b6ac upstream.

Currently max_sane_readahead() returns zero on the cpu whose NUMA node
has no local memory which leads to readahead failure. Fix this
readahead failure by returning minimum of (requested pages, 512). Users
running applications on a memory-less cpu which needs readahead such as
streaming application see considerable boost in the performance.

Result:

fadvise experiment with FADV_WILLNEED on a PPC machine having memoryless
CPU with 1GB testfile (12 iterations) yielded around 46.66% improvement.

fadvise experiment with FADV_WILLNEED on a x240 machine with 1GB
testfile 32GB* 4G RAM numa machine (12 iterations) showed no impact on
the normal NUMA cases w/ patch.

Kernel Avg Stddev
base 7.4975 3.92%
patched 7.4174 3.26%

[Andrew: making return value PAGE_SIZE independent]
Suggested-by: Linus Torvalds <[email protected]>
Signed-off-by: Raghavendra K T <[email protected]>
Acked-by: Jan Kara <[email protected]>
Cc: Wu Fengguang <[email protected]>
Cc: David Rientjes <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
Signed-off-by: Mel Gorman <[email protected]>
Signed-off-by: Greg Kroah-Hartman <[email protected]>

---
mm/readahead.c | 4 ++--
1 file changed, 2 insertions(+), 2 deletions(-)

--- a/mm/readahead.c
+++ b/mm/readahead.c
@@ -233,14 +233,14 @@ int force_page_cache_readahead(struct ad
return 0;
}

+#define MAX_READAHEAD ((512*4096)/PAGE_CACHE_SIZE)
/*
* Given a desired number of PAGE_CACHE_SIZE readahead pages, return a
* sensible upper limit.
*/
unsigned long max_sane_readahead(unsigned long nr)
{
- return min(nr, (node_page_state(numa_node_id(), NR_INACTIVE_FILE)
- + node_page_state(numa_node_id(), NR_FREE_PAGES)) / 2);
+ return min(nr, MAX_READAHEAD);
}

/*

2014-10-07 23:27:14

by Greg Kroah-Hartman

[permalink] [raw]
Subject: [PATCH 3.14 32/37] mm/filemap.c: avoid always dirtying mapping->flags on O_DIRECT

3.14-stable review patch. If anyone has any objections, please let me know.

------------------

From: Jens Axboe <[email protected]>

commit 7fcbbaf18392f0b17c95e2f033c8ccf87eecde1d upstream.

In some testing I ran today (some fio jobs that spread over two nodes),
we end up spending 40% of the time in filemap_check_errors(). That
smells fishy. Looking further, this is basically what happens:

blkdev_aio_read()
generic_file_aio_read()
filemap_write_and_wait_range()
if (!mapping->nr_pages)
filemap_check_errors()

and filemap_check_errors() always attempts two test_and_clear_bit() on
the mapping flags, thus dirtying it for every single invocation. The
patch below tests each of these bits before clearing them, avoiding this
issue. In my test case (4-socket box), performance went from 1.7M IOPS
to 4.0M IOPS.

Signed-off-by: Jens Axboe <[email protected]>
Acked-by: Jeff Moyer <[email protected]>
Cc: Al Viro <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
Signed-off-by: Mel Gorman <[email protected]>
Signed-off-by: Greg Kroah-Hartman <[email protected]>

---
mm/filemap.c | 6 ++++--
1 file changed, 4 insertions(+), 2 deletions(-)

--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -192,9 +192,11 @@ static int filemap_check_errors(struct a
{
int ret = 0;
/* Check for outstanding write errors */
- if (test_and_clear_bit(AS_ENOSPC, &mapping->flags))
+ if (test_bit(AS_ENOSPC, &mapping->flags) &&
+ test_and_clear_bit(AS_ENOSPC, &mapping->flags))
ret = -ENOSPC;
- if (test_and_clear_bit(AS_EIO, &mapping->flags))
+ if (test_bit(AS_EIO, &mapping->flags) &&
+ test_and_clear_bit(AS_EIO, &mapping->flags))
ret = -EIO;
return ret;
}

2014-10-07 23:28:09

by Greg Kroah-Hartman

[permalink] [raw]
Subject: [PATCH 3.14 28/37] mm, compaction: determine isolation mode only once

3.14-stable review patch. If anyone has any objections, please let me know.

------------------

From: David Rientjes <[email protected]>

commit da1c67a76f7cf2b3404823d24f9f10fa91aa5dc5 upstream.

The conditions that control the isolation mode in
isolate_migratepages_range() do not change during the iteration, so
extract them out and only define the value once.

This actually does have an effect, gcc doesn't optimize it itself because
of cc->sync.

Signed-off-by: David Rientjes <[email protected]>
Cc: Mel Gorman <[email protected]>
Acked-by: Rik van Riel <[email protected]>
Acked-by: Vlastimil Babka <[email protected]>
Cc: Joonsoo Kim <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
Signed-off-by: Mel Gorman <[email protected]>
Signed-off-by: Greg Kroah-Hartman <[email protected]>

---
mm/compaction.c | 9 ++-------
1 file changed, 2 insertions(+), 7 deletions(-)

--- a/mm/compaction.c
+++ b/mm/compaction.c
@@ -460,12 +460,13 @@ isolate_migratepages_range(struct zone *
unsigned long last_pageblock_nr = 0, pageblock_nr;
unsigned long nr_scanned = 0, nr_isolated = 0;
struct list_head *migratelist = &cc->migratepages;
- isolate_mode_t mode = 0;
struct lruvec *lruvec;
unsigned long flags;
bool locked = false;
struct page *page = NULL, *valid_page = NULL;
bool skipped_async_unsuitable = false;
+ const isolate_mode_t mode = (!cc->sync ? ISOLATE_ASYNC_MIGRATE : 0) |
+ (unevictable ? ISOLATE_UNEVICTABLE : 0);

/*
* Ensure that there are not too many pages isolated from the LRU
@@ -608,12 +609,6 @@ isolate_migratepages_range(struct zone *
continue;
}

- if (!cc->sync)
- mode |= ISOLATE_ASYNC_MIGRATE;
-
- if (unevictable)
- mode |= ISOLATE_UNEVICTABLE;
-
lruvec = mem_cgroup_page_lruvec(page, zone);

/* Try isolate the page */

2014-10-07 23:28:06

by Greg Kroah-Hartman

[permalink] [raw]
Subject: [PATCH 3.14 29/37] mm, compaction: ignore pageblock skip when manually invoking compaction

3.14-stable review patch. If anyone has any objections, please let me know.

------------------

From: David Rientjes <[email protected]>

commit 91ca9186484809c57303b33778d841cc28f696ed upstream.

The cached pageblock hint should be ignored when triggering compaction
through /proc/sys/vm/compact_memory so all eligible memory is isolated.
Manually invoking compaction is known to be expensive, there's no need
to skip pageblocks based on heuristics (mainly for debugging).

Signed-off-by: David Rientjes <[email protected]>
Acked-by: Mel Gorman <[email protected]>
Cc: Rik van Riel <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
Signed-off-by: Mel Gorman <[email protected]>
Signed-off-by: Greg Kroah-Hartman <[email protected]>

---
mm/compaction.c | 1 +
1 file changed, 1 insertion(+)

--- a/mm/compaction.c
+++ b/mm/compaction.c
@@ -1193,6 +1193,7 @@ static void compact_node(int nid)
struct compact_control cc = {
.order = -1,
.sync = true,
+ .ignore_skip_hint = true,
};

__compact_pgdat(NODE_DATA(nid), &cc);

2014-10-07 23:28:48

by Greg Kroah-Hartman

[permalink] [raw]
Subject: [PATCH 3.14 36/37] mm: per-thread vma caching

3.14-stable review patch. If anyone has any objections, please let me know.

------------------

From: Davidlohr Bueso <[email protected]>

commit 615d6e8756c87149f2d4c1b93d471bca002bd849 upstream.

This patch is a continuation of efforts trying to optimize find_vma(),
avoiding potentially expensive rbtree walks to locate a vma upon faults.
The original approach (https://lkml.org/lkml/2013/11/1/410), where the
largest vma was also cached, ended up being too specific and random,
thus further comparison with other approaches were needed. There are
two things to consider when dealing with this, the cache hit rate and
the latency of find_vma(). Improving the hit-rate does not necessarily
translate in finding the vma any faster, as the overhead of any fancy
caching schemes can be too high to consider.

We currently cache the last used vma for the whole address space, which
provides a nice optimization, reducing the total cycles in find_vma() by
up to 250%, for workloads with good locality. On the other hand, this
simple scheme is pretty much useless for workloads with poor locality.
Analyzing ebizzy runs shows that, no matter how many threads are
running, the mmap_cache hit rate is less than 2%, and in many situations
below 1%.

The proposed approach is to replace this scheme with a small per-thread
cache, maximizing hit rates at a very low maintenance cost.
Invalidations are performed by simply bumping up a 32-bit sequence
number. The only expensive operation is in the rare case of a seq
number overflow, where all caches that share the same address space are
flushed. Upon a miss, the proposed replacement policy is based on the
page number that contains the virtual address in question. Concretely,
the following results are seen on an 80 core, 8 socket x86-64 box:

1) System bootup: Most programs are single threaded, so the per-thread
scheme does improve ~50% hit rate by just adding a few more slots to
the cache.

+----------------+----------+------------------+
| caching scheme | hit-rate | cycles (billion) |
+----------------+----------+------------------+
| baseline | 50.61% | 19.90 |
| patched | 73.45% | 13.58 |
+----------------+----------+------------------+

2) Kernel build: This one is already pretty good with the current
approach as we're dealing with good locality.

+----------------+----------+------------------+
| caching scheme | hit-rate | cycles (billion) |
+----------------+----------+------------------+
| baseline | 75.28% | 11.03 |
| patched | 88.09% | 9.31 |
+----------------+----------+------------------+

3) Oracle 11g Data Mining (4k pages): Similar to the kernel build workload.

+----------------+----------+------------------+
| caching scheme | hit-rate | cycles (billion) |
+----------------+----------+------------------+
| baseline | 70.66% | 17.14 |
| patched | 91.15% | 12.57 |
+----------------+----------+------------------+

4) Ebizzy: There's a fair amount of variation from run to run, but this
approach always shows nearly perfect hit rates, while baseline is just
about non-existent. The amounts of cycles can fluctuate between
anywhere from ~60 to ~116 for the baseline scheme, but this approach
reduces it considerably. For instance, with 80 threads:

+----------------+----------+------------------+
| caching scheme | hit-rate | cycles (billion) |
+----------------+----------+------------------+
| baseline | 1.06% | 91.54 |
| patched | 99.97% | 14.18 |
+----------------+----------+------------------+

[[email protected]: fix nommu build, per Davidlohr]
[[email protected]: document vmacache_valid() logic]
[[email protected]: attempt to untangle header files]
[[email protected]: add vmacache_find() BUG_ON]
[[email protected]: add vmacache_valid_mm() (from Oleg)]
[[email protected]: coding-style fixes]
[[email protected]: adjust and enhance comments]
Signed-off-by: Davidlohr Bueso <[email protected]>
Reviewed-by: Rik van Riel <[email protected]>
Acked-by: Linus Torvalds <[email protected]>
Reviewed-by: Michel Lespinasse <[email protected]>
Cc: Oleg Nesterov <[email protected]>
Tested-by: Hugh Dickins <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
Signed-off-by: Mel Gorman <[email protected]>
Signed-off-by: Greg Kroah-Hartman <[email protected]>

---
arch/unicore32/include/asm/mmu_context.h | 4 -
fs/exec.c | 5 +
fs/proc/task_mmu.c | 3
include/linux/mm_types.h | 4 -
include/linux/sched.h | 7 +
include/linux/vmacache.h | 38 ++++++++++
kernel/debug/debug_core.c | 14 +++
kernel/fork.c | 7 +
mm/Makefile | 2
mm/mmap.c | 51 +++++++-------
mm/nommu.c | 24 ++++--
mm/vmacache.c | 112 +++++++++++++++++++++++++++++++
12 files changed, 229 insertions(+), 42 deletions(-)

--- a/arch/unicore32/include/asm/mmu_context.h
+++ b/arch/unicore32/include/asm/mmu_context.h
@@ -14,6 +14,8 @@

#include <linux/compiler.h>
#include <linux/sched.h>
+#include <linux/mm.h>
+#include <linux/vmacache.h>
#include <linux/io.h>

#include <asm/cacheflush.h>
@@ -73,7 +75,7 @@ do { \
else \
mm->mmap = NULL; \
rb_erase(&high_vma->vm_rb, &mm->mm_rb); \
- mm->mmap_cache = NULL; \
+ vmacache_invalidate(mm); \
mm->map_count--; \
remove_vma(high_vma); \
} \
--- a/fs/exec.c
+++ b/fs/exec.c
@@ -26,6 +26,7 @@
#include <linux/file.h>
#include <linux/fdtable.h>
#include <linux/mm.h>
+#include <linux/vmacache.h>
#include <linux/stat.h>
#include <linux/fcntl.h>
#include <linux/swap.h>
@@ -820,7 +821,7 @@ EXPORT_SYMBOL(read_code);
static int exec_mmap(struct mm_struct *mm)
{
struct task_struct *tsk;
- struct mm_struct * old_mm, *active_mm;
+ struct mm_struct *old_mm, *active_mm;

/* Notify parent that we're no longer interested in the old VM */
tsk = current;
@@ -846,6 +847,8 @@ static int exec_mmap(struct mm_struct *m
tsk->mm = mm;
tsk->active_mm = mm;
activate_mm(active_mm, mm);
+ tsk->mm->vmacache_seqnum = 0;
+ vmacache_flush(tsk);
task_unlock(tsk);
if (old_mm) {
up_read(&old_mm->mmap_sem);
--- a/fs/proc/task_mmu.c
+++ b/fs/proc/task_mmu.c
@@ -1,4 +1,5 @@
#include <linux/mm.h>
+#include <linux/vmacache.h>
#include <linux/hugetlb.h>
#include <linux/huge_mm.h>
#include <linux/mount.h>
@@ -152,7 +153,7 @@ static void *m_start(struct seq_file *m,

/*
* We remember last_addr rather than next_addr to hit with
- * mmap_cache most of the time. We have zero last_addr at
+ * vmacache most of the time. We have zero last_addr at
* the beginning and also after lseek. We will have -1 last_addr
* after the end of the vmas.
*/
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -342,9 +342,9 @@ struct mm_rss_stat {

struct kioctx_table;
struct mm_struct {
- struct vm_area_struct * mmap; /* list of VMAs */
+ struct vm_area_struct *mmap; /* list of VMAs */
struct rb_root mm_rb;
- struct vm_area_struct * mmap_cache; /* last find_vma result */
+ u32 vmacache_seqnum; /* per-thread vmacache */
#ifdef CONFIG_MMU
unsigned long (*get_unmapped_area) (struct file *filp,
unsigned long addr, unsigned long len,
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -59,6 +59,10 @@ struct sched_param {

#define SCHED_ATTR_SIZE_VER0 48 /* sizeof first published struct */

+#define VMACACHE_BITS 2
+#define VMACACHE_SIZE (1U << VMACACHE_BITS)
+#define VMACACHE_MASK (VMACACHE_SIZE - 1)
+
/*
* Extended scheduling parameters data structure.
*
@@ -1228,6 +1232,9 @@ struct task_struct {
#ifdef CONFIG_COMPAT_BRK
unsigned brk_randomized:1;
#endif
+ /* per-thread vma caching */
+ u32 vmacache_seqnum;
+ struct vm_area_struct *vmacache[VMACACHE_SIZE];
#if defined(SPLIT_RSS_COUNTING)
struct task_rss_stat rss_stat;
#endif
--- /dev/null
+++ b/include/linux/vmacache.h
@@ -0,0 +1,38 @@
+#ifndef __LINUX_VMACACHE_H
+#define __LINUX_VMACACHE_H
+
+#include <linux/sched.h>
+#include <linux/mm.h>
+
+/*
+ * Hash based on the page number. Provides a good hit rate for
+ * workloads with good locality and those with random accesses as well.
+ */
+#define VMACACHE_HASH(addr) ((addr >> PAGE_SHIFT) & VMACACHE_MASK)
+
+static inline void vmacache_flush(struct task_struct *tsk)
+{
+ memset(tsk->vmacache, 0, sizeof(tsk->vmacache));
+}
+
+extern void vmacache_flush_all(struct mm_struct *mm);
+extern void vmacache_update(unsigned long addr, struct vm_area_struct *newvma);
+extern struct vm_area_struct *vmacache_find(struct mm_struct *mm,
+ unsigned long addr);
+
+#ifndef CONFIG_MMU
+extern struct vm_area_struct *vmacache_find_exact(struct mm_struct *mm,
+ unsigned long start,
+ unsigned long end);
+#endif
+
+static inline void vmacache_invalidate(struct mm_struct *mm)
+{
+ mm->vmacache_seqnum++;
+
+ /* deal with overflows */
+ if (unlikely(mm->vmacache_seqnum == 0))
+ vmacache_flush_all(mm);
+}
+
+#endif /* __LINUX_VMACACHE_H */
--- a/kernel/debug/debug_core.c
+++ b/kernel/debug/debug_core.c
@@ -49,6 +49,7 @@
#include <linux/pid.h>
#include <linux/smp.h>
#include <linux/mm.h>
+#include <linux/vmacache.h>
#include <linux/rcupdate.h>

#include <asm/cacheflush.h>
@@ -224,10 +225,17 @@ static void kgdb_flush_swbreak_addr(unsi
if (!CACHE_FLUSH_IS_SAFE)
return;

- if (current->mm && current->mm->mmap_cache) {
- flush_cache_range(current->mm->mmap_cache,
- addr, addr + BREAK_INSTR_SIZE);
+ if (current->mm) {
+ int i;
+
+ for (i = 0; i < VMACACHE_SIZE; i++) {
+ if (!current->vmacache[i])
+ continue;
+ flush_cache_range(current->vmacache[i],
+ addr, addr + BREAK_INSTR_SIZE);
+ }
}
+
/* Force flush instruction cache if it was outside the mm */
flush_icache_range(addr, addr + BREAK_INSTR_SIZE);
}
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -28,6 +28,8 @@
#include <linux/mman.h>
#include <linux/mmu_notifier.h>
#include <linux/fs.h>
+#include <linux/mm.h>
+#include <linux/vmacache.h>
#include <linux/nsproxy.h>
#include <linux/capability.h>
#include <linux/cpu.h>
@@ -363,7 +365,7 @@ static int dup_mmap(struct mm_struct *mm

mm->locked_vm = 0;
mm->mmap = NULL;
- mm->mmap_cache = NULL;
+ mm->vmacache_seqnum = 0;
mm->map_count = 0;
cpumask_clear(mm_cpumask(mm));
mm->mm_rb = RB_ROOT;
@@ -876,6 +878,9 @@ static int copy_mm(unsigned long clone_f
if (!oldmm)
return 0;

+ /* initialize the new vmacache entries */
+ vmacache_flush(tsk);
+
if (clone_flags & CLONE_VM) {
atomic_inc(&oldmm->mm_users);
mm = oldmm;
--- a/mm/Makefile
+++ b/mm/Makefile
@@ -16,7 +16,7 @@ obj-y := filemap.o mempool.o oom_kill.
readahead.o swap.o truncate.o vmscan.o shmem.o \
util.o mmzone.o vmstat.o backing-dev.o \
mm_init.o mmu_context.o percpu.o slab_common.o \
- compaction.o balloon_compaction.o \
+ compaction.o balloon_compaction.o vmacache.o \
interval_tree.o list_lru.o $(mmu-y)

obj-y += init-mm.o
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -10,6 +10,7 @@
#include <linux/slab.h>
#include <linux/backing-dev.h>
#include <linux/mm.h>
+#include <linux/vmacache.h>
#include <linux/shm.h>
#include <linux/mman.h>
#include <linux/pagemap.h>
@@ -681,8 +682,9 @@ __vma_unlink(struct mm_struct *mm, struc
prev->vm_next = next = vma->vm_next;
if (next)
next->vm_prev = prev;
- if (mm->mmap_cache == vma)
- mm->mmap_cache = prev;
+
+ /* Kill the cache */
+ vmacache_invalidate(mm);
}

/*
@@ -1989,34 +1991,33 @@ EXPORT_SYMBOL(get_unmapped_area);
/* Look up the first VMA which satisfies addr < vm_end, NULL if none. */
struct vm_area_struct *find_vma(struct mm_struct *mm, unsigned long addr)
{
- struct vm_area_struct *vma = NULL;
+ struct rb_node *rb_node;
+ struct vm_area_struct *vma;

/* Check the cache first. */
- /* (Cache hit rate is typically around 35%.) */
- vma = ACCESS_ONCE(mm->mmap_cache);
- if (!(vma && vma->vm_end > addr && vma->vm_start <= addr)) {
- struct rb_node *rb_node;
+ vma = vmacache_find(mm, addr);
+ if (likely(vma))
+ return vma;

- rb_node = mm->mm_rb.rb_node;
- vma = NULL;
+ rb_node = mm->mm_rb.rb_node;
+ vma = NULL;

- while (rb_node) {
- struct vm_area_struct *vma_tmp;
+ while (rb_node) {
+ struct vm_area_struct *tmp;

- vma_tmp = rb_entry(rb_node,
- struct vm_area_struct, vm_rb);
+ tmp = rb_entry(rb_node, struct vm_area_struct, vm_rb);

- if (vma_tmp->vm_end > addr) {
- vma = vma_tmp;
- if (vma_tmp->vm_start <= addr)
- break;
- rb_node = rb_node->rb_left;
- } else
- rb_node = rb_node->rb_right;
- }
- if (vma)
- mm->mmap_cache = vma;
+ if (tmp->vm_end > addr) {
+ vma = tmp;
+ if (tmp->vm_start <= addr)
+ break;
+ rb_node = rb_node->rb_left;
+ } else
+ rb_node = rb_node->rb_right;
}
+
+ if (vma)
+ vmacache_update(addr, vma);
return vma;
}

@@ -2388,7 +2389,9 @@ detach_vmas_to_be_unmapped(struct mm_str
} else
mm->highest_vm_end = prev ? prev->vm_end : 0;
tail_vma->vm_next = NULL;
- mm->mmap_cache = NULL; /* Kill the cache. */
+
+ /* Kill the cache */
+ vmacache_invalidate(mm);
}

/*
--- a/mm/nommu.c
+++ b/mm/nommu.c
@@ -15,6 +15,7 @@

#include <linux/export.h>
#include <linux/mm.h>
+#include <linux/vmacache.h>
#include <linux/mman.h>
#include <linux/swap.h>
#include <linux/file.h>
@@ -768,16 +769,23 @@ static void add_vma_to_mm(struct mm_stru
*/
static void delete_vma_from_mm(struct vm_area_struct *vma)
{
+ int i;
struct address_space *mapping;
struct mm_struct *mm = vma->vm_mm;
+ struct task_struct *curr = current;

kenter("%p", vma);

protect_vma(vma, 0);

mm->map_count--;
- if (mm->mmap_cache == vma)
- mm->mmap_cache = NULL;
+ for (i = 0; i < VMACACHE_SIZE; i++) {
+ /* if the vma is cached, invalidate the entire cache */
+ if (curr->vmacache[i] == vma) {
+ vmacache_invalidate(curr->mm);
+ break;
+ }
+ }

/* remove the VMA from the mapping */
if (vma->vm_file) {
@@ -825,8 +833,8 @@ struct vm_area_struct *find_vma(struct m
struct vm_area_struct *vma;

/* check the cache first */
- vma = ACCESS_ONCE(mm->mmap_cache);
- if (vma && vma->vm_start <= addr && vma->vm_end > addr)
+ vma = vmacache_find(mm, addr);
+ if (likely(vma))
return vma;

/* trawl the list (there may be multiple mappings in which addr
@@ -835,7 +843,7 @@ struct vm_area_struct *find_vma(struct m
if (vma->vm_start > addr)
return NULL;
if (vma->vm_end > addr) {
- mm->mmap_cache = vma;
+ vmacache_update(addr, vma);
return vma;
}
}
@@ -874,8 +882,8 @@ static struct vm_area_struct *find_vma_e
unsigned long end = addr + len;

/* check the cache first */
- vma = mm->mmap_cache;
- if (vma && vma->vm_start == addr && vma->vm_end == end)
+ vma = vmacache_find_exact(mm, addr, end);
+ if (vma)
return vma;

/* trawl the list (there may be multiple mappings in which addr
@@ -886,7 +894,7 @@ static struct vm_area_struct *find_vma_e
if (vma->vm_start > addr)
return NULL;
if (vma->vm_end == end) {
- mm->mmap_cache = vma;
+ vmacache_update(addr, vma);
return vma;
}
}
--- /dev/null
+++ b/mm/vmacache.c
@@ -0,0 +1,112 @@
+/*
+ * Copyright (C) 2014 Davidlohr Bueso.
+ */
+#include <linux/sched.h>
+#include <linux/mm.h>
+#include <linux/vmacache.h>
+
+/*
+ * Flush vma caches for threads that share a given mm.
+ *
+ * The operation is safe because the caller holds the mmap_sem
+ * exclusively and other threads accessing the vma cache will
+ * have mmap_sem held at least for read, so no extra locking
+ * is required to maintain the vma cache.
+ */
+void vmacache_flush_all(struct mm_struct *mm)
+{
+ struct task_struct *g, *p;
+
+ rcu_read_lock();
+ for_each_process_thread(g, p) {
+ /*
+ * Only flush the vmacache pointers as the
+ * mm seqnum is already set and curr's will
+ * be set upon invalidation when the next
+ * lookup is done.
+ */
+ if (mm == p->mm)
+ vmacache_flush(p);
+ }
+ rcu_read_unlock();
+}
+
+/*
+ * This task may be accessing a foreign mm via (for example)
+ * get_user_pages()->find_vma(). The vmacache is task-local and this
+ * task's vmacache pertains to a different mm (ie, its own). There is
+ * nothing we can do here.
+ *
+ * Also handle the case where a kernel thread has adopted this mm via use_mm().
+ * That kernel thread's vmacache is not applicable to this mm.
+ */
+static bool vmacache_valid_mm(struct mm_struct *mm)
+{
+ return current->mm == mm && !(current->flags & PF_KTHREAD);
+}
+
+void vmacache_update(unsigned long addr, struct vm_area_struct *newvma)
+{
+ if (vmacache_valid_mm(newvma->vm_mm))
+ current->vmacache[VMACACHE_HASH(addr)] = newvma;
+}
+
+static bool vmacache_valid(struct mm_struct *mm)
+{
+ struct task_struct *curr;
+
+ if (!vmacache_valid_mm(mm))
+ return false;
+
+ curr = current;
+ if (mm->vmacache_seqnum != curr->vmacache_seqnum) {
+ /*
+ * First attempt will always be invalid, initialize
+ * the new cache for this task here.
+ */
+ curr->vmacache_seqnum = mm->vmacache_seqnum;
+ vmacache_flush(curr);
+ return false;
+ }
+ return true;
+}
+
+struct vm_area_struct *vmacache_find(struct mm_struct *mm, unsigned long addr)
+{
+ int i;
+
+ if (!vmacache_valid(mm))
+ return NULL;
+
+ for (i = 0; i < VMACACHE_SIZE; i++) {
+ struct vm_area_struct *vma = current->vmacache[i];
+
+ if (vma && vma->vm_start <= addr && vma->vm_end > addr) {
+ BUG_ON(vma->vm_mm != mm);
+ return vma;
+ }
+ }
+
+ return NULL;
+}
+
+#ifndef CONFIG_MMU
+struct vm_area_struct *vmacache_find_exact(struct mm_struct *mm,
+ unsigned long start,
+ unsigned long end)
+{
+ int i;
+
+ if (!vmacache_valid(mm))
+ return NULL;
+
+ for (i = 0; i < VMACACHE_SIZE; i++) {
+ struct vm_area_struct *vma = current->vmacache[i];
+
+ if (vma && vma->vm_start == start && vma->vm_end == end)
+ return vma;
+ }
+
+ return NULL;
+}
+#endif

2014-10-07 23:28:45

by Greg Kroah-Hartman

[permalink] [raw]
Subject: [PATCH 3.14 37/37] mm: dont pointlessly use BUG_ON() for sanity check

3.14-stable review patch. If anyone has any objections, please let me know.

------------------

From: Linus Torvalds <[email protected]>

commit 50f5aa8a9b248fa4262cf379863ec9a531b49737 upstream.

BUG_ON() is a big hammer, and should be used _only_ if there is some
major corruption that you cannot possibly recover from, making it
imperative that the current process (and possibly the whole machine) be
terminated with extreme prejudice.

The trivial sanity check in the vmacache code is *not* such a fatal
error. Recovering from it is absolutely trivial, and using BUG_ON()
just makes it harder to debug for no actual advantage.

To make matters worse, the placement of the BUG_ON() (only if the range
check matched) actually makes it harder to hit the sanity check to begin
with, so _if_ there is a bug (and we just got a report from Srivatsa
Bhat that this can indeed trigger), it is harder to debug not just
because the machine is possibly dead, but because we don't have better
coverage.

BUG_ON() must *die*. Maybe we should add a checkpatch warning for it,
because it is simply just about the worst thing you can ever do if you
hit some "this cannot happen" situation.

Reported-by: Srivatsa S. Bhat <[email protected]>
Cc: Davidlohr Bueso <[email protected]>
Cc: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
Signed-off-by: Mel Gorman <[email protected]>
Signed-off-by: Greg Kroah-Hartman <[email protected]>

---
mm/vmacache.c | 8 +++++---
1 file changed, 5 insertions(+), 3 deletions(-)

--- a/mm/vmacache.c
+++ b/mm/vmacache.c
@@ -81,10 +81,12 @@ struct vm_area_struct *vmacache_find(str
for (i = 0; i < VMACACHE_SIZE; i++) {
struct vm_area_struct *vma = current->vmacache[i];

- if (vma && vma->vm_start <= addr && vma->vm_end > addr) {
- BUG_ON(vma->vm_mm != mm);
+ if (!vma)
+ continue;
+ if (WARN_ON_ONCE(vma->vm_mm != mm))
+ break;
+ if (vma->vm_start <= addr && vma->vm_end > addr)
return vma;
- }
}

return NULL;

2014-10-07 23:28:43

by Greg Kroah-Hartman

[permalink] [raw]
Subject: [PATCH 3.14 27/37] mm/compaction: clean-up code on success of ballon isolation

3.14-stable review patch. If anyone has any objections, please let me know.

------------------

From: Joonsoo Kim <[email protected]>

commit b6c750163c0d138f5041d95fcdbd1094b6928057 upstream.

It is just for clean-up to reduce code size and improve readability.
There is no functional change.

Signed-off-by: Joonsoo Kim <[email protected]>
Acked-by: Vlastimil Babka <[email protected]>
Cc: Mel Gorman <[email protected]>
Cc: Rik van Riel <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
Signed-off-by: Mel Gorman <[email protected]>
Signed-off-by: Greg Kroah-Hartman <[email protected]>

---
mm/compaction.c | 11 ++++-------
1 file changed, 4 insertions(+), 7 deletions(-)

--- a/mm/compaction.c
+++ b/mm/compaction.c
@@ -562,11 +562,7 @@ isolate_migratepages_range(struct zone *
if (unlikely(balloon_page_movable(page))) {
if (locked && balloon_page_isolate(page)) {
/* Successfully isolated */
- cc->finished_update_migrate = true;
- list_add(&page->lru, migratelist);
- cc->nr_migratepages++;
- nr_isolated++;
- goto check_compact_cluster;
+ goto isolate_success;
}
}
continue;
@@ -627,13 +623,14 @@ isolate_migratepages_range(struct zone *
VM_BUG_ON_PAGE(PageTransCompound(page), page);

/* Successfully isolated */
- cc->finished_update_migrate = true;
del_page_from_lru_list(page, lruvec, page_lru(page));
+
+isolate_success:
+ cc->finished_update_migrate = true;
list_add(&page->lru, migratelist);
cc->nr_migratepages++;
nr_isolated++;

-check_compact_cluster:
/* Avoid isolating too much */
if (cc->nr_migratepages == COMPACT_CLUSTER_MAX) {
++low_pfn;

2014-10-07 23:29:35

by Greg Kroah-Hartman

[permalink] [raw]
Subject: [PATCH 3.14 35/37] vmscan: reclaim_clean_pages_from_list() must use mod_zone_page_state()

3.14-stable review patch. If anyone has any objections, please let me know.

------------------

From: Christoph Lameter <[email protected]>

commit 83da7510058736c09a14b9c17ec7d851940a4332 upstream.

Seems to be called with preemption enabled. Therefore it must use
mod_zone_page_state instead.

Signed-off-by: Christoph Lameter <[email protected]>
Reported-by: Grygorii Strashko <[email protected]>
Tested-by: Grygorii Strashko <[email protected]>
Cc: Tejun Heo <[email protected]>
Cc: Santosh Shilimkar <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Mel Gorman <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
Signed-off-by: Mel Gorman <[email protected]>
Signed-off-by: Greg Kroah-Hartman <[email protected]>

---
mm/vmscan.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)

--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1144,7 +1144,7 @@ unsigned long reclaim_clean_pages_from_l
TTU_UNMAP|TTU_IGNORE_ACCESS,
&dummy1, &dummy2, &dummy3, &dummy4, &dummy5, true);
list_splice(&clean_pages, page_list);
- __mod_zone_page_state(zone, NR_ISOLATED_FILE, -ret);
+ mod_zone_page_state(zone, NR_ISOLATED_FILE, -ret);
return ret;
}


2014-10-07 23:29:50

by Greg Kroah-Hartman

[permalink] [raw]
Subject: [PATCH 3.14 26/37] mm/compaction: check pageblock suitability once per pageblock

3.14-stable review patch. If anyone has any objections, please let me know.

------------------

From: Joonsoo Kim <[email protected]>

commit c122b2087ab94192f2b937e47b563a9c4e688ece upstream.

isolation_suitable() and migrate_async_suitable() is used to be sure
that this pageblock range is fine to be migragted. It isn't needed to
call it on every page. Current code do well if not suitable, but, don't
do well when suitable.

1) It re-checks isolation_suitable() on each page of a pageblock that was
already estabilished as suitable.
2) It re-checks migrate_async_suitable() on each page of a pageblock that
was not entered through the next_pageblock: label, because
last_pageblock_nr is not otherwise updated.

This patch fixes situation by 1) calling isolation_suitable() only once
per pageblock and 2) always updating last_pageblock_nr to the pageblock
that was just checked.

Additionally, move PageBuddy() check after pageblock unit check, since
pageblock check is the first thing we should do and makes things more
simple.

[[email protected]: rephrase commit description]
Signed-off-by: Joonsoo Kim <[email protected]>
Acked-by: Vlastimil Babka <[email protected]>
Cc: Mel Gorman <[email protected]>
Cc: Rik van Riel <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
Signed-off-by: Mel Gorman <[email protected]>
Signed-off-by: Greg Kroah-Hartman <[email protected]>

---
mm/compaction.c | 34 +++++++++++++++++++---------------
1 file changed, 19 insertions(+), 15 deletions(-)

--- a/mm/compaction.c
+++ b/mm/compaction.c
@@ -526,8 +526,25 @@ isolate_migratepages_range(struct zone *

/* If isolation recently failed, do not retry */
pageblock_nr = low_pfn >> pageblock_order;
- if (!isolation_suitable(cc, page))
- goto next_pageblock;
+ if (last_pageblock_nr != pageblock_nr) {
+ int mt;
+
+ last_pageblock_nr = pageblock_nr;
+ if (!isolation_suitable(cc, page))
+ goto next_pageblock;
+
+ /*
+ * For async migration, also only scan in MOVABLE
+ * blocks. Async migration is optimistic to see if
+ * the minimum amount of work satisfies the allocation
+ */
+ mt = get_pageblock_migratetype(page);
+ if (!cc->sync && !migrate_async_suitable(mt)) {
+ cc->finished_update_migrate = true;
+ skipped_async_unsuitable = true;
+ goto next_pageblock;
+ }
+ }

/*
* Skip if free. page_order cannot be used without zone->lock
@@ -537,18 +554,6 @@ isolate_migratepages_range(struct zone *
continue;

/*
- * For async migration, also only scan in MOVABLE blocks. Async
- * migration is optimistic to see if the minimum amount of work
- * satisfies the allocation
- */
- if (!cc->sync && last_pageblock_nr != pageblock_nr &&
- !migrate_async_suitable(get_pageblock_migratetype(page))) {
- cc->finished_update_migrate = true;
- skipped_async_unsuitable = true;
- goto next_pageblock;
- }
-
- /*
* Check may be lockless but that's ok as we recheck later.
* It's possible to migrate LRU pages and balloon pages
* Skip any other type of page
@@ -639,7 +644,6 @@ check_compact_cluster:

next_pageblock:
low_pfn = ALIGN(low_pfn + 1, pageblock_nr_pages) - 1;
- last_pageblock_nr = pageblock_nr;
}

acct_isolated(zone, locked, cc);

2014-10-07 23:30:14

by Greg Kroah-Hartman

[permalink] [raw]
Subject: [PATCH 3.14 14/37] media: vb2: fix VBI/poll regression

3.14-stable review patch. If anyone has any objections, please let me know.

------------------

From: Hans Verkuil <[email protected]>

commit 58d75f4b1ce26324b4d809b18f94819843a98731 upstream.

The recent conversion of saa7134 to vb2 unconvered a poll() bug that
broke the teletext applications alevt and mtt. These applications
expect that calling poll() without having called VIDIOC_STREAMON will
cause poll() to return POLLERR. That did not happen in vb2.

This patch fixes that behavior. It also fixes what should happen when
poll() is called when STREAMON is called but no buffers have been
queued. In that case poll() will also return POLLERR, but only for
capture queues since output queues will always return POLLOUT
anyway in that situation.

This brings the vb2 behavior in line with the old videobuf behavior.

Signed-off-by: Hans Verkuil <[email protected]>
Acked-by: Laurent Pinchart <[email protected]>
Signed-off-by: Mauro Carvalho Chehab <[email protected]>
Signed-off-by: Greg Kroah-Hartman <[email protected]>

---
drivers/media/v4l2-core/videobuf2-core.c | 15 +++++++++++++--
include/media/videobuf2-core.h | 4 ++++
2 files changed, 17 insertions(+), 2 deletions(-)

--- a/drivers/media/v4l2-core/videobuf2-core.c
+++ b/drivers/media/v4l2-core/videobuf2-core.c
@@ -745,6 +745,7 @@ static int __reqbufs(struct vb2_queue *q
* to the userspace.
*/
req->count = allocated_buffers;
+ q->waiting_for_buffers = !V4L2_TYPE_IS_OUTPUT(q->type);

return 0;
}
@@ -793,6 +794,7 @@ static int __create_bufs(struct vb2_queu
memset(q->plane_sizes, 0, sizeof(q->plane_sizes));
memset(q->alloc_ctx, 0, sizeof(q->alloc_ctx));
q->memory = create->memory;
+ q->waiting_for_buffers = !V4L2_TYPE_IS_OUTPUT(q->type);
}

num_buffers = min(create->count, VIDEO_MAX_FRAME - q->num_buffers);
@@ -1447,6 +1449,7 @@ static int vb2_internal_qbuf(struct vb2_
* dequeued in dqbuf.
*/
list_add_tail(&vb->queued_entry, &q->queued_list);
+ q->waiting_for_buffers = false;
vb->state = VB2_BUF_STATE_QUEUED;

/*
@@ -1841,6 +1844,7 @@ static int vb2_internal_streamoff(struct
* and videobuf, effectively returning control over them to userspace.
*/
__vb2_queue_cancel(q);
+ q->waiting_for_buffers = !V4L2_TYPE_IS_OUTPUT(q->type);

dprintk(3, "Streamoff successful\n");
return 0;
@@ -2150,9 +2154,16 @@ unsigned int vb2_poll(struct vb2_queue *
}

/*
- * There is nothing to wait for if no buffers have already been queued.
+ * There is nothing to wait for if the queue isn't streaming.
*/
- if (list_empty(&q->queued_list))
+ if (!vb2_is_streaming(q))
+ return res | POLLERR;
+ /*
+ * For compatibility with vb1: if QBUF hasn't been called yet, then
+ * return POLLERR as well. This only affects capture queues, output
+ * queues will always initialize waiting_for_buffers to false.
+ */
+ if (q->waiting_for_buffers)
return res | POLLERR;

if (list_empty(&q->done_list))
--- a/include/media/videobuf2-core.h
+++ b/include/media/videobuf2-core.h
@@ -329,6 +329,9 @@ struct v4l2_fh;
* @retry_start_streaming: start_streaming() was called, but there were not enough
* buffers queued. If set, then retry calling start_streaming when
* queuing a new buffer.
+ * @waiting_for_buffers: used in poll() to check if vb2 is still waiting for
+ * buffers. Only set for capture queues if qbuf has not yet been
+ * called since poll() needs to return POLLERR in that situation.
* @fileio: file io emulator internal data, used only if emulator is active
*/
struct vb2_queue {
@@ -362,6 +365,7 @@ struct vb2_queue {

unsigned int streaming:1;
unsigned int retry_start_streaming:1;
+ unsigned int waiting_for_buffers:1;

struct vb2_fileio_data *fileio;
};

2014-10-07 23:30:58

by Greg Kroah-Hartman

[permalink] [raw]
Subject: [PATCH 3.14 11/37] hugetlb: ensure hugepage access is denied if hugepages are not supported

3.14-stable review patch. If anyone has any objections, please let me know.

------------------

From: Nishanth Aravamudan <[email protected]>

commit 457c1b27ed56ec472d202731b12417bff023594a upstream.

Currently, I am seeing the following when I `mount -t hugetlbfs /none
/dev/hugetlbfs`, and then simply do a `ls /dev/hugetlbfs`. I think it's
related to the fact that hugetlbfs is properly not correctly setting
itself up in this state?:

Unable to handle kernel paging request for data at address 0x00000031
Faulting instruction address: 0xc000000000245710
Oops: Kernel access of bad area, sig: 11 [#1]
SMP NR_CPUS=2048 NUMA pSeries
....

In KVM guests on Power, in a guest not backed by hugepages, we see the
following:

AnonHugePages: 0 kB
HugePages_Total: 0
HugePages_Free: 0
HugePages_Rsvd: 0
HugePages_Surp: 0
Hugepagesize: 64 kB

HPAGE_SHIFT == 0 in this configuration, which indicates that hugepages
are not supported at boot-time, but this is only checked in
hugetlb_init(). Extract the check to a helper function, and use it in a
few relevant places.

This does make hugetlbfs not supported (not registered at all) in this
environment. I believe this is fine, as there are no valid hugepages
and that won't change at runtime.

[[email protected]: use pr_info(), per Mel]
[[email protected]: fix build when HPAGE_SHIFT is undefined]
Signed-off-by: Nishanth Aravamudan <[email protected]>
Reviewed-by: Aneesh Kumar K.V <[email protected]>
Acked-by: Mel Gorman <[email protected]>
Cc: Randy Dunlap <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
Signed-off-by: Greg Kroah-Hartman <[email protected]>

---
fs/hugetlbfs/inode.c | 5 +++++
include/linux/hugetlb.h | 10 ++++++++++
mm/hugetlb.c | 13 +++++++++++++
3 files changed, 28 insertions(+)

--- a/fs/hugetlbfs/inode.c
+++ b/fs/hugetlbfs/inode.c
@@ -1017,6 +1017,11 @@ static int __init init_hugetlbfs_fs(void
int error;
int i;

+ if (!hugepages_supported()) {
+ pr_info("hugetlbfs: disabling because there are no supported hugepage sizes\n");
+ return -ENOTSUPP;
+ }
+
error = bdi_init(&hugetlbfs_backing_dev_info);
if (error)
return error;
--- a/include/linux/hugetlb.h
+++ b/include/linux/hugetlb.h
@@ -400,6 +400,16 @@ static inline spinlock_t *huge_pte_lockp
return &mm->page_table_lock;
}

+static inline bool hugepages_supported(void)
+{
+ /*
+ * Some platform decide whether they support huge pages at boot
+ * time. On these, such as powerpc, HPAGE_SHIFT is set to 0 when
+ * there is no such support
+ */
+ return HPAGE_SHIFT != 0;
+}
+
#else /* CONFIG_HUGETLB_PAGE */
struct hstate {};
#define alloc_huge_page_node(h, nid) NULL
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -2071,6 +2071,9 @@ static int hugetlb_sysctl_handler_common
unsigned long tmp;
int ret;

+ if (!hugepages_supported())
+ return -ENOTSUPP;
+
tmp = h->max_huge_pages;

if (write && h->order >= MAX_ORDER)
@@ -2124,6 +2127,9 @@ int hugetlb_overcommit_handler(struct ct
unsigned long tmp;
int ret;

+ if (!hugepages_supported())
+ return -ENOTSUPP;
+
tmp = h->nr_overcommit_huge_pages;

if (write && h->order >= MAX_ORDER)
@@ -2149,6 +2155,8 @@ out:
void hugetlb_report_meminfo(struct seq_file *m)
{
struct hstate *h = &default_hstate;
+ if (!hugepages_supported())
+ return;
seq_printf(m,
"HugePages_Total: %5lu\n"
"HugePages_Free: %5lu\n"
@@ -2165,6 +2173,8 @@ void hugetlb_report_meminfo(struct seq_f
int hugetlb_report_node_meminfo(int nid, char *buf)
{
struct hstate *h = &default_hstate;
+ if (!hugepages_supported())
+ return 0;
return sprintf(buf,
"Node %d HugePages_Total: %5u\n"
"Node %d HugePages_Free: %5u\n"
@@ -2179,6 +2189,9 @@ void hugetlb_show_meminfo(void)
struct hstate *h;
int nid;

+ if (!hugepages_supported())
+ return;
+
for_each_node_state(nid, N_MEMORY)
for_each_hstate(h)
pr_info("Node %d hugepages_total=%u hugepages_free=%u hugepages_surp=%u hugepages_size=%lukB\n",

2014-10-07 23:30:57

by Greg Kroah-Hartman

[permalink] [raw]
Subject: [PATCH 3.14 13/37] mm: numa: Do not mark PTEs pte_numa when splitting huge pages

3.14-stable review patch. If anyone has any objections, please let me know.

------------------

From: Mel Gorman <[email protected]>

commit abc40bd2eeb77eb7c2effcaf63154aad929a1d5f upstream.

This patch reverts 1ba6e0b50b ("mm: numa: split_huge_page: transfer the
NUMA type from the pmd to the pte"). If a huge page is being split due
a protection change and the tail will be in a PROT_NONE vma then NUMA
hinting PTEs are temporarily created in the protected VMA.

VM_RW|VM_PROTNONE
|-----------------|
^
split here

In the specific case above, it should get fixed up by change_pte_range()
but there is a window of opportunity for weirdness to happen. Similarly,
if a huge page is shrunk and split during a protection update but before
pmd_numa is cleared then a pte_numa can be left behind.

Instead of adding complexity trying to deal with the case, this patch
will not mark PTEs NUMA when splitting a huge page. NUMA hinting faults
will not be triggered which is marginal in comparison to the complexity
in dealing with the corner cases during THP split.

Signed-off-by: Mel Gorman <[email protected]>
Acked-by: Rik van Riel <[email protected]>
Acked-by: Kirill A. Shutemov <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
Signed-off-by: Greg Kroah-Hartman <[email protected]>

---
mm/huge_memory.c | 7 +++++--
1 file changed, 5 insertions(+), 2 deletions(-)

--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1826,14 +1826,17 @@ static int __split_huge_page_map(struct
for (i = 0; i < HPAGE_PMD_NR; i++, haddr += PAGE_SIZE) {
pte_t *pte, entry;
BUG_ON(PageCompound(page+i));
+ /*
+ * Note that pmd_numa is not transferred deliberately
+ * to avoid any possibility that pte_numa leaks to
+ * a PROT_NONE VMA by accident.
+ */
entry = mk_pte(page + i, vma->vm_page_prot);
entry = maybe_mkwrite(pte_mkdirty(entry), vma);
if (!pmd_write(*pmd))
entry = pte_wrprotect(entry);
if (!pmd_young(*pmd))
entry = pte_mkold(entry);
- if (pmd_numa(*pmd))
- entry = pte_mknuma(entry);
pte = pte_offset_map(&_pmd, haddr);
BUG_ON(!pte_none(*pte));
set_pte_at(mm, haddr, pte, entry);

2014-10-07 23:20:03

by Greg Kroah-Hartman

[permalink] [raw]
Subject: [PATCH 3.14 09/37] ring-buffer: Fix infinite spin in reading buffer

3.14-stable review patch. If anyone has any objections, please let me know.

------------------

From: "Steven Rostedt (Red Hat)" <[email protected]>

commit 24607f114fd14f2f37e3e0cb3d47bce96e81e848 upstream.

Commit 651e22f2701b "ring-buffer: Always reset iterator to reader page"
fixed one bug but in the process caused another one. The reset is to
update the header page, but that fix also changed the way the cached
reads were updated. The cache reads are used to test if an iterator
needs to be updated or not.

A ring buffer iterator, when created, disables writes to the ring buffer
but does not stop other readers or consuming reads from happening.
Although all readers are synchronized via a lock, they are only
synchronized when in the ring buffer functions. Those functions may
be called by any number of readers. The iterator continues down when
its not interrupted by a consuming reader. If a consuming read
occurs, the iterator starts from the beginning of the buffer.

The way the iterator sees that a consuming read has happened since
its last read is by checking the reader "cache". The cache holds the
last counts of the read and the reader page itself.

Commit 651e22f2701b changed what was saved by the cache_read when
the rb_iter_reset() occurred, making the iterator never match the cache.
Then if the iterator calls rb_iter_reset(), it will go into an
infinite loop by checking if the cache doesn't match, doing the reset
and retrying, just to see that the cache still doesn't match! Which
should never happen as the reset is suppose to set the cache to the
current value and there's locks that keep a consuming reader from
having access to the data.

Fixes: 651e22f2701b "ring-buffer: Always reset iterator to reader page"
Signed-off-by: Steven Rostedt <[email protected]>
Signed-off-by: Greg Kroah-Hartman <[email protected]>

---
kernel/trace/ring_buffer.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)

--- a/kernel/trace/ring_buffer.c
+++ b/kernel/trace/ring_buffer.c
@@ -3372,7 +3372,7 @@ static void rb_iter_reset(struct ring_bu
iter->head = cpu_buffer->reader_page->read;

iter->cache_reader_page = iter->head_page;
- iter->cache_read = iter->head;
+ iter->cache_read = cpu_buffer->read;

if (iter->head)
iter->read_stamp = cpu_buffer->read_stamp;

2014-10-07 23:31:34

by Greg Kroah-Hartman

[permalink] [raw]
Subject: [PATCH 3.14 10/37] CIFS: Fix SMB2 readdir error handling

3.14-stable review patch. If anyone has any objections, please let me know.

------------------

From: Pavel Shilovsky <[email protected]>

commit 52755808d4525f4d5b86d112d36ffc7a46f3fb48 upstream.

SMB2 servers indicates the end of a directory search with
STATUS_NO_MORE_FILE error code that is not processed now.
This causes generic/257 xfstest to fail. Fix this by triggering
the end of search by this error code in SMB2_query_directory.

Also when negotiating CIFS protocol we tell the server to close
the search automatically at the end and there is no need to do
it itself. In the case of SMB2 protocol, we need to close it
explicitly - separate close directory checks for different
protocols.

Signed-off-by: Pavel Shilovsky <[email protected]>
Signed-off-by: Steve French <[email protected]>
Signed-off-by: Greg Kroah-Hartman <[email protected]>


---
fs/cifs/cifsglob.h | 2 ++
fs/cifs/file.c | 2 +-
fs/cifs/readdir.c | 2 +-
fs/cifs/smb1ops.c | 7 +++++++
fs/cifs/smb2maperror.c | 2 +-
fs/cifs/smb2ops.c | 9 +++++++++
fs/cifs/smb2pdu.c | 9 ++++-----
7 files changed, 25 insertions(+), 8 deletions(-)

--- a/fs/cifs/cifsglob.h
+++ b/fs/cifs/cifsglob.h
@@ -399,6 +399,8 @@ struct smb_version_operations {
const struct cifs_fid *, u32 *);
int (*set_acl)(struct cifs_ntsd *, __u32, struct inode *, const char *,
int);
+ /* check if we need to issue closedir */
+ bool (*dir_needs_close)(struct cifsFileInfo *);
};

struct smb_version_values {
--- a/fs/cifs/file.c
+++ b/fs/cifs/file.c
@@ -762,7 +762,7 @@ int cifs_closedir(struct inode *inode, s

cifs_dbg(FYI, "Freeing private data in close dir\n");
spin_lock(&cifs_file_list_lock);
- if (!cfile->srch_inf.endOfSearch && !cfile->invalidHandle) {
+ if (server->ops->dir_needs_close(cfile)) {
cfile->invalidHandle = true;
spin_unlock(&cifs_file_list_lock);
if (server->ops->close_dir)
--- a/fs/cifs/readdir.c
+++ b/fs/cifs/readdir.c
@@ -593,7 +593,7 @@ find_cifs_entry(const unsigned int xid,
/* close and restart search */
cifs_dbg(FYI, "search backing up - close and restart search\n");
spin_lock(&cifs_file_list_lock);
- if (!cfile->srch_inf.endOfSearch && !cfile->invalidHandle) {
+ if (server->ops->dir_needs_close(cfile)) {
cfile->invalidHandle = true;
spin_unlock(&cifs_file_list_lock);
if (server->ops->close_dir)
--- a/fs/cifs/smb1ops.c
+++ b/fs/cifs/smb1ops.c
@@ -1009,6 +1009,12 @@ cifs_is_read_op(__u32 oplock)
return oplock == OPLOCK_READ;
}

+static bool
+cifs_dir_needs_close(struct cifsFileInfo *cfile)
+{
+ return !cfile->srch_inf.endOfSearch && !cfile->invalidHandle;
+}
+
struct smb_version_operations smb1_operations = {
.send_cancel = send_nt_cancel,
.compare_fids = cifs_compare_fids,
@@ -1078,6 +1084,7 @@ struct smb_version_operations smb1_opera
.query_mf_symlink = cifs_query_mf_symlink,
.create_mf_symlink = cifs_create_mf_symlink,
.is_read_op = cifs_is_read_op,
+ .dir_needs_close = cifs_dir_needs_close,
#ifdef CONFIG_CIFS_XATTR
.query_all_EAs = CIFSSMBQAllEAs,
.set_EA = CIFSSMBSetEA,
--- a/fs/cifs/smb2maperror.c
+++ b/fs/cifs/smb2maperror.c
@@ -214,7 +214,7 @@ static const struct status_to_posix_erro
{STATUS_BREAKPOINT, -EIO, "STATUS_BREAKPOINT"},
{STATUS_SINGLE_STEP, -EIO, "STATUS_SINGLE_STEP"},
{STATUS_BUFFER_OVERFLOW, -EIO, "STATUS_BUFFER_OVERFLOW"},
- {STATUS_NO_MORE_FILES, -EIO, "STATUS_NO_MORE_FILES"},
+ {STATUS_NO_MORE_FILES, -ENODATA, "STATUS_NO_MORE_FILES"},
{STATUS_WAKE_SYSTEM_DEBUGGER, -EIO, "STATUS_WAKE_SYSTEM_DEBUGGER"},
{STATUS_HANDLES_CLOSED, -EIO, "STATUS_HANDLES_CLOSED"},
{STATUS_NO_INHERITANCE, -EIO, "STATUS_NO_INHERITANCE"},
--- a/fs/cifs/smb2ops.c
+++ b/fs/cifs/smb2ops.c
@@ -1102,6 +1102,12 @@ smb3_parse_lease_buf(void *buf, unsigned
return le32_to_cpu(lc->lcontext.LeaseState);
}

+static bool
+smb2_dir_needs_close(struct cifsFileInfo *cfile)
+{
+ return !cfile->invalidHandle;
+}
+
struct smb_version_operations smb20_operations = {
.compare_fids = smb2_compare_fids,
.setup_request = smb2_setup_request,
@@ -1175,6 +1181,7 @@ struct smb_version_operations smb20_oper
.create_lease_buf = smb2_create_lease_buf,
.parse_lease_buf = smb2_parse_lease_buf,
.clone_range = smb2_clone_range,
+ .dir_needs_close = smb2_dir_needs_close,
};

struct smb_version_operations smb21_operations = {
@@ -1250,6 +1257,7 @@ struct smb_version_operations smb21_oper
.create_lease_buf = smb2_create_lease_buf,
.parse_lease_buf = smb2_parse_lease_buf,
.clone_range = smb2_clone_range,
+ .dir_needs_close = smb2_dir_needs_close,
};

struct smb_version_operations smb30_operations = {
@@ -1328,6 +1336,7 @@ struct smb_version_operations smb30_oper
.parse_lease_buf = smb3_parse_lease_buf,
.clone_range = smb2_clone_range,
.validate_negotiate = smb3_validate_negotiate,
+ .dir_needs_close = smb2_dir_needs_close,
};

struct smb_version_values smb20_values = {
--- a/fs/cifs/smb2pdu.c
+++ b/fs/cifs/smb2pdu.c
@@ -2136,6 +2136,10 @@ SMB2_query_directory(const unsigned int
rsp = (struct smb2_query_directory_rsp *)iov[0].iov_base;

if (rc) {
+ if (rc == -ENODATA && rsp->hdr.Status == STATUS_NO_MORE_FILES) {
+ srch_inf->endOfSearch = true;
+ rc = 0;
+ }
cifs_stats_fail_inc(tcon, SMB2_QUERY_DIRECTORY_HE);
goto qdir_exit;
}
@@ -2173,11 +2177,6 @@ SMB2_query_directory(const unsigned int
else
cifs_dbg(VFS, "illegal search buffer type\n");

- if (rsp->hdr.Status == STATUS_NO_MORE_FILES)
- srch_inf->endOfSearch = 1;
- else
- srch_inf->endOfSearch = 0;
-
return rc;

qdir_exit:

2014-10-07 23:31:32

by Greg Kroah-Hartman

[permalink] [raw]
Subject: [PATCH 3.14 12/37] mm, thp: move invariant bug check out of loop in __split_huge_page_map

3.14-stable review patch. If anyone has any objections, please let me know.

------------------

From: Waiman Long <[email protected]>

commit f8303c2582b889351e261ff18c4d8eb197a77db2 upstream.

In __split_huge_page_map(), the check for page_mapcount(page) is
invariant within the for loop. Because of the fact that the macro is
implemented using atomic_read(), the redundant check cannot be optimized
away by the compiler leading to unnecessary read to the page structure.

This patch moves the invariant bug check out of the loop so that it will
be done only once. On a 3.16-rc1 based kernel, the execution time of a
microbenchmark that broke up 1000 transparent huge pages using munmap()
had an execution time of 38,245us and 38,548us with and without the
patch respectively. The performance gain is about 1%.

Signed-off-by: Waiman Long <[email protected]>
Acked-by: Kirill A. Shutemov <[email protected]>
Cc: Andrea Arcangeli <[email protected]>
Cc: Mel Gorman <[email protected]>
Cc: Rik van Riel <[email protected]>
Cc: Scott J Norton <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
Signed-off-by: Greg Kroah-Hartman <[email protected]>

---
mm/huge_memory.c | 4 ++--
1 file changed, 2 insertions(+), 2 deletions(-)

--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1819,6 +1819,8 @@ static int __split_huge_page_map(struct
if (pmd) {
pgtable = pgtable_trans_huge_withdraw(mm, pmd);
pmd_populate(mm, &_pmd, pgtable);
+ if (pmd_write(*pmd))
+ BUG_ON(page_mapcount(page) != 1);

haddr = address;
for (i = 0; i < HPAGE_PMD_NR; i++, haddr += PAGE_SIZE) {
@@ -1828,8 +1830,6 @@ static int __split_huge_page_map(struct
entry = maybe_mkwrite(pte_mkdirty(entry), vma);
if (!pmd_write(*pmd))
entry = pte_wrprotect(entry);
- else
- BUG_ON(page_mapcount(page) != 1);
if (!pmd_young(*pmd))
entry = pte_mkold(entry);
if (pmd_numa(*pmd))

2014-10-07 23:32:24

by Greg Kroah-Hartman

[permalink] [raw]
Subject: [PATCH 3.14 18/37] lib/plist: add helper functions

3.14-stable review patch. If anyone has any objections, please let me know.

------------------

From: Dan Streetman <[email protected]>

commit fd16618e12a05df79a3439d72d5ffdac5d34f3da upstream.

Add PLIST_HEAD() to plist.h, equivalent to LIST_HEAD() from list.h, to
define and initialize a struct plist_head.

Add plist_for_each_continue() and plist_for_each_entry_continue(),
equivalent to list_for_each_continue() and list_for_each_entry_continue(),
to iterate over a plist continuing after the current position.

Add plist_prev() and plist_next(), equivalent to (struct list_head*)->prev
and ->next, implemented by list_prev_entry() and list_next_entry(), to
access the prev/next struct plist_node entry. These are needed because
unlike struct list_head, direct access of the prev/next struct plist_node
isn't possible; the list must be navigated via the contained struct
list_head. e.g. instead of accessing the prev by list_prev_entry(node,
node_list) it can be accessed by plist_prev(node).

Signed-off-by: Dan Streetman <[email protected]>
Acked-by: Mel Gorman <[email protected]>
Cc: Paul Gortmaker <[email protected]>
Cc: Steven Rostedt <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Shaohua Li <[email protected]>
Cc: Hugh Dickins <[email protected]>
Cc: Dan Streetman <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: Christian Ehrhardt <[email protected]>
Cc: Weijie Yang <[email protected]>
Cc: Rik van Riel <[email protected]>
Cc: Johannes Weiner <[email protected]>
Cc: Bob Liu <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
Signed-off-by: Mel Gorman <[email protected]>
Signed-off-by: Greg Kroah-Hartman <[email protected]>

---
include/linux/plist.h | 43 +++++++++++++++++++++++++++++++++++++++++++
1 file changed, 43 insertions(+)

--- a/include/linux/plist.h
+++ b/include/linux/plist.h
@@ -98,6 +98,13 @@ struct plist_node {
}

/**
+ * PLIST_HEAD - declare and init plist_head
+ * @head: name for struct plist_head variable
+ */
+#define PLIST_HEAD(head) \
+ struct plist_head head = PLIST_HEAD_INIT(head)
+
+/**
* PLIST_NODE_INIT - static struct plist_node initializer
* @node: struct plist_node variable name
* @__prio: initial node priority
@@ -143,6 +150,16 @@ extern void plist_del(struct plist_node
list_for_each_entry(pos, &(head)->node_list, node_list)

/**
+ * plist_for_each_continue - continue iteration over the plist
+ * @pos: the type * to use as a loop cursor
+ * @head: the head for your list
+ *
+ * Continue to iterate over plist, continuing after the current position.
+ */
+#define plist_for_each_continue(pos, head) \
+ list_for_each_entry_continue(pos, &(head)->node_list, node_list)
+
+/**
* plist_for_each_safe - iterate safely over a plist of given type
* @pos: the type * to use as a loop counter
* @n: another type * to use as temporary storage
@@ -163,6 +180,18 @@ extern void plist_del(struct plist_node
list_for_each_entry(pos, &(head)->node_list, mem.node_list)

/**
+ * plist_for_each_entry_continue - continue iteration over list of given type
+ * @pos: the type * to use as a loop cursor
+ * @head: the head for your list
+ * @m: the name of the list_struct within the struct
+ *
+ * Continue to iterate over list of given type, continuing after
+ * the current position.
+ */
+#define plist_for_each_entry_continue(pos, head, m) \
+ list_for_each_entry_continue(pos, &(head)->node_list, m.node_list)
+
+/**
* plist_for_each_entry_safe - iterate safely over list of given type
* @pos: the type * to use as a loop counter
* @n: another type * to use as temporary storage
@@ -229,6 +258,20 @@ static inline int plist_node_empty(const
#endif

/**
+ * plist_next - get the next entry in list
+ * @pos: the type * to cursor
+ */
+#define plist_next(pos) \
+ list_next_entry(pos, node_list)
+
+/**
+ * plist_prev - get the prev entry in list
+ * @pos: the type * to cursor
+ */
+#define plist_prev(pos) \
+ list_prev_entry(pos, node_list)
+
+/**
* plist_first - return the first node (and thus, highest priority)
* @head: the &struct plist_head pointer
*

2014-10-07 23:32:44

by Greg Kroah-Hartman

[permalink] [raw]
Subject: [PATCH 3.14 17/37] swap: change swap_info singly-linked list to list_head

3.14-stable review patch. If anyone has any objections, please let me know.

------------------

From: Dan Streetman <[email protected]>

commit adfab836f4908deb049a5128082719e689eed964 upstream.

The logic controlling the singly-linked list of swap_info_struct entries
for all active, i.e. swapon'ed, swap targets is rather complex, because:

- it stores the entries in priority order
- there is a pointer to the highest priority entry
- there is a pointer to the highest priority not-full entry
- there is a highest_priority_index variable set outside the swap_lock
- swap entries of equal priority should be used equally

this complexity leads to bugs such as: https://lkml.org/lkml/2014/2/13/181
where different priority swap targets are incorrectly used equally.

That bug probably could be solved with the existing singly-linked lists,
but I think it would only add more complexity to the already difficult to
understand get_swap_page() swap_list iteration logic.

The first patch changes from a singly-linked list to a doubly-linked list
using list_heads; the highest_priority_index and related code are removed
and get_swap_page() starts each iteration at the highest priority
swap_info entry, even if it's full. While this does introduce unnecessary
list iteration (i.e. Schlemiel the painter's algorithm) in the case where
one or more of the highest priority entries are full, the iteration and
manipulation code is much simpler and behaves correctly re: the above bug;
and the fourth patch removes the unnecessary iteration.

The second patch adds some minor plist helper functions; nothing new
really, just functions to match existing regular list functions. These
are used by the next two patches.

The third patch adds plist_requeue(), which is used by get_swap_page() in
the next patch - it performs the requeueing of same-priority entries
(which moves the entry to the end of its priority in the plist), so that
all equal-priority swap_info_structs get used equally.

The fourth patch converts the main list into a plist, and adds a new plist
that contains only swap_info entries that are both active and not full.
As Mel suggested using plists allows removing all the ordering code from
swap - plists handle ordering automatically. The list naming is also
clarified now that there are two lists, with the original list changed
from swap_list_head to swap_active_head and the new list named
swap_avail_head. A new spinlock is also added for the new list, so
swap_info entries can be added or removed from the new list immediately as
they become full or not full.

This patch (of 4):

Replace the singly-linked list tracking active, i.e. swapon'ed,
swap_info_struct entries with a doubly-linked list using struct
list_heads. Simplify the logic iterating and manipulating the list of
entries, especially get_swap_page(), by using standard list_head
functions, and removing the highest priority iteration logic.

The change fixes the bug:
https://lkml.org/lkml/2014/2/13/181
in which different priority swap entries after the highest priority entry
are incorrectly used equally in pairs. The swap behavior is now as
advertised, i.e. different priority swap entries are used in order, and
equal priority swap targets are used concurrently.

Signed-off-by: Dan Streetman <[email protected]>
Acked-by: Mel Gorman <[email protected]>
Cc: Shaohua Li <[email protected]>
Cc: Hugh Dickins <[email protected]>
Cc: Dan Streetman <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: Christian Ehrhardt <[email protected]>
Cc: Weijie Yang <[email protected]>
Cc: Rik van Riel <[email protected]>
Cc: Johannes Weiner <[email protected]>
Cc: Bob Liu <[email protected]>
Cc: Steven Rostedt <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Paul Gortmaker <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
Signed-off-by: Mel Gorman <[email protected]>
Signed-off-by: Greg Kroah-Hartman <[email protected]>

---
include/linux/swap.h | 7 -
include/linux/swapfile.h | 2
mm/frontswap.c | 13 +--
mm/swapfile.c | 171 +++++++++++++++++++----------------------------
4 files changed, 78 insertions(+), 115 deletions(-)

--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -214,8 +214,8 @@ struct percpu_cluster {
struct swap_info_struct {
unsigned long flags; /* SWP_USED etc: see above */
signed short prio; /* swap priority of this type */
+ struct list_head list; /* entry in swap list */
signed char type; /* strange name for an index */
- signed char next; /* next type on the swap list */
unsigned int max; /* extent of the swap_map */
unsigned char *swap_map; /* vmalloc'ed array of usage counts */
struct swap_cluster_info *cluster_info; /* cluster info. Only for SSD */
@@ -255,11 +255,6 @@ struct swap_info_struct {
struct swap_cluster_info discard_cluster_tail; /* list tail of discard clusters */
};

-struct swap_list_t {
- int head; /* head of priority-ordered swapfile list */
- int next; /* swapfile to be used next */
-};
-
/* linux/mm/page_alloc.c */
extern unsigned long totalram_pages;
extern unsigned long totalreserve_pages;
--- a/include/linux/swapfile.h
+++ b/include/linux/swapfile.h
@@ -6,7 +6,7 @@
* want to expose them to the dozens of source files that include swap.h
*/
extern spinlock_t swap_lock;
-extern struct swap_list_t swap_list;
+extern struct list_head swap_list_head;
extern struct swap_info_struct *swap_info[];
extern int try_to_unuse(unsigned int, bool, unsigned long);

--- a/mm/frontswap.c
+++ b/mm/frontswap.c
@@ -327,15 +327,12 @@ EXPORT_SYMBOL(__frontswap_invalidate_are

static unsigned long __frontswap_curr_pages(void)
{
- int type;
unsigned long totalpages = 0;
struct swap_info_struct *si = NULL;

assert_spin_locked(&swap_lock);
- for (type = swap_list.head; type >= 0; type = si->next) {
- si = swap_info[type];
+ list_for_each_entry(si, &swap_list_head, list)
totalpages += atomic_read(&si->frontswap_pages);
- }
return totalpages;
}

@@ -347,11 +344,9 @@ static int __frontswap_unuse_pages(unsig
int si_frontswap_pages;
unsigned long total_pages_to_unuse = total;
unsigned long pages = 0, pages_to_unuse = 0;
- int type;

assert_spin_locked(&swap_lock);
- for (type = swap_list.head; type >= 0; type = si->next) {
- si = swap_info[type];
+ list_for_each_entry(si, &swap_list_head, list) {
si_frontswap_pages = atomic_read(&si->frontswap_pages);
if (total_pages_to_unuse < si_frontswap_pages) {
pages = pages_to_unuse = total_pages_to_unuse;
@@ -366,7 +361,7 @@ static int __frontswap_unuse_pages(unsig
}
vm_unacct_memory(pages);
*unused = pages_to_unuse;
- *swapid = type;
+ *swapid = si->type;
ret = 0;
break;
}
@@ -413,7 +408,7 @@ void frontswap_shrink(unsigned long targ
/*
* we don't want to hold swap_lock while doing a very
* lengthy try_to_unuse, but swap_list may change
- * so restart scan from swap_list.head each time
+ * so restart scan from swap_list_head each time
*/
spin_lock(&swap_lock);
ret = __frontswap_shrink(target_pages, &pages_to_unuse, &type);
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -51,14 +51,17 @@ atomic_long_t nr_swap_pages;
/* protected with swap_lock. reading in vm_swap_full() doesn't need lock */
long total_swap_pages;
static int least_priority;
-static atomic_t highest_priority_index = ATOMIC_INIT(-1);

static const char Bad_file[] = "Bad swap file entry ";
static const char Unused_file[] = "Unused swap file entry ";
static const char Bad_offset[] = "Bad swap offset entry ";
static const char Unused_offset[] = "Unused swap offset entry ";

-struct swap_list_t swap_list = {-1, -1};
+/*
+ * all active swap_info_structs
+ * protected with swap_lock, and ordered by priority.
+ */
+LIST_HEAD(swap_list_head);

struct swap_info_struct *swap_info[MAX_SWAPFILES];

@@ -640,66 +643,54 @@ no_page:

swp_entry_t get_swap_page(void)
{
- struct swap_info_struct *si;
+ struct swap_info_struct *si, *next;
pgoff_t offset;
- int type, next;
- int wrapped = 0;
- int hp_index;
+ struct list_head *tmp;

spin_lock(&swap_lock);
if (atomic_long_read(&nr_swap_pages) <= 0)
goto noswap;
atomic_long_dec(&nr_swap_pages);

- for (type = swap_list.next; type >= 0 && wrapped < 2; type = next) {
- hp_index = atomic_xchg(&highest_priority_index, -1);
- /*
- * highest_priority_index records current highest priority swap
- * type which just frees swap entries. If its priority is
- * higher than that of swap_list.next swap type, we use it. It
- * isn't protected by swap_lock, so it can be an invalid value
- * if the corresponding swap type is swapoff. We double check
- * the flags here. It's even possible the swap type is swapoff
- * and swapon again and its priority is changed. In such rare
- * case, low prority swap type might be used, but eventually
- * high priority swap will be used after several rounds of
- * swap.
- */
- if (hp_index != -1 && hp_index != type &&
- swap_info[type]->prio < swap_info[hp_index]->prio &&
- (swap_info[hp_index]->flags & SWP_WRITEOK)) {
- type = hp_index;
- swap_list.next = type;
- }
-
- si = swap_info[type];
- next = si->next;
- if (next < 0 ||
- (!wrapped && si->prio != swap_info[next]->prio)) {
- next = swap_list.head;
- wrapped++;
- }
-
+ list_for_each(tmp, &swap_list_head) {
+ si = list_entry(tmp, typeof(*si), list);
spin_lock(&si->lock);
- if (!si->highest_bit) {
- spin_unlock(&si->lock);
- continue;
- }
- if (!(si->flags & SWP_WRITEOK)) {
+ if (!si->highest_bit || !(si->flags & SWP_WRITEOK)) {
spin_unlock(&si->lock);
continue;
}

- swap_list.next = next;
+ /*
+ * rotate the current swap_info that we're going to use
+ * to after any other swap_info that have the same prio,
+ * so that all equal-priority swap_info get used equally
+ */
+ next = si;
+ list_for_each_entry_continue(next, &swap_list_head, list) {
+ if (si->prio != next->prio)
+ break;
+ list_rotate_left(&si->list);
+ next = si;
+ }

spin_unlock(&swap_lock);
/* This is called for allocating swap entry for cache */
offset = scan_swap_map(si, SWAP_HAS_CACHE);
spin_unlock(&si->lock);
if (offset)
- return swp_entry(type, offset);
+ return swp_entry(si->type, offset);
spin_lock(&swap_lock);
- next = swap_list.next;
+ /*
+ * if we got here, it's likely that si was almost full before,
+ * and since scan_swap_map() can drop the si->lock, multiple
+ * callers probably all tried to get a page from the same si
+ * and it filled up before we could get one. So we need to
+ * try again. Since we dropped the swap_lock, there may now
+ * be non-full higher priority swap_infos, and this si may have
+ * even been removed from the list (although very unlikely).
+ * Let's start over.
+ */
+ tmp = &swap_list_head;
}

atomic_long_inc(&nr_swap_pages);
@@ -766,27 +757,6 @@ out:
return NULL;
}

-/*
- * This swap type frees swap entry, check if it is the highest priority swap
- * type which just frees swap entry. get_swap_page() uses
- * highest_priority_index to search highest priority swap type. The
- * swap_info_struct.lock can't protect us if there are multiple swap types
- * active, so we use atomic_cmpxchg.
- */
-static void set_highest_priority_index(int type)
-{
- int old_hp_index, new_hp_index;
-
- do {
- old_hp_index = atomic_read(&highest_priority_index);
- if (old_hp_index != -1 &&
- swap_info[old_hp_index]->prio >= swap_info[type]->prio)
- break;
- new_hp_index = type;
- } while (atomic_cmpxchg(&highest_priority_index,
- old_hp_index, new_hp_index) != old_hp_index);
-}
-
static unsigned char swap_entry_free(struct swap_info_struct *p,
swp_entry_t entry, unsigned char usage)
{
@@ -830,7 +800,6 @@ static unsigned char swap_entry_free(str
p->lowest_bit = offset;
if (offset > p->highest_bit)
p->highest_bit = offset;
- set_highest_priority_index(p->type);
atomic_long_inc(&nr_swap_pages);
p->inuse_pages--;
frontswap_invalidate_page(p->type, offset);
@@ -1765,7 +1734,7 @@ static void _enable_swap_info(struct swa
unsigned char *swap_map,
struct swap_cluster_info *cluster_info)
{
- int i, prev;
+ struct swap_info_struct *si;

if (prio >= 0)
p->prio = prio;
@@ -1777,18 +1746,28 @@ static void _enable_swap_info(struct swa
atomic_long_add(p->pages, &nr_swap_pages);
total_swap_pages += p->pages;

- /* insert swap space into swap_list: */
- prev = -1;
- for (i = swap_list.head; i >= 0; i = swap_info[i]->next) {
- if (p->prio >= swap_info[i]->prio)
- break;
- prev = i;
+ assert_spin_locked(&swap_lock);
+ BUG_ON(!list_empty(&p->list));
+ /*
+ * insert into swap list; the list is in priority order,
+ * so that get_swap_page() can get a page from the highest
+ * priority swap_info_struct with available page(s), and
+ * swapoff can adjust the auto-assigned (i.e. negative) prio
+ * values for any lower-priority swap_info_structs when
+ * removing a negative-prio swap_info_struct
+ */
+ list_for_each_entry(si, &swap_list_head, list) {
+ if (p->prio >= si->prio) {
+ list_add_tail(&p->list, &si->list);
+ return;
+ }
}
- p->next = i;
- if (prev < 0)
- swap_list.head = swap_list.next = p->type;
- else
- swap_info[prev]->next = p->type;
+ /*
+ * this covers two cases:
+ * 1) p->prio is less than all existing prio
+ * 2) the swap list is empty
+ */
+ list_add_tail(&p->list, &swap_list_head);
}

static void enable_swap_info(struct swap_info_struct *p, int prio,
@@ -1823,8 +1802,7 @@ SYSCALL_DEFINE1(swapoff, const char __us
struct address_space *mapping;
struct inode *inode;
struct filename *pathname;
- int i, type, prev;
- int err;
+ int err, found = 0;
unsigned int old_block_size;

if (!capable(CAP_SYS_ADMIN))
@@ -1842,17 +1820,16 @@ SYSCALL_DEFINE1(swapoff, const char __us
goto out;

mapping = victim->f_mapping;
- prev = -1;
spin_lock(&swap_lock);
- for (type = swap_list.head; type >= 0; type = swap_info[type]->next) {
- p = swap_info[type];
+ list_for_each_entry(p, &swap_list_head, list) {
if (p->flags & SWP_WRITEOK) {
- if (p->swap_file->f_mapping == mapping)
+ if (p->swap_file->f_mapping == mapping) {
+ found = 1;
break;
+ }
}
- prev = type;
}
- if (type < 0) {
+ if (!found) {
err = -EINVAL;
spin_unlock(&swap_lock);
goto out_dput;
@@ -1864,20 +1841,16 @@ SYSCALL_DEFINE1(swapoff, const char __us
spin_unlock(&swap_lock);
goto out_dput;
}
- if (prev < 0)
- swap_list.head = p->next;
- else
- swap_info[prev]->next = p->next;
- if (type == swap_list.next) {
- /* just pick something that's safe... */
- swap_list.next = swap_list.head;
- }
spin_lock(&p->lock);
if (p->prio < 0) {
- for (i = p->next; i >= 0; i = swap_info[i]->next)
- swap_info[i]->prio = p->prio--;
+ struct swap_info_struct *si = p;
+
+ list_for_each_entry_continue(si, &swap_list_head, list) {
+ si->prio++;
+ }
least_priority++;
}
+ list_del_init(&p->list);
atomic_long_sub(p->pages, &nr_swap_pages);
total_swap_pages -= p->pages;
p->flags &= ~SWP_WRITEOK;
@@ -1885,7 +1858,7 @@ SYSCALL_DEFINE1(swapoff, const char __us
spin_unlock(&swap_lock);

set_current_oom_origin();
- err = try_to_unuse(type, false, 0); /* force all pages to be unused */
+ err = try_to_unuse(p->type, false, 0); /* force unuse all pages */
clear_current_oom_origin();

if (err) {
@@ -1926,7 +1899,7 @@ SYSCALL_DEFINE1(swapoff, const char __us
frontswap_map = frontswap_map_get(p);
spin_unlock(&p->lock);
spin_unlock(&swap_lock);
- frontswap_invalidate_area(type);
+ frontswap_invalidate_area(p->type);
frontswap_map_set(p, NULL);
mutex_unlock(&swapon_mutex);
free_percpu(p->percpu_cluster);
@@ -1935,7 +1908,7 @@ SYSCALL_DEFINE1(swapoff, const char __us
vfree(cluster_info);
vfree(frontswap_map);
/* Destroy swap account information */
- swap_cgroup_swapoff(type);
+ swap_cgroup_swapoff(p->type);

inode = mapping->host;
if (S_ISBLK(inode->i_mode)) {
@@ -2142,8 +2115,8 @@ static struct swap_info_struct *alloc_sw
*/
}
INIT_LIST_HEAD(&p->first_swap_extent.list);
+ INIT_LIST_HEAD(&p->list);
p->flags = SWP_USED;
- p->next = -1;
spin_unlock(&swap_lock);
spin_lock_init(&p->lock);


2014-10-07 23:33:02

by Greg Kroah-Hartman

[permalink] [raw]
Subject: [PATCH 3.14 15/37] jiffies: Fix timeval conversion to jiffies

3.14-stable review patch. If anyone has any objections, please let me know.

------------------

From: Andrew Hunter <[email protected]>

commit d78c9300c51d6ceed9f6d078d4e9366f259de28c upstream.

timeval_to_jiffies tried to round a timeval up to an integral number
of jiffies, but the logic for doing so was incorrect: intervals
corresponding to exactly N jiffies would become N+1. This manifested
itself particularly repeatedly stopping/starting an itimer:

setitimer(ITIMER_PROF, &val, NULL);
setitimer(ITIMER_PROF, NULL, &val);

would add a full tick to val, _even if it was exactly representable in
terms of jiffies_ (say, the result of a previous rounding.) Doing
this repeatedly would cause unbounded growth in val. So fix the math.

Here's what was wrong with the conversion: we essentially computed
(eliding seconds)

jiffies = usec * (NSEC_PER_USEC/TICK_NSEC)

by using scaling arithmetic, which took the best approximation of
NSEC_PER_USEC/TICK_NSEC with denominator of 2^USEC_JIFFIE_SC =
x/(2^USEC_JIFFIE_SC), and computed:

jiffies = (usec * x) >> USEC_JIFFIE_SC

and rounded this calculation up in the intermediate form (since we
can't necessarily exactly represent TICK_NSEC in usec.) But the
scaling arithmetic is a (very slight) *over*approximation of the true
value; that is, instead of dividing by (1 usec/ 1 jiffie), we
effectively divided by (1 usec/1 jiffie)-epsilon (rounding
down). This would normally be fine, but we want to round timeouts up,
and we did so by adding 2^USEC_JIFFIE_SC - 1 before the shift; this
would be fine if our division was exact, but dividing this by the
slightly smaller factor was equivalent to adding just _over_ 1 to the
final result (instead of just _under_ 1, as desired.)

In particular, with HZ=1000, we consistently computed that 10000 usec
was 11 jiffies; the same was true for any exact multiple of
TICK_NSEC.

We could possibly still round in the intermediate form, adding
something less than 2^USEC_JIFFIE_SC - 1, but easier still is to
convert usec->nsec, round in nanoseconds, and then convert using
time*spec*_to_jiffies. This adds one constant multiplication, and is
not observably slower in microbenchmarks on recent x86 hardware.

Tested: the following program:

int main() {
struct itimerval zero = {{0, 0}, {0, 0}};
/* Initially set to 10 ms. */
struct itimerval initial = zero;
initial.it_interval.tv_usec = 10000;
setitimer(ITIMER_PROF, &initial, NULL);
/* Save and restore several times. */
for (size_t i = 0; i < 10; ++i) {
struct itimerval prev;
setitimer(ITIMER_PROF, &zero, &prev);
/* on old kernels, this goes up by TICK_USEC every iteration */
printf("previous value: %ld %ld %ld %ld\n",
prev.it_interval.tv_sec, prev.it_interval.tv_usec,
prev.it_value.tv_sec, prev.it_value.tv_usec);
setitimer(ITIMER_PROF, &prev, NULL);
}
return 0;
}


Cc: Thomas Gleixner <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Paul Turner <[email protected]>
Cc: Richard Cochran <[email protected]>
Cc: Prarit Bhargava <[email protected]>
Reviewed-by: Paul Turner <[email protected]>
Reported-by: Aaron Jacobs <[email protected]>
Signed-off-by: Andrew Hunter <[email protected]>
[jstultz: Tweaked to apply to 3.17-rc]
Signed-off-by: John Stultz <[email protected]>
[bwh: Backported to 3.16: adjust filename]
Signed-off-by: Ben Hutchings <[email protected]>
Signed-off-by: Greg Kroah-Hartman <[email protected]>

---
include/linux/jiffies.h | 12 ----------
kernel/time.c | 54 ++++++++++++++++++++++++++----------------------
2 files changed, 30 insertions(+), 36 deletions(-)

--- a/include/linux/jiffies.h
+++ b/include/linux/jiffies.h
@@ -258,23 +258,11 @@ extern unsigned long preset_lpj;
#define SEC_JIFFIE_SC (32 - SHIFT_HZ)
#endif
#define NSEC_JIFFIE_SC (SEC_JIFFIE_SC + 29)
-#define USEC_JIFFIE_SC (SEC_JIFFIE_SC + 19)
#define SEC_CONVERSION ((unsigned long)((((u64)NSEC_PER_SEC << SEC_JIFFIE_SC) +\
TICK_NSEC -1) / (u64)TICK_NSEC))

#define NSEC_CONVERSION ((unsigned long)((((u64)1 << NSEC_JIFFIE_SC) +\
TICK_NSEC -1) / (u64)TICK_NSEC))
-#define USEC_CONVERSION \
- ((unsigned long)((((u64)NSEC_PER_USEC << USEC_JIFFIE_SC) +\
- TICK_NSEC -1) / (u64)TICK_NSEC))
-/*
- * USEC_ROUND is used in the timeval to jiffie conversion. See there
- * for more details. It is the scaled resolution rounding value. Note
- * that it is a 64-bit value. Since, when it is applied, we are already
- * in jiffies (albit scaled), it is nothing but the bits we will shift
- * off.
- */
-#define USEC_ROUND (u64)(((u64)1 << USEC_JIFFIE_SC) - 1)
/*
* The maximum jiffie value is (MAX_INT >> 1). Here we translate that
* into seconds. The 64-bit case will overflow if we are not careful,
--- a/kernel/time.c
+++ b/kernel/time.c
@@ -496,17 +496,20 @@ EXPORT_SYMBOL(usecs_to_jiffies);
* that a remainder subtract here would not do the right thing as the
* resolution values don't fall on second boundries. I.e. the line:
* nsec -= nsec % TICK_NSEC; is NOT a correct resolution rounding.
+ * Note that due to the small error in the multiplier here, this
+ * rounding is incorrect for sufficiently large values of tv_nsec, but
+ * well formed timespecs should have tv_nsec < NSEC_PER_SEC, so we're
+ * OK.
*
* Rather, we just shift the bits off the right.
*
* The >> (NSEC_JIFFIE_SC - SEC_JIFFIE_SC) converts the scaled nsec
* value to a scaled second value.
*/
-unsigned long
-timespec_to_jiffies(const struct timespec *value)
+static unsigned long
+__timespec_to_jiffies(unsigned long sec, long nsec)
{
- unsigned long sec = value->tv_sec;
- long nsec = value->tv_nsec + TICK_NSEC - 1;
+ nsec = nsec + TICK_NSEC - 1;

if (sec >= MAX_SEC_IN_JIFFIES){
sec = MAX_SEC_IN_JIFFIES;
@@ -517,6 +520,13 @@ timespec_to_jiffies(const struct timespe
(NSEC_JIFFIE_SC - SEC_JIFFIE_SC))) >> SEC_JIFFIE_SC;

}
+
+unsigned long
+timespec_to_jiffies(const struct timespec *value)
+{
+ return __timespec_to_jiffies(value->tv_sec, value->tv_nsec);
+}
+
EXPORT_SYMBOL(timespec_to_jiffies);

void
@@ -533,31 +543,27 @@ jiffies_to_timespec(const unsigned long
}
EXPORT_SYMBOL(jiffies_to_timespec);

-/* Same for "timeval"
+/*
+ * We could use a similar algorithm to timespec_to_jiffies (with a
+ * different multiplier for usec instead of nsec). But this has a
+ * problem with rounding: we can't exactly add TICK_NSEC - 1 to the
+ * usec value, since it's not necessarily integral.
+ *
+ * We could instead round in the intermediate scaled representation
+ * (i.e. in units of 1/2^(large scale) jiffies) but that's also
+ * perilous: the scaling introduces a small positive error, which
+ * combined with a division-rounding-upward (i.e. adding 2^(scale) - 1
+ * units to the intermediate before shifting) leads to accidental
+ * overflow and overestimates.
*
- * Well, almost. The problem here is that the real system resolution is
- * in nanoseconds and the value being converted is in micro seconds.
- * Also for some machines (those that use HZ = 1024, in-particular),
- * there is a LARGE error in the tick size in microseconds.
-
- * The solution we use is to do the rounding AFTER we convert the
- * microsecond part. Thus the USEC_ROUND, the bits to be shifted off.
- * Instruction wise, this should cost only an additional add with carry
- * instruction above the way it was done above.
+ * At the cost of one additional multiplication by a constant, just
+ * use the timespec implementation.
*/
unsigned long
timeval_to_jiffies(const struct timeval *value)
{
- unsigned long sec = value->tv_sec;
- long usec = value->tv_usec;
-
- if (sec >= MAX_SEC_IN_JIFFIES){
- sec = MAX_SEC_IN_JIFFIES;
- usec = 0;
- }
- return (((u64)sec * SEC_CONVERSION) +
- (((u64)usec * USEC_CONVERSION + USEC_ROUND) >>
- (USEC_JIFFIE_SC - SEC_JIFFIE_SC))) >> SEC_JIFFIE_SC;
+ return __timespec_to_jiffies(value->tv_sec,
+ value->tv_usec * NSEC_PER_USEC);
}
EXPORT_SYMBOL(timeval_to_jiffies);


2014-10-07 23:32:59

by Greg Kroah-Hartman

[permalink] [raw]
Subject: [PATCH 3.14 16/37] mm: exclude memoryless nodes from zone_reclaim

3.14-stable review patch. If anyone has any objections, please let me know.

------------------

From: Michal Hocko <[email protected]>

commit 70ef57e6c22c3323dce179b7d0d433c479266612 upstream.

We had a report about strange OOM killer strikes on a PPC machine
although there was a lot of swap free and a tons of anonymous memory
which could be swapped out. In the end it turned out that the OOM was a
side effect of zone reclaim which wasn't unmapping and swapping out and
so the system was pushed to the OOM. Although this sounds like a bug
somewhere in the kswapd vs. zone reclaim vs. direct reclaim
interaction numactl on the said hardware suggests that the zone reclaim
should not have been set in the first place:

node 0 cpus: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
node 0 size: 0 MB
node 0 free: 0 MB
node 2 cpus:
node 2 size: 7168 MB
node 2 free: 6019 MB
node distances:
node 0 2
0: 10 40
2: 40 10

So all the CPUs are associated with Node0 which doesn't have any memory
while Node2 contains all the available memory. Node distances cause an
automatic zone_reclaim_mode enabling.

Zone reclaim is intended to keep the allocations local but this doesn't
make any sense on the memoryless nodes. So let's exclude such nodes for
init_zone_allows_reclaim which evaluates zone reclaim behavior and
suitable reclaim_nodes.

Signed-off-by: Michal Hocko <[email protected]>
Acked-by: David Rientjes <[email protected]>
Acked-by: Nishanth Aravamudan <[email protected]>
Tested-by: Nishanth Aravamudan <[email protected]>
Acked-by: Mel Gorman <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
Signed-off-by: Greg Kroah-Hartman <[email protected]>

---
mm/page_alloc.c | 5 +++--
1 file changed, 3 insertions(+), 2 deletions(-)

--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1869,7 +1869,7 @@ static void __paginginit init_zone_allow
{
int i;

- for_each_online_node(i)
+ for_each_node_state(i, N_MEMORY)
if (node_distance(nid, i) <= RECLAIM_DISTANCE)
node_set(i, NODE_DATA(nid)->reclaim_nodes);
else
@@ -4933,7 +4933,8 @@ void __paginginit free_area_init_node(in

pgdat->node_id = nid;
pgdat->node_start_pfn = node_start_pfn;
- init_zone_allows_reclaim(nid);
+ if (node_state(nid, N_MEMORY))
+ init_zone_allows_reclaim(nid);
#ifdef CONFIG_HAVE_MEMBLOCK_NODE_MAP
get_pfn_range_for_nid(nid, &start_pfn, &end_pfn);
#endif

2014-10-07 23:20:00

by Greg Kroah-Hartman

[permalink] [raw]
Subject: [PATCH 3.14 05/37] md/raid5: disable DISCARD by default due to safety concerns.

3.14-stable review patch. If anyone has any objections, please let me know.

------------------

From: NeilBrown <[email protected]>

commit 8e0e99ba64c7ba46133a7c8a3e3f7de01f23bd93 upstream.

It has come to my attention (thanks Martin) that 'discard_zeroes_data'
is only a hint. Some devices in some cases don't do what it
says on the label.

The use of DISCARD in RAID5 depends on reads from discarded regions
being predictably zero. If a write to a previously discarded region
performs a read-modify-write cycle it assumes that the parity block
was consistent with the data blocks. If all were zero, this would
be the case. If some are and some aren't this would not be the case.
This could lead to data corruption after a device failure when
data needs to be reconstructed from the parity.

As we cannot trust 'discard_zeroes_data', ignore it by default
and so disallow DISCARD on all raid4/5/6 arrays.

As many devices are trustworthy, and as there are benefits to using
DISCARD, add a module parameter to over-ride this caution and cause
DISCARD to work if discard_zeroes_data is set.

If a site want to enable DISCARD on some arrays but not on others they
should select DISCARD support at the filesystem level, and set the
raid456 module parameter.
raid456.devices_handle_discard_safely=Y

As this is a data-safety issue, I believe this patch is suitable for
-stable.
DISCARD support for RAID456 was added in 3.7

Cc: Shaohua Li <[email protected]>
Cc: "Martin K. Petersen" <[email protected]>
Cc: Mike Snitzer <[email protected]>
Cc: Heinz Mauelshagen <[email protected]>
Acked-by: Martin K. Petersen <[email protected]>
Acked-by: Mike Snitzer <[email protected]>
Fixes: 620125f2bf8ff0c4969b79653b54d7bcc9d40637
Signed-off-by: NeilBrown <[email protected]>
Signed-off-by: Greg Kroah-Hartman <[email protected]>

---
drivers/md/raid5.c | 18 +++++++++++++++++-
1 file changed, 17 insertions(+), 1 deletion(-)

--- a/drivers/md/raid5.c
+++ b/drivers/md/raid5.c
@@ -64,6 +64,10 @@
#define cpu_to_group(cpu) cpu_to_node(cpu)
#define ANY_GROUP NUMA_NO_NODE

+static bool devices_handle_discard_safely = false;
+module_param(devices_handle_discard_safely, bool, 0644);
+MODULE_PARM_DESC(devices_handle_discard_safely,
+ "Set to Y if all devices in each array reliably return zeroes on reads from discarded regions");
static struct workqueue_struct *raid5_wq;
/*
* Stripe cache
@@ -6117,7 +6121,7 @@ static int run(struct mddev *mddev)
mddev->queue->limits.discard_granularity = stripe;
/*
* unaligned part of discard request will be ignored, so can't
- * guarantee discard_zerors_data
+ * guarantee discard_zeroes_data
*/
mddev->queue->limits.discard_zeroes_data = 0;

@@ -6142,6 +6146,18 @@ static int run(struct mddev *mddev)
!bdev_get_queue(rdev->bdev)->
limits.discard_zeroes_data)
discard_supported = false;
+ /* Unfortunately, discard_zeroes_data is not currently
+ * a guarantee - just a hint. So we only allow DISCARD
+ * if the sysadmin has confirmed that only safe devices
+ * are in use by setting a module parameter.
+ */
+ if (!devices_handle_discard_safely) {
+ if (discard_supported) {
+ pr_info("md/raid456: discard support disabled due to uncertainty.\n");
+ pr_info("Set raid456.devices_handle_discard_safely=Y to override.\n");
+ }
+ discard_supported = false;
+ }
}

if (discard_supported &&

2014-10-07 23:33:52

by Greg Kroah-Hartman

[permalink] [raw]
Subject: [PATCH 3.14 08/37] init/Kconfig: Fix HAVE_FUTEX_CMPXCHG to not break up the EXPERT menu

3.14-stable review patch. If anyone has any objections, please let me know.

------------------

From: Josh Triplett <[email protected]>

commit 62b4d2041117f35ab2409c9f5c4b8d3dc8e59d0f upstream.

commit 03b8c7b623c80af264c4c8d6111e5c6289933666 ("futex: Allow
architectures to skip futex_atomic_cmpxchg_inatomic() test") added the
HAVE_FUTEX_CMPXCHG symbol right below FUTEX. This placed it right in
the middle of the options for the EXPERT menu. However,
HAVE_FUTEX_CMPXCHG does not depend on EXPERT or FUTEX, so Kconfig stops
placing items in the EXPERT menu, and displays the remaining several
EXPERT items (starting with EPOLL) directly in the General Setup menu.

Since both users of HAVE_FUTEX_CMPXCHG only select it "if FUTEX", make
HAVE_FUTEX_CMPXCHG itself depend on FUTEX. With this change, the
subsequent items display as part of the EXPERT menu again; the EMBEDDED
menu now appears as the next top-level item in the General Setup menu,
which makes General Setup much shorter and more usable.

Signed-off-by: Josh Triplett <[email protected]>
Acked-by: Randy Dunlap <[email protected]>
Signed-off-by: Greg Kroah-Hartman <[email protected]>

---
init/Kconfig | 1 +
1 file changed, 1 insertion(+)

--- a/init/Kconfig
+++ b/init/Kconfig
@@ -1389,6 +1389,7 @@ config FUTEX

config HAVE_FUTEX_CMPXCHG
bool
+ depends on FUTEX
help
Architectures should select this if futex_atomic_cmpxchg_inatomic()
is implemented and always working. This removes a couple of runtime

2014-10-07 23:34:11

by Greg Kroah-Hartman

[permalink] [raw]
Subject: [PATCH 3.14 07/37] Fix problem recognizing symlinks

3.14-stable review patch. If anyone has any objections, please let me know.

------------------

From: Steve French <[email protected]>

commit 19e81573fca7b87ced7701e01ba164b968d929bd upstream.

Changeset eb85d94bd introduced a problem where if a cifs open
fails during query info of a file we
will still try to close the file (happens with certain types
of reparse points) even though the file handle is not valid.

In addition for SMB2/SMB3 we were not mapping the return code returned
by Windows when trying to open a file (like a Windows NFS symlink)
which is a reparse point.

Signed-off-by: Steve French <[email protected]>
Reviewed-by: Pavel Shilovsky <[email protected]>
Signed-off-by: Greg Kroah-Hartman <[email protected]>

---
fs/cifs/smb1ops.c | 2 +-
fs/cifs/smb2maperror.c | 2 ++
2 files changed, 3 insertions(+), 1 deletion(-)

--- a/fs/cifs/smb1ops.c
+++ b/fs/cifs/smb1ops.c
@@ -586,7 +586,7 @@ cifs_query_path_info(const unsigned int
tmprc = CIFS_open(xid, &oparms, &oplock, NULL);
if (tmprc == -EOPNOTSUPP)
*symlink = true;
- else
+ else if (tmprc == 0)
CIFSSMBClose(xid, tcon, fid.netfid);
}

--- a/fs/cifs/smb2maperror.c
+++ b/fs/cifs/smb2maperror.c
@@ -256,6 +256,8 @@ static const struct status_to_posix_erro
{STATUS_DLL_MIGHT_BE_INCOMPATIBLE, -EIO,
"STATUS_DLL_MIGHT_BE_INCOMPATIBLE"},
{STATUS_STOPPED_ON_SYMLINK, -EOPNOTSUPP, "STATUS_STOPPED_ON_SYMLINK"},
+ {STATUS_IO_REPARSE_TAG_NOT_HANDLED, -EOPNOTSUPP,
+ "STATUS_REPARSE_NOT_HANDLED"},
{STATUS_DEVICE_REQUIRES_CLEANING, -EIO,
"STATUS_DEVICE_REQUIRES_CLEANING"},
{STATUS_DEVICE_DOOR_OPEN, -EIO, "STATUS_DEVICE_DOOR_OPEN"},

2014-10-07 23:34:32

by Greg Kroah-Hartman

[permalink] [raw]
Subject: [PATCH 3.14 06/37] drm/i915: Flush the PTEs after updating them before suspend

3.14-stable review patch. If anyone has any objections, please let me know.

------------------

From: Chris Wilson <[email protected]>

commit 91e56499304f3d612053a9cf17f350868182c7d8 upstream.

As we use WC updates of the PTE, we are responsible for notifying the
hardware when to flush its TLBs. Do so after we zap all the PTEs before
suspend (and the BIOS tries to read our GTT).

Fixes a regression from

commit 828c79087cec61eaf4c76bb32c222fbe35ac3930
Author: Ben Widawsky <[email protected]>
Date: Wed Oct 16 09:21:30 2013 -0700

drm/i915: Disable GGTT PTEs on GEN6+ suspend

that survived and continue to cause harm even after

commit e568af1c626031925465a5caaab7cca1303d55c7
Author: Daniel Vetter <[email protected]>
Date: Wed Mar 26 20:08:20 2014 +0100

drm/i915: Undo gtt scratch pte unmapping again

v2: Trivial rebase.
v3: Fixes requires pointer dances.

Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=82340
Tested-by: [email protected]
Signed-off-by: Chris Wilson <[email protected]>
Cc: Takashi Iwai <[email protected]>
Cc: Paulo Zanoni <[email protected]>
Cc: Todd Previte <[email protected]>
Cc: Daniel Vetter <[email protected]>
Reviewed-by: Daniel Vetter <[email protected]>
Signed-off-by: Jani Nikula <[email protected]>
Signed-off-by: Greg Kroah-Hartman <[email protected]>

---
drivers/gpu/drm/i915/i915_gem_gtt.c | 14 +++++++++++++-
1 file changed, 13 insertions(+), 1 deletion(-)

--- a/drivers/gpu/drm/i915/i915_gem_gtt.c
+++ b/drivers/gpu/drm/i915/i915_gem_gtt.c
@@ -827,6 +827,16 @@ void i915_check_and_clear_faults(struct
POSTING_READ(RING_FAULT_REG(&dev_priv->ring[RCS]));
}

+static void i915_ggtt_flush(struct drm_i915_private *dev_priv)
+{
+ if (INTEL_INFO(dev_priv->dev)->gen < 6) {
+ intel_gtt_chipset_flush();
+ } else {
+ I915_WRITE(GFX_FLSH_CNTL_GEN6, GFX_FLSH_CNTL_EN);
+ POSTING_READ(GFX_FLSH_CNTL_GEN6);
+ }
+}
+
void i915_gem_suspend_gtt_mappings(struct drm_device *dev)
{
struct drm_i915_private *dev_priv = dev->dev_private;
@@ -843,6 +853,8 @@ void i915_gem_suspend_gtt_mappings(struc
dev_priv->gtt.base.start / PAGE_SIZE,
dev_priv->gtt.base.total / PAGE_SIZE,
true);
+
+ i915_ggtt_flush(dev_priv);
}

void i915_gem_restore_gtt_mappings(struct drm_device *dev)
@@ -863,7 +875,7 @@ void i915_gem_restore_gtt_mappings(struc
i915_gem_gtt_bind_object(obj, obj->cache_level);
}

- i915_gem_chipset_flush(dev);
+ i915_ggtt_flush(dev_priv);
}

int i915_gem_gtt_prepare_object(struct drm_i915_gem_object *obj)

2014-10-07 23:34:47

by Greg Kroah-Hartman

[permalink] [raw]
Subject: [PATCH 3.14 04/37] cpufreq: integrator: fix integrator_cpufreq_remove return type

3.14-stable review patch. If anyone has any objections, please let me know.

------------------

From: Arnd Bergmann <[email protected]>

commit d62dbf77f7dfaa6fb455b4b9828069a11965929c upstream.

When building this driver as a module, we get a helpful warning
about the return type:

drivers/cpufreq/integrator-cpufreq.c:232:2: warning: initialization from incompatible pointer type
.remove = __exit_p(integrator_cpufreq_remove),

If the remove callback returns void, the caller gets an undefined
value as it expects an integer to be returned. This fixes the
problem by passing down the value from cpufreq_unregister_driver.

Signed-off-by: Arnd Bergmann <[email protected]>
Signed-off-by: Rafael J. Wysocki <[email protected]>
Signed-off-by: Greg Kroah-Hartman <[email protected]>

---
drivers/cpufreq/integrator-cpufreq.c | 4 ++--
1 file changed, 2 insertions(+), 2 deletions(-)

--- a/drivers/cpufreq/integrator-cpufreq.c
+++ b/drivers/cpufreq/integrator-cpufreq.c
@@ -213,9 +213,9 @@ static int __init integrator_cpufreq_pro
return cpufreq_register_driver(&integrator_driver);
}

-static void __exit integrator_cpufreq_remove(struct platform_device *pdev)
+static int __exit integrator_cpufreq_remove(struct platform_device *pdev)
{
- cpufreq_unregister_driver(&integrator_driver);
+ return cpufreq_unregister_driver(&integrator_driver);
}

static const struct of_device_id integrator_cpufreq_match[] = {

2014-10-07 23:35:06

by Greg Kroah-Hartman

[permalink] [raw]
Subject: [PATCH 3.14 02/37] perf: fix perf bug in fork()

3.14-stable review patch. If anyone has any objections, please let me know.

------------------

From: Peter Zijlstra <[email protected]>

commit 6c72e3501d0d62fc064d3680e5234f3463ec5a86 upstream.

Oleg noticed that a cleanup by Sylvain actually uncovered a bug; by
calling perf_event_free_task() when failing sched_fork() we will not yet
have done the memset() on ->perf_event_ctxp[] and will therefore try and
'free' the inherited contexts, which are still in use by the parent
process. This is bad..

Suggested-by: Oleg Nesterov <[email protected]>
Reported-by: Oleg Nesterov <[email protected]>
Reported-by: Sylvain 'ythier' Hitier <[email protected]>
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Cc: Ingo Molnar <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
Signed-off-by: Greg Kroah-Hartman <[email protected]>

---
kernel/events/core.c | 4 +++-
kernel/fork.c | 5 +++--
2 files changed, 6 insertions(+), 3 deletions(-)

--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -7836,8 +7836,10 @@ int perf_event_init_task(struct task_str

for_each_task_context_nr(ctxn) {
ret = perf_event_init_context(child, ctxn);
- if (ret)
+ if (ret) {
+ perf_event_free_task(child);
return ret;
+ }
}

return 0;
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -1323,7 +1323,7 @@ static struct task_struct *copy_process(
goto bad_fork_cleanup_policy;
retval = audit_alloc(p);
if (retval)
- goto bad_fork_cleanup_policy;
+ goto bad_fork_cleanup_perf;
/* copy all the process information */
retval = copy_semundo(clone_flags, p);
if (retval)
@@ -1522,8 +1522,9 @@ bad_fork_cleanup_semundo:
exit_sem(p);
bad_fork_cleanup_audit:
audit_free(p);
-bad_fork_cleanup_policy:
+bad_fork_cleanup_perf:
perf_event_free_task(p);
+bad_fork_cleanup_policy:
#ifdef CONFIG_NUMA
mpol_put(p->mempolicy);
bad_fork_cleanup_cgroup:

2014-10-07 23:35:03

by Greg Kroah-Hartman

[permalink] [raw]
Subject: [PATCH 3.14 03/37] mm: migrate: Close race between migration completion and mprotect

3.14-stable review patch. If anyone has any objections, please let me know.

------------------

From: Mel Gorman <[email protected]>

commit d3cb8bf6081b8b7a2dabb1264fe968fd870fa595 upstream.

A migration entry is marked as write if pte_write was true at the time the
entry was created. The VMA protections are not double checked when migration
entries are being removed as mprotect marks write-migration-entries as
read. It means that potentially we take a spurious fault to mark PTEs write
again but it's straight-forward. However, there is a race between write
migrations being marked read and migrations finishing. This potentially
allows a PTE to be write that should have been read. Close this race by
double checking the VMA permissions using maybe_mkwrite when migration
completes.

[[email protected]: use maybe_mkwrite]
Signed-off-by: Mel Gorman <[email protected]>
Acked-by: Rik van Riel <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
Signed-off-by: Greg Kroah-Hartman <[email protected]>

---
mm/migrate.c | 5 ++++-
1 file changed, 4 insertions(+), 1 deletion(-)

--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -148,8 +148,11 @@ static int remove_migration_pte(struct p
pte = pte_mkold(mk_pte(new, vma->vm_page_prot));
if (pte_swp_soft_dirty(*ptep))
pte = pte_mksoft_dirty(pte);
+
+ /* Recheck VMA as permissions can change since migration started */
if (is_write_migration_entry(entry))
- pte = pte_mkwrite(pte);
+ pte = maybe_mkwrite(pte, vma);
+
#ifdef CONFIG_HUGETLB_PAGE
if (PageHuge(new)) {
pte = pte_mkhuge(pte);

2014-10-07 23:36:06

by Greg Kroah-Hartman

[permalink] [raw]
Subject: [PATCH 3.14 01/37] udf: Avoid infinite loop when processing indirect ICBs

3.14-stable review patch. If anyone has any objections, please let me know.

------------------

From: Jan Kara <[email protected]>

commit c03aa9f6e1f938618e6db2e23afef0574efeeb65 upstream.

We did not implement any bound on number of indirect ICBs we follow when
loading inode. Thus corrupted medium could cause kernel to go into an
infinite loop, possibly causing a stack overflow.

Fix the possible stack overflow by removing recursion from
__udf_read_inode() and limit number of indirect ICBs we follow to avoid
infinite loops.

Signed-off-by: Jan Kara <[email protected]>
Cc: Chuck Ebbert <[email protected]>
Signed-off-by: Greg Kroah-Hartman <[email protected]>

---
fs/udf/inode.c | 35 +++++++++++++++++++++--------------
1 file changed, 21 insertions(+), 14 deletions(-)

--- a/fs/udf/inode.c
+++ b/fs/udf/inode.c
@@ -1271,13 +1271,22 @@ update_time:
return 0;
}

+/*
+ * Maximum length of linked list formed by ICB hierarchy. The chosen number is
+ * arbitrary - just that we hopefully don't limit any real use of rewritten
+ * inode on write-once media but avoid looping for too long on corrupted media.
+ */
+#define UDF_MAX_ICB_NESTING 1024
+
static void __udf_read_inode(struct inode *inode)
{
struct buffer_head *bh = NULL;
struct fileEntry *fe;
uint16_t ident;
struct udf_inode_info *iinfo = UDF_I(inode);
+ unsigned int indirections = 0;

+reread:
/*
* Set defaults, but the inode is still incomplete!
* Note: get_new_inode() sets the following on a new inode:
@@ -1314,28 +1323,26 @@ static void __udf_read_inode(struct inod
ibh = udf_read_ptagged(inode->i_sb, &iinfo->i_location, 1,
&ident);
if (ident == TAG_IDENT_IE && ibh) {
- struct buffer_head *nbh = NULL;
struct kernel_lb_addr loc;
struct indirectEntry *ie;

ie = (struct indirectEntry *)ibh->b_data;
loc = lelb_to_cpu(ie->indirectICB.extLocation);

- if (ie->indirectICB.extLength &&
- (nbh = udf_read_ptagged(inode->i_sb, &loc, 0,
- &ident))) {
- if (ident == TAG_IDENT_FE ||
- ident == TAG_IDENT_EFE) {
- memcpy(&iinfo->i_location,
- &loc,
- sizeof(struct kernel_lb_addr));
- brelse(bh);
- brelse(ibh);
- brelse(nbh);
- __udf_read_inode(inode);
+ if (ie->indirectICB.extLength) {
+ brelse(bh);
+ brelse(ibh);
+ memcpy(&iinfo->i_location, &loc,
+ sizeof(struct kernel_lb_addr));
+ if (++indirections > UDF_MAX_ICB_NESTING) {
+ udf_err(inode->i_sb,
+ "too many ICBs in ICB hierarchy"
+ " (max %d supported)\n",
+ UDF_MAX_ICB_NESTING);
+ make_bad_inode(inode);
return;
}
- brelse(nbh);
+ goto reread;
}
}
brelse(ibh);

2014-10-08 02:48:14

by Guenter Roeck

[permalink] [raw]
Subject: Re: [PATCH 3.14 00/37] 3.14.21-stable review

On 10/07/2014 04:19 PM, Greg Kroah-Hartman wrote:
> This is the start of the stable review cycle for the 3.14.21 release.
> There are 37 patches in this series, all will be posted as a response
> to this one. If anyone has any issues with these being applied, please
> let me know.
>
> Responses should be made by Thu Oct 9 23:18:16 UTC 2014.
> Anything received after that time might be too late.
>

Build results:
total: 137 pass: 137 fail: 0
Qemu tests:
total: 30 pass: 30 fail: 0

Details are available at http://server.roeck-us.net:8010/builders.

Guenter

2014-10-08 20:06:01

by Shuah Khan

[permalink] [raw]
Subject: Re: [PATCH 3.14 00/37] 3.14.21-stable review

On 10/07/2014 05:19 PM, Greg Kroah-Hartman wrote:
> This is the start of the stable review cycle for the 3.14.21 release.
> There are 37 patches in this series, all will be posted as a response
> to this one. If anyone has any issues with these being applied, please
> let me know.
>
> Responses should be made by Thu Oct 9 23:18:16 UTC 2014.
> Anything received after that time might be too late.
>
> The whole patch series can be found in one patch at:
> kernel.org/pub/linux/kernel/v3.0/stable-review/patch-3.14.21-rc1.gz
> and the diffstat can be found below.
>
> thanks,
>
> greg k-h
>

Compiled and booted on my test system. No dmesg regressions.

-- Shuah

--
Shuah Khan
Sr. Linux Kernel Developer
Samsung Research America (Silicon Valley)
[email protected] | (970) 217-8978