Changelog since V7
o Rebase to 3.3-rc2
o Take greater care propagating page->pfmemalloc to skb
o Propagate pfmemalloc from netdev_alloc_page to skb where possible
o Release RCU lock properly on preempt kernel
Changelog since V6
o Rebase to 3.1-rc8
o Use wake_up instead of wake_up_interruptible()
o Do not throttle kernel threads
o Avoid a potential race between kswapd going to sleep and processes being
throttled
Changelog since V5
o Rebase to 3.1-rc5
Changelog since V4
o Update comment clarifying what protocols can be used (Michal)
o Rebase to 3.0-rc3
Changelog since V3
o Propagate pfmemalloc from packet fragment pages to skb (Neil)
o Rebase to 3.0-rc2
Changelog since V2
o Document that __GFP_NOMEMALLOC overrides __GFP_MEMALLOC (Neil)
o Use wait_event_interruptible (Neil)
o Use !! when casting to bool to avoid any possibility of type
truncation (Neil)
o Nicer logic when using skb_pfmemalloc_protocol (Neil)
Changelog since V1
o Rebase on top of mmotm
o Use atomic_t for memalloc_socks (David Miller)
o Remove use of sk_memalloc_socks in vmscan (Neil Brown)
o Check throttle within prepare_to_wait (Neil Brown)
o Add statistics on throttling instead of printk
When a user or administrator requires swap for their application, they
create a swap partition or file, format it with mkswap and activate it
with swapon. Swap over the network is considered an option in diskless
systems. The two likely scenarios are blade servers used as part of a
cluster, where the form factor or maintenance costs do not allow the
use of disks, and thin clients.
The Linux Terminal Server Project recommends the use of the
Network Block Device (NBD) for swap according to the manual at
https://sourceforge.net/projects/ltsp/files/Docs-Admin-Guide/LTSPManual.pdf/download
Documentation and tutorials on how to set up swap over NBD are also
available at places like https://help.ubuntu.com/community/UbuntuLTSP/EnableNBDSWAP
and nbd-client itself documents the use of NBD as swap. Despite this, a
machine using NBD for swap can deadlock within minutes if swap is used
intensively. This patch series addresses the problem.
The core issue is that network block devices do not use mempools as
normal block devices do. As the host cannot control where packets are
received from, it cannot reliably work out in advance how much memory
it might need.
Some years ago, Peter Zijlstra developed a series of patches that
supported swap over NFS, which some distributions are carrying in their
kernels. This patch series borrows very heavily from Peter's work to
support swapping over NBD as a prerequisite to supporting swap-over-NFS.
The bulk of the complexity is concerned with preserving memory that is
allocated from the PFMEMALLOC reserves for use by the network layer,
which is needed for both NBD and NFS.
Patch 1 serialises access to min_free_kbytes. It's not strictly needed
by this series but as the series cares about watermarks in
general, it's a harmless fix. It could be merged independently.
Patch 2 adds knowledge of the PFMEMALLOC reserves to SLAB and SLUB to
preserve access to pages allocated under low memory situations
to callers that are freeing memory.
Patch 3 introduces __GFP_MEMALLOC to allow access to the PFMEMALLOC
reserves without setting PFMEMALLOC.
Patch 4 opens the possibility for softirqs to use PFMEMALLOC reserves
for later use by network packet processing.
Patch 5 ignores memory policies when ALLOC_NO_WATERMARKS is set.
Patches 6-11 allow network processing to use PFMEMALLOC reserves when
the socket has been marked as being used by the VM to clean
pages. If packets are received and stored in pages that were
allocated under low-memory situations and are unrelated to
the VM, the packets are dropped.
Patch 12 is a micro-optimisation to avoid a function call in the
common case.
Patch 13 tags NBD sockets as being SOCK_MEMALLOC so they can use
PFMEMALLOC if necessary.
Patch 14 notes that it is still possible for the PFMEMALLOC reserve
to be depleted. To prevent this, direct reclaimers get
throttled on a waitqueue if 50% of the PFMEMALLOC reserves are
depleted. It is expected that kswapd and the direct reclaimers
already running will clean enough pages for the low watermark
to be reached and the throttled processes are woken up.
Patch 15 adds a statistic to track how often processes get throttled.
Some basic performance testing was run using kernel builds, netperf
on loopback for UDP and TCP, hackbench (pipes and sockets), iozone
and sysbench. Each of them was expected to use the sl*b allocators
reasonably heavily, but there did not appear to be any significant
performance variance.
For testing swap-over-NBD, a machine was booted with 2G of RAM and a
swapfile backed by NBD. 8*NUM_CPU processes were started that create
anonymous memory mappings and read them linearly in a loop. The total
size of the mappings was 4*PHYSICAL_MEMORY, to use swap heavily under
memory pressure. Without the patches, the machine locks up within
minutes; with them applied, the test runs to completion.
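For reference, the per-process workload is roughly the following sketch
(illustrative only; memory_hog(), the mapping size and the fill pattern
are placeholders, the actual test harness is not part of this series):

	#include <stdlib.h>
	#include <string.h>
	#include <sys/mman.h>

	/* One test child: dirty a large anonymous mapping once, then read
	 * it linearly forever so memory pressure forces pages to be
	 * swapped out and back in over NBD.
	 */
	static void memory_hog(size_t mapping_size)
	{
		unsigned char *buf;
		size_t i;

		buf = mmap(NULL, mapping_size, PROT_READ | PROT_WRITE,
			   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
		if (buf == MAP_FAILED)
			exit(1);

		/* Write once so the pages are real, swappable anonymous memory */
		memset(buf, 0xaa, mapping_size);

		for (;;)
			for (i = 0; i < mapping_size; i += 4096)
				if (buf[i] != 0xaa)
					exit(1);
	}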
drivers/block/nbd.c | 6 +-
drivers/net/ethernet/chelsio/cxgb4/sge.c | 2 +-
drivers/net/ethernet/chelsio/cxgb4vf/sge.c | 2 +-
drivers/net/ethernet/intel/igb/igb_main.c | 2 +-
drivers/net/ethernet/intel/ixgbe/ixgbe_main.c | 2 +-
drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c | 3 +-
drivers/net/usb/cdc-phonet.c | 2 +-
drivers/usb/gadget/f_phonet.c | 2 +-
include/linux/gfp.h | 13 +-
include/linux/mm_types.h | 9 +
include/linux/mmzone.h | 1 +
include/linux/sched.h | 7 +
include/linux/skbuff.h | 66 ++++++-
include/linux/slub_def.h | 1 +
include/linux/vm_event_item.h | 1 +
include/net/sock.h | 19 ++
include/trace/events/gfpflags.h | 1 +
kernel/softirq.c | 3 +
mm/page_alloc.c | 57 ++++-
mm/slab.c | 235 ++++++++++++++++++---
mm/slub.c | 36 +++-
mm/vmscan.c | 72 +++++++
mm/vmstat.c | 1 +
net/core/dev.c | 52 ++++-
net/core/filter.c | 8 +
net/core/skbuff.c | 93 +++++++--
net/core/sock.c | 42 ++++
net/ipv4/tcp.c | 3 +-
net/ipv4/tcp_output.c | 16 +-
net/ipv6/tcp_ipv6.c | 12 +-
30 files changed, 675 insertions(+), 94 deletions(-)
--
1.7.3.4
Allow specific sockets to be tagged SOCK_MEMALLOC and use
__GFP_MEMALLOC for their allocations. These sockets will be able to go
below watermarks and allocate from the emergency reserve. Such sockets
are to be used to service the VM (in other words, to swap over). They
must be handled kernel-side; exposing such a socket to user-space is a
bug. There is a risk that the reserves get depleted, so for now the
administrator is responsible for increasing min_free_kbytes as
necessary to prevent deadlock for their workloads.
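For a kernel-side user, the expected usage is roughly the sketch below
(the wrapper function names are illustrative, not part of this patch):

	/* Illustrative: tag a kernel-side socket that the VM will swap over */
	static void example_enable_swap_socket(struct socket *sock)
	{
		/* Sets SOCK_MEMALLOC and adds __GFP_MEMALLOC to sk_allocation */
		sk_set_memalloc(sock->sk);
	}

	static void example_disable_swap_socket(struct socket *sock)
	{
		/* Stop dipping into the emergency reserves */
		sk_clear_memalloc(sock->sk);
	}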
[[email protected]: Original patches]
Signed-off-by: Mel Gorman <[email protected]>
---
include/net/sock.h | 5 ++++-
net/core/sock.c | 22 ++++++++++++++++++++++
2 files changed, 26 insertions(+), 1 deletions(-)
diff --git a/include/net/sock.h b/include/net/sock.h
index a76e858..82b2148 100644
--- a/include/net/sock.h
+++ b/include/net/sock.h
@@ -579,6 +579,7 @@ enum sock_flags {
SOCK_RCVTSTAMPNS, /* %SO_TIMESTAMPNS setting */
SOCK_LOCALROUTE, /* route locally only, %SO_DONTROUTE setting */
SOCK_QUEUE_SHRUNK, /* write queue has been shrunk recently */
+ SOCK_MEMALLOC, /* VM depends on this socket for swapping */
SOCK_TIMESTAMPING_TX_HARDWARE, /* %SOF_TIMESTAMPING_TX_HARDWARE */
SOCK_TIMESTAMPING_TX_SOFTWARE, /* %SOF_TIMESTAMPING_TX_SOFTWARE */
SOCK_TIMESTAMPING_RX_HARDWARE, /* %SOF_TIMESTAMPING_RX_HARDWARE */
@@ -614,7 +615,7 @@ static inline int sock_flag(struct sock *sk, enum sock_flags flag)
static inline gfp_t sk_allocation(struct sock *sk, gfp_t gfp_mask)
{
- return gfp_mask;
+ return gfp_mask | (sk->sk_allocation & __GFP_MEMALLOC);
}
static inline void sk_acceptq_removed(struct sock *sk)
@@ -755,6 +756,8 @@ extern int sk_stream_wait_memory(struct sock *sk, long *timeo_p);
extern void sk_stream_wait_close(struct sock *sk, long timeo_p);
extern int sk_stream_error(struct sock *sk, int flags, int err);
extern void sk_stream_kill_queues(struct sock *sk);
+extern void sk_set_memalloc(struct sock *sk);
+extern void sk_clear_memalloc(struct sock *sk);
extern int sk_wait_data(struct sock *sk, long *timeo);
diff --git a/net/core/sock.c b/net/core/sock.c
index 3e81fd2..03069e0 100644
--- a/net/core/sock.c
+++ b/net/core/sock.c
@@ -267,6 +267,28 @@ __u32 sysctl_rmem_default __read_mostly = SK_RMEM_MAX;
int sysctl_optmem_max __read_mostly = sizeof(unsigned long)*(2*UIO_MAXIOV+512);
EXPORT_SYMBOL(sysctl_optmem_max);
+/**
+ * sk_set_memalloc - sets %SOCK_MEMALLOC
+ * @sk: socket to set it on
+ *
+ * Set %SOCK_MEMALLOC on a socket for access to emergency reserves.
+ * It's the responsibility of the admin to adjust min_free_kbytes
+ * to meet the requirements
+ */
+void sk_set_memalloc(struct sock *sk)
+{
+ sock_set_flag(sk, SOCK_MEMALLOC);
+ sk->sk_allocation |= __GFP_MEMALLOC;
+}
+EXPORT_SYMBOL_GPL(sk_set_memalloc);
+
+void sk_clear_memalloc(struct sock *sk)
+{
+ sock_reset_flag(sk, SOCK_MEMALLOC);
+ sk->sk_allocation &= ~__GFP_MEMALLOC;
+}
+EXPORT_SYMBOL_GPL(sk_clear_memalloc);
+
#if defined(CONFIG_CGROUPS)
#if !defined(CONFIG_NET_CLS_CGROUP)
int net_cls_subsys_id = -1;
--
1.7.3.4
Getting and putting objects in SLAB currently requires a function call,
but the bulk of the work is related to the PFMEMALLOC reserves, which
are only consumed when network-backed storage is critical. Use an inline
function to determine if the function call is required.
Signed-off-by: Mel Gorman <[email protected]>
---
mm/slab.c | 28 ++++++++++++++++++++++++++--
1 files changed, 26 insertions(+), 2 deletions(-)
diff --git a/mm/slab.c b/mm/slab.c
index 95c96b9..268cd96 100644
--- a/mm/slab.c
+++ b/mm/slab.c
@@ -117,6 +117,8 @@
#include <linux/memory.h>
#include <linux/prefetch.h>
+#include <net/sock.h>
+
#include <asm/cacheflush.h>
#include <asm/tlbflush.h>
#include <asm/page.h>
@@ -1001,7 +1003,7 @@ static void check_ac_pfmemalloc(struct kmem_cache *cachep,
ac->pfmemalloc = false;
}
-static void *ac_get_obj(struct kmem_cache *cachep, struct array_cache *ac,
+static void *__ac_get_obj(struct kmem_cache *cachep, struct array_cache *ac,
gfp_t flags, bool force_refill)
{
int i;
@@ -1048,7 +1050,20 @@ static void *ac_get_obj(struct kmem_cache *cachep, struct array_cache *ac,
return objp;
}
-static void ac_put_obj(struct kmem_cache *cachep, struct array_cache *ac,
+static inline void *ac_get_obj(struct kmem_cache *cachep,
+ struct array_cache *ac, gfp_t flags, bool force_refill)
+{
+ void *objp;
+
+ if (unlikely(sk_memalloc_socks()))
+ objp = __ac_get_obj(cachep, ac, flags, force_refill);
+ else
+ objp = ac->entry[--ac->avail];
+
+ return objp;
+}
+
+static void *__ac_put_obj(struct kmem_cache *cachep, struct array_cache *ac,
void *objp)
{
struct slab *slabp;
@@ -1061,6 +1076,15 @@ static void ac_put_obj(struct kmem_cache *cachep, struct array_cache *ac,
set_obj_pfmemalloc(&objp);
}
+ return objp;
+}
+
+static inline void ac_put_obj(struct kmem_cache *cachep, struct array_cache *ac,
+ void *objp)
+{
+ if (unlikely(sk_memalloc_socks()))
+ objp = __ac_put_obj(cachep, ac, objp);
+
ac->entry[ac->avail++] = objp;
}
--
1.7.3.4
Under significant pressure when writing back to network-backed storage,
direct reclaimers may get throttled. This is expected to be a
short-lived event and the processes get woken up again, but they do
stall in the meantime. This patch counts how many times such stalling
occurs. It is up to the administrator whether to reduce these stalls
by increasing min_free_kbytes.
Signed-off-by: Mel Gorman <[email protected]>
---
include/linux/vm_event_item.h | 1 +
mm/vmscan.c | 1 +
mm/vmstat.c | 1 +
3 files changed, 3 insertions(+), 0 deletions(-)
diff --git a/include/linux/vm_event_item.h b/include/linux/vm_event_item.h
index 03b90cdc..652e5f3 100644
--- a/include/linux/vm_event_item.h
+++ b/include/linux/vm_event_item.h
@@ -29,6 +29,7 @@ enum vm_event_item { PGPGIN, PGPGOUT, PSWPIN, PSWPOUT,
FOR_ALL_ZONES(PGSTEAL),
FOR_ALL_ZONES(PGSCAN_KSWAPD),
FOR_ALL_ZONES(PGSCAN_DIRECT),
+ PGSCAN_DIRECT_THROTTLE,
#ifdef CONFIG_NUMA
PGSCAN_ZONE_RECLAIM_FAILED,
#endif
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 87ae4314..7a43551 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -2464,6 +2464,7 @@ static void throttle_direct_reclaim(gfp_t gfp_mask, struct zonelist *zonelist,
return;
/* Throttle */
+ count_vm_event(PGSCAN_DIRECT_THROTTLE);
wait_event_killable(zone->zone_pgdat->pfmemalloc_wait,
pfmemalloc_watermark_ok(zone->zone_pgdat));
}
diff --git a/mm/vmstat.c b/mm/vmstat.c
index f600557..0fff13d 100644
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -741,6 +741,7 @@ const char * const vmstat_text[] = {
TEXTS_FOR_ZONES("pgsteal")
TEXTS_FOR_ZONES("pgscan_kswapd")
TEXTS_FOR_ZONES("pgscan_direct")
+ "pgscan_direct_throttle",
#ifdef CONFIG_NUMA
"zone_reclaim_failed",
--
1.7.3.4
If swap is backed by network storage such as NBD, there is a risk
that a large number of reclaimers can hang the system by consuming
all the PFMEMALLOC reserves. To avoid these hangs, the administrator
must tune min_free_kbytes in advance. This patch throttles direct
reclaimers if half of the PFMEMALLOC reserves are in use, as the
system is then at risk of hanging.
Signed-off-by: Mel Gorman <[email protected]>
---
include/linux/mmzone.h | 1 +
mm/page_alloc.c | 1 +
mm/vmscan.c | 71 ++++++++++++++++++++++++++++++++++++++++++++++++
3 files changed, 73 insertions(+), 0 deletions(-)
diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 650ba2f..eea0398 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -662,6 +662,7 @@ typedef struct pglist_data {
range, including holes */
int node_id;
wait_queue_head_t kswapd_wait;
+ wait_queue_head_t pfmemalloc_wait;
struct task_struct *kswapd;
int kswapd_max_order;
enum zone_type classzone_idx;
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 7c26962..01138a6 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -4281,6 +4281,7 @@ static void __paginginit free_area_init_core(struct pglist_data *pgdat,
pgdat_resize_init(pgdat);
pgdat->nr_zones = 0;
init_waitqueue_head(&pgdat->kswapd_wait);
+ init_waitqueue_head(&pgdat->pfmemalloc_wait);
pgdat->kswapd_max_order = 0;
pgdat_page_cgroup_init(pgdat);
diff --git a/mm/vmscan.c b/mm/vmscan.c
index c52b235..87ae4314 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -2425,6 +2425,49 @@ out:
return 0;
}
+static bool pfmemalloc_watermark_ok(pg_data_t *pgdat)
+{
+ struct zone *zone;
+ unsigned long pfmemalloc_reserve = 0;
+ unsigned long free_pages = 0;
+ int i;
+
+ for (i = 0; i <= ZONE_NORMAL; i++) {
+ zone = &pgdat->node_zones[i];
+ pfmemalloc_reserve += min_wmark_pages(zone);
+ free_pages += zone_page_state(zone, NR_FREE_PAGES);
+ }
+
+ return (free_pages > pfmemalloc_reserve / 2) ? true : false;
+}
+
+/*
+ * Throttle direct reclaimers if backing storage is backed by the network
+ * and the PFMEMALLOC reserve for the preferred node is getting dangerously
+ * depleted. kswapd will continue to make progress and wake the processes
+ * when the low watermark is reached
+ */
+static void throttle_direct_reclaim(gfp_t gfp_mask, struct zonelist *zonelist,
+ nodemask_t *nodemask)
+{
+ struct zone *zone;
+ int high_zoneidx = gfp_zone(gfp_mask);
+ DEFINE_WAIT(wait);
+
+ /* Kernel threads such as kjournald should not be throttled */
+ if (current->flags & PF_KTHREAD)
+ return;
+
+ /* Check if the pfmemalloc reserves are ok */
+ first_zones_zonelist(zonelist, high_zoneidx, NULL, &zone);
+ if (pfmemalloc_watermark_ok(zone->zone_pgdat))
+ return;
+
+ /* Throttle */
+ wait_event_killable(zone->zone_pgdat->pfmemalloc_wait,
+ pfmemalloc_watermark_ok(zone->zone_pgdat));
+}
+
unsigned long try_to_free_pages(struct zonelist *zonelist, int order,
gfp_t gfp_mask, nodemask_t *nodemask)
{
@@ -2443,6 +2486,15 @@ unsigned long try_to_free_pages(struct zonelist *zonelist, int order,
.gfp_mask = sc.gfp_mask,
};
+ throttle_direct_reclaim(gfp_mask, zonelist, nodemask);
+
+ /*
+ * Do not enter reclaim if fatal signal is pending. 1 is returned so
+ * that the page allocator does not consider triggering OOM
+ */
+ if (fatal_signal_pending(current))
+ return 1;
+
trace_mm_vmscan_direct_reclaim_begin(order,
sc.may_writepage,
gfp_mask);
@@ -2840,6 +2892,12 @@ loop_again:
}
}
+
+ /* Wake throttled direct reclaimers if low watermark is met */
+ if (waitqueue_active(&pgdat->pfmemalloc_wait) &&
+ pfmemalloc_watermark_ok(pgdat))
+ wake_up(&pgdat->pfmemalloc_wait);
+
if (all_zones_ok || (order && pgdat_balanced(pgdat, balanced, *classzone_idx)))
break; /* kswapd: all done */
/*
@@ -2961,6 +3019,19 @@ static void kswapd_try_to_sleep(pg_data_t *pgdat, int order, int classzone_idx)
trace_mm_vmscan_kswapd_sleep(pgdat->node_id);
/*
+ * There is a potential race between when kswapd checks its
+ * watermarks and a process gets throttled. There is also
+ * a potential race if processes get throttled, kswapd wakes,
+ * a large process exits, thereby balancing the zones, which causes
+ * kswapd to miss a wakeup. If kswapd is going to sleep, no
+ * process should be sleeping on pfmemalloc_wait so wake them
+ * now if necessary. If necessary, processes will wake kswapd
+ * and get throttled again
+ */
+ if (waitqueue_active(&pgdat->pfmemalloc_wait))
+ wake_up(&pgdat->pfmemalloc_wait);
+
+ /*
* vmstat counters are not perfectly accurate and the estimated
* value for counters such as NR_FREE_PAGES can deviate from the
* true value by nr_online_cpus * threshold. To avoid the zone
--
1.7.3.4
The skb->pfmemalloc flag gets set to true only if the PFMEMALLOC
reserves were used during the slab allocation of data in __alloc_skb.
If page splitting is used, it is possible that pages will be allocated
from the PFMEMALLOC reserve without propagating this information to
the skb. This patch propagates page->pfmemalloc from pages allocated
for fragments to the skb.
It works by reintroducing and expanding the netdev_alloc_page() API
to take an skb. If the page was allocated from the pfmemalloc reserves,
the flag is automatically copied to the skb. If the driver allocates
the page before the skb, it should call propagate_pfmemalloc_skb()
after the skb is allocated to ensure the flag is copied properly.
Failure to do so is not critical. The resulting driver may perform
slower if it is used for swap-over-NBD or swap-over-NFS but it should
not result in failure.
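A driver receive path would then use the helpers in one of two ways,
depending on whether the skb already exists when the page is allocated.
A rough sketch follows; the surrounding ring-buffer handling, netdev and
buflen are driver specific and only illustrative here:

	/* skb already allocated: pass it so page->pfmemalloc is propagated */
	page = __netdev_alloc_page(GFP_ATOMIC, skb);

	/* page allocated first: propagate once the skb has been built */
	page = __netdev_alloc_page(GFP_ATOMIC, NULL);
	skb = netdev_alloc_skb_ip_align(netdev, buflen);
	if (page && skb)
		propagate_pfmemalloc_skb(page, skb);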
Signed-off-by: Mel Gorman <[email protected]>
---
drivers/net/ethernet/chelsio/cxgb4/sge.c | 2 +-
drivers/net/ethernet/chelsio/cxgb4vf/sge.c | 2 +-
drivers/net/ethernet/intel/igb/igb_main.c | 2 +-
drivers/net/ethernet/intel/ixgbe/ixgbe_main.c | 2 +-
drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c | 3 +-
drivers/net/usb/cdc-phonet.c | 2 +-
drivers/usb/gadget/f_phonet.c | 2 +-
include/linux/skbuff.h | 38 +++++++++++++++++++++
8 files changed, 46 insertions(+), 7 deletions(-)
diff --git a/drivers/net/ethernet/chelsio/cxgb4/sge.c b/drivers/net/ethernet/chelsio/cxgb4/sge.c
index 2dae795..05f02b3 100644
--- a/drivers/net/ethernet/chelsio/cxgb4/sge.c
+++ b/drivers/net/ethernet/chelsio/cxgb4/sge.c
@@ -528,7 +528,7 @@ static unsigned int refill_fl(struct adapter *adap, struct sge_fl *q, int n,
#endif
while (n--) {
- pg = alloc_page(gfp);
+ pg = __netdev_alloc_page(gfp, NULL);
if (unlikely(!pg)) {
q->alloc_failed++;
break;
diff --git a/drivers/net/ethernet/chelsio/cxgb4vf/sge.c b/drivers/net/ethernet/chelsio/cxgb4vf/sge.c
index 0bd585b..e8a372e 100644
--- a/drivers/net/ethernet/chelsio/cxgb4vf/sge.c
+++ b/drivers/net/ethernet/chelsio/cxgb4vf/sge.c
@@ -653,7 +653,7 @@ static unsigned int refill_fl(struct adapter *adapter, struct sge_fl *fl,
alloc_small_pages:
while (n--) {
- page = alloc_page(gfp | __GFP_NOWARN | __GFP_COLD);
+ page = __netdev_alloc_page(gfp | __GFP_NOWARN, NULL);
if (unlikely(!page)) {
fl->alloc_failed++;
break;
diff --git a/drivers/net/ethernet/intel/igb/igb_main.c b/drivers/net/ethernet/intel/igb/igb_main.c
index e91d73c..c062909 100644
--- a/drivers/net/ethernet/intel/igb/igb_main.c
+++ b/drivers/net/ethernet/intel/igb/igb_main.c
@@ -6187,7 +6187,7 @@ static bool igb_alloc_mapped_page(struct igb_ring *rx_ring,
return true;
if (!page) {
- page = alloc_page(GFP_ATOMIC | __GFP_COLD);
+ page = __netdev_alloc_page(GFP_ATOMIC, bi->skb);
bi->page = page;
if (unlikely(!page)) {
rx_ring->rx_stats.alloc_failed++;
diff --git a/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c b/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
index 1ee5d0f..7a011c3 100644
--- a/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
+++ b/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
@@ -1143,7 +1143,7 @@ void ixgbe_alloc_rx_buffers(struct ixgbe_ring *rx_ring, u16 cleaned_count)
if (ring_is_ps_enabled(rx_ring)) {
if (!bi->page) {
- bi->page = alloc_page(GFP_ATOMIC | __GFP_COLD);
+ bi->page = __netdev_alloc_page(GFP_ATOMIC, skb);
if (!bi->page) {
rx_ring->rx_stats.alloc_rx_page_failed++;
goto no_buffers;
diff --git a/drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c b/drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c
index bed411b..f6ea14a 100644
--- a/drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c
+++ b/drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c
@@ -366,7 +366,7 @@ static void ixgbevf_alloc_rx_buffers(struct ixgbevf_adapter *adapter,
if (!bi->page_dma &&
(adapter->flags & IXGBE_FLAG_RX_PS_ENABLED)) {
if (!bi->page) {
- bi->page = alloc_page(GFP_ATOMIC | __GFP_COLD);
+ bi->page = __netdev_alloc_page(GFP_ATOMIC, NULL);
if (!bi->page) {
adapter->alloc_rx_page_failed++;
goto no_buffers;
@@ -400,6 +400,7 @@ static void ixgbevf_alloc_rx_buffers(struct ixgbevf_adapter *adapter,
*/
skb_reserve(skb, NET_IP_ALIGN);
+ propagate_pfmemalloc_skb(bi->page_dma, skb);
bi->skb = skb;
}
if (!bi->dma) {
diff --git a/drivers/net/usb/cdc-phonet.c b/drivers/net/usb/cdc-phonet.c
index 790cbde..51c8b9e 100644
--- a/drivers/net/usb/cdc-phonet.c
+++ b/drivers/net/usb/cdc-phonet.c
@@ -130,7 +130,7 @@ static int rx_submit(struct usbpn_dev *pnd, struct urb *req, gfp_t gfp_flags)
struct page *page;
int err;
- page = alloc_page(gfp_flags);
+ page = __netdev_alloc_page(gfp_flags | __GFP_NOMEMALLOC, NULL);
if (!page)
return -ENOMEM;
diff --git a/drivers/usb/gadget/f_phonet.c b/drivers/usb/gadget/f_phonet.c
index 7cdcb63..a5550dd 100644
--- a/drivers/usb/gadget/f_phonet.c
+++ b/drivers/usb/gadget/f_phonet.c
@@ -301,7 +301,7 @@ pn_rx_submit(struct f_phonet *fp, struct usb_request *req, gfp_t gfp_flags)
struct page *page;
int err;
- page = alloc_page(gfp_flags);
+ page = __netdev_alloc_page(gfp_flags | __GFP_NOMEMALLOC, NULL);
if (!page)
return -ENOMEM;
diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index 17ed022..8da4ca0 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -1696,6 +1696,44 @@ static inline struct sk_buff *netdev_alloc_skb_ip_align(struct net_device *dev,
}
/**
+ * __netdev_alloc_page - allocate a page for ps-rx on a specific device
+ * @gfp_mask: alloc_pages_node mask. Set __GFP_NOMEMALLOC if not for network packet RX
+ * @skb: skb to set pfmemalloc on if __GFP_MEMALLOC is used
+ *
+ * Allocate a new page.
+ *
+ * %NULL is returned if there is no free memory.
+ */
+static inline struct page *__netdev_alloc_page(gfp_t gfp_mask,
+ struct sk_buff *skb)
+{
+ struct page *page;
+
+ gfp_mask |= __GFP_COLD;
+
+ if (!(gfp_mask & __GFP_NOMEMALLOC))
+ gfp_mask |= __GFP_MEMALLOC;
+
+ page = alloc_pages_node(NUMA_NO_NODE, gfp_mask, 0);
+ if (skb && page && page->pfmemalloc)
+ skb->pfmemalloc = true;
+
+ return page;
+}
+
+/**
+ * propagate_pfmemalloc_skb - Propagate pfmemalloc if skb is allocated after RX page
+ * @page: The page that was allocated from netdev_alloc_page
+ * @skb: The skb that may need pfmemalloc set
+ */
+static inline void propagate_pfmemalloc_skb(struct page *page,
+ struct sk_buff *skb)
+{
+ if (page && page->pfmemalloc)
+ skb->pfmemalloc = true;
+}
+
+/**
* skb_frag_page - retrieve the page refered to by a paged fragment
* @frag: the paged fragment
*
--
1.7.3.4
Set SOCK_MEMALLOC on the NBD socket to allow access to the PFMEMALLOC
reserves so that pages backed by NBD, particularly if swap related,
can be cleaned to prevent the machine deadlocking. It is still
possible that the PFMEMALLOC reserves get depleted, resulting in
deadlock, but the administrator can resolve this by increasing
min_free_kbytes.
Signed-off-by: Mel Gorman <[email protected]>
---
drivers/block/nbd.c | 6 +++++-
1 files changed, 5 insertions(+), 1 deletions(-)
diff --git a/drivers/block/nbd.c b/drivers/block/nbd.c
index c3f0ee1..45de594 100644
--- a/drivers/block/nbd.c
+++ b/drivers/block/nbd.c
@@ -155,6 +155,7 @@ static int sock_xmit(struct nbd_device *lo, int send, void *buf, int size,
struct msghdr msg;
struct kvec iov;
sigset_t blocked, oldset;
+ unsigned long pflags = current->flags;
if (unlikely(!sock)) {
dev_err(disk_to_dev(lo->disk),
@@ -168,8 +169,9 @@ static int sock_xmit(struct nbd_device *lo, int send, void *buf, int size,
siginitsetinv(&blocked, sigmask(SIGKILL));
sigprocmask(SIG_SETMASK, &blocked, &oldset);
+ current->flags |= PF_MEMALLOC;
do {
- sock->sk->sk_allocation = GFP_NOIO;
+ sock->sk->sk_allocation = GFP_NOIO | __GFP_MEMALLOC;
iov.iov_base = buf;
iov.iov_len = size;
msg.msg_name = NULL;
@@ -215,6 +217,7 @@ static int sock_xmit(struct nbd_device *lo, int send, void *buf, int size,
} while (size > 0);
sigprocmask(SIG_SETMASK, &oldset, NULL);
+ tsk_restore_flags(current, pflags, PF_MEMALLOC);
return result;
}
@@ -405,6 +408,7 @@ static int nbd_do_it(struct nbd_device *lo)
BUG_ON(lo->magic != LO_MAGIC);
+ sk_set_memalloc(lo->sock->sk);
lo->pid = task_pid_nr(current);
ret = device_create_file(disk_to_dev(lo->disk), &pid_attr);
if (ret) {
--
1.7.3.4
Introduce sk_allocation(). This function allows sock-specific flags
to be injected into each sock-related allocation. It is only used on
allocation paths that may be required for writing pages back to
network storage.
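Call sites then wrap the gfp mask at the point of allocation, for
example (these are the same shapes as the TCP changes below):

	skb = alloc_skb(MAX_TCP_HEADER, sk_allocation(sk, GFP_ATOMIC));

	buff = alloc_skb_fclone(MAX_TCP_HEADER,
				sk_allocation(sk, sk->sk_allocation));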
Signed-off-by: Peter Zijlstra <[email protected]>
Signed-off-by: Mel Gorman <[email protected]>
---
include/net/sock.h | 5 +++++
net/ipv4/tcp.c | 3 ++-
net/ipv4/tcp_output.c | 16 +++++++++-------
net/ipv6/tcp_ipv6.c | 12 +++++++++---
4 files changed, 25 insertions(+), 11 deletions(-)
diff --git a/include/net/sock.h b/include/net/sock.h
index 91c1c8b..a76e858 100644
--- a/include/net/sock.h
+++ b/include/net/sock.h
@@ -612,6 +612,11 @@ static inline int sock_flag(struct sock *sk, enum sock_flags flag)
return test_bit(flag, &sk->sk_flags);
}
+static inline gfp_t sk_allocation(struct sock *sk, gfp_t gfp_mask)
+{
+ return gfp_mask;
+}
+
static inline void sk_acceptq_removed(struct sock *sk)
{
sk->sk_ack_backlog--;
diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
index 06373b4..872f8ad 100644
--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@ -696,7 +696,8 @@ struct sk_buff *sk_stream_alloc_skb(struct sock *sk, int size, gfp_t gfp)
/* The TCP header must be at least 32-bit aligned. */
size = ALIGN(size, 4);
- skb = alloc_skb_fclone(size + sk->sk_prot->max_header, gfp);
+ skb = alloc_skb_fclone(size + sk->sk_prot->max_header,
+ sk_allocation(sk, gfp));
if (skb) {
if (sk_wmem_schedule(sk, skb->truesize)) {
/*
diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
index 4ff3b6d..7720d57 100644
--- a/net/ipv4/tcp_output.c
+++ b/net/ipv4/tcp_output.c
@@ -2343,7 +2343,7 @@ void tcp_send_fin(struct sock *sk)
/* Socket is locked, keep trying until memory is available. */
for (;;) {
skb = alloc_skb_fclone(MAX_TCP_HEADER,
- sk->sk_allocation);
+ sk_allocation(sk, sk->sk_allocation));
if (skb)
break;
yield();
@@ -2369,7 +2369,7 @@ void tcp_send_active_reset(struct sock *sk, gfp_t priority)
struct sk_buff *skb;
/* NOTE: No TCP options attached and we never retransmit this. */
- skb = alloc_skb(MAX_TCP_HEADER, priority);
+ skb = alloc_skb(MAX_TCP_HEADER, sk_allocation(sk, priority));
if (!skb) {
NET_INC_STATS(sock_net(sk), LINUX_MIB_TCPABORTFAILED);
return;
@@ -2442,7 +2442,8 @@ struct sk_buff *tcp_make_synack(struct sock *sk, struct dst_entry *dst,
if (cvp != NULL && cvp->s_data_constant && cvp->s_data_desired)
s_data_desired = cvp->s_data_desired;
- skb = sock_wmalloc(sk, MAX_TCP_HEADER + 15 + s_data_desired, 1, GFP_ATOMIC);
+ skb = sock_wmalloc(sk, MAX_TCP_HEADER + 15 + s_data_desired, 1,
+ sk_allocation(sk, GFP_ATOMIC));
if (skb == NULL)
return NULL;
@@ -2632,7 +2633,8 @@ int tcp_connect(struct sock *sk)
tcp_connect_init(sk);
- buff = alloc_skb_fclone(MAX_TCP_HEADER + 15, sk->sk_allocation);
+ buff = alloc_skb_fclone(MAX_TCP_HEADER + 15,
+ sk_allocation(sk, sk->sk_allocation));
if (unlikely(buff == NULL))
return -ENOBUFS;
@@ -2738,7 +2740,7 @@ void tcp_send_ack(struct sock *sk)
* tcp_transmit_skb() will set the ownership to this
* sock.
*/
- buff = alloc_skb(MAX_TCP_HEADER, GFP_ATOMIC);
+ buff = alloc_skb(MAX_TCP_HEADER, sk_allocation(sk, GFP_ATOMIC));
if (buff == NULL) {
inet_csk_schedule_ack(sk);
inet_csk(sk)->icsk_ack.ato = TCP_ATO_MIN;
@@ -2753,7 +2755,7 @@ void tcp_send_ack(struct sock *sk)
/* Send it off, this clears delayed acks for us. */
TCP_SKB_CB(buff)->when = tcp_time_stamp;
- tcp_transmit_skb(sk, buff, 0, GFP_ATOMIC);
+ tcp_transmit_skb(sk, buff, 0, sk_allocation(sk, GFP_ATOMIC));
}
/* This routine sends a packet with an out of date sequence
@@ -2773,7 +2775,7 @@ static int tcp_xmit_probe_skb(struct sock *sk, int urgent)
struct sk_buff *skb;
/* We don't queue it, tcp_transmit_skb() sets ownership. */
- skb = alloc_skb(MAX_TCP_HEADER, GFP_ATOMIC);
+ skb = alloc_skb(MAX_TCP_HEADER, sk_allocation(sk, GFP_ATOMIC));
if (skb == NULL)
return -1;
diff --git a/net/ipv6/tcp_ipv6.c b/net/ipv6/tcp_ipv6.c
index 3edd05a..0086077 100644
--- a/net/ipv6/tcp_ipv6.c
+++ b/net/ipv6/tcp_ipv6.c
@@ -584,7 +584,8 @@ static int tcp_v6_md5_do_add(struct sock *sk, const struct in6_addr *peer,
} else {
/* reallocate new list if current one is full. */
if (!tp->md5sig_info) {
- tp->md5sig_info = kzalloc(sizeof(*tp->md5sig_info), GFP_ATOMIC);
+ tp->md5sig_info = kzalloc(sizeof(*tp->md5sig_info),
+ sk_allocation(sk, GFP_ATOMIC));
if (!tp->md5sig_info) {
kfree(newkey);
return -ENOMEM;
@@ -598,7 +599,8 @@ static int tcp_v6_md5_do_add(struct sock *sk, const struct in6_addr *peer,
}
if (tp->md5sig_info->alloced6 == tp->md5sig_info->entries6) {
keys = kmalloc((sizeof (tp->md5sig_info->keys6[0]) *
- (tp->md5sig_info->entries6 + 1)), GFP_ATOMIC);
+ (tp->md5sig_info->entries6 + 1)),
+ sk_allocation(sk, GFP_ATOMIC));
if (!keys) {
kfree(newkey);
@@ -722,7 +724,8 @@ static int tcp_v6_parse_md5_keys (struct sock *sk, char __user *optval,
struct tcp_sock *tp = tcp_sk(sk);
struct tcp_md5sig_info *p;
- p = kzalloc(sizeof(struct tcp_md5sig_info), GFP_KERNEL);
+ p = kzalloc(sizeof(struct tcp_md5sig_info),
+ sk_allocation(sk, GFP_KERNEL));
if (!p)
return -ENOMEM;
@@ -1074,6 +1077,7 @@ static void tcp_v6_send_reset(struct sock *sk, struct sk_buff *skb)
const struct tcphdr *th = tcp_hdr(skb);
u32 seq = 0, ack_seq = 0;
struct tcp_md5sig_key *key = NULL;
+ gfp_t gfp_mask = GFP_ATOMIC;
if (th->rst)
return;
@@ -1085,6 +1089,8 @@ static void tcp_v6_send_reset(struct sock *sk, struct sk_buff *skb)
if (sk)
key = tcp_v6_md5_do_lookup(sk, &ipv6_hdr(skb)->saddr);
#endif
+ if (sk)
+ gfp_mask = sk_allocation(sk, gfp_mask);
if (th->ack)
seq = ntohl(th->ack_seq);
--
1.7.3.4
In order to make sure pfmemalloc packets receive all memory
needed to proceed, ensure processing of pfmemalloc SKBs happens
under PF_MEMALLOC. This is limited to a subset of protocols that
are expected to be used for writing to swap. Taps are not allowed to
use PF_MEMALLOC as they are expected to communicate with userspace
processes, which could be paged out.
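The underlying pattern, used for both the receive path and the backlog
path, is to borrow PF_MEMALLOC from whichever task context the
pfmemalloc skb happens to be processed in; a simplified sketch:

	unsigned long pflags = current->flags;

	if (skb_pfmemalloc(skb))
		current->flags |= PF_MEMALLOC;

	/* ... process the packet; allocations may now use the reserves ... */

	tsk_restore_flags(current, pflags, PF_MEMALLOC);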
[[email protected]: Ideas taken from various patches]
[[email protected]: Lock imbalance fix]
Signed-off-by: Mel Gorman <[email protected]>
---
include/net/sock.h | 5 +++++
net/core/dev.c | 52 ++++++++++++++++++++++++++++++++++++++++++++++------
net/core/sock.c | 16 ++++++++++++++++
3 files changed, 67 insertions(+), 6 deletions(-)
diff --git a/include/net/sock.h b/include/net/sock.h
index 81178fa..bb147fd 100644
--- a/include/net/sock.h
+++ b/include/net/sock.h
@@ -697,8 +697,13 @@ static inline __must_check int sk_add_backlog(struct sock *sk, struct sk_buff *s
return 0;
}
+extern int __sk_backlog_rcv(struct sock *sk, struct sk_buff *skb);
+
static inline int sk_backlog_rcv(struct sock *sk, struct sk_buff *skb)
{
+ if (skb_pfmemalloc(skb))
+ return __sk_backlog_rcv(sk, skb);
+
return sk->sk_backlog_rcv(sk, skb);
}
diff --git a/net/core/dev.c b/net/core/dev.c
index 115dee1..e4b240a 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -3162,6 +3162,23 @@ void netdev_rx_handler_unregister(struct net_device *dev)
}
EXPORT_SYMBOL_GPL(netdev_rx_handler_unregister);
+/*
+ * Limit the use of PFMEMALLOC reserves to those protocols that implement
+ * the special handling of PFMEMALLOC skbs.
+ */
+static bool skb_pfmemalloc_protocol(struct sk_buff *skb)
+{
+ switch (skb->protocol) {
+ case __constant_htons(ETH_P_ARP):
+ case __constant_htons(ETH_P_IP):
+ case __constant_htons(ETH_P_IPV6):
+ case __constant_htons(ETH_P_8021Q):
+ return true;
+ default:
+ return false;
+ }
+}
+
static int __netif_receive_skb(struct sk_buff *skb)
{
struct packet_type *ptype, *pt_prev;
@@ -3171,14 +3188,27 @@ static int __netif_receive_skb(struct sk_buff *skb)
bool deliver_exact = false;
int ret = NET_RX_DROP;
__be16 type;
+ unsigned long pflags = current->flags;
net_timestamp_check(!netdev_tstamp_prequeue, skb);
trace_netif_receive_skb(skb);
+ /*
+ * PFMEMALLOC skbs are special, they should
+ * - be delivered to SOCK_MEMALLOC sockets only
+ * - stay away from userspace
+ * - have bounded memory usage
+ *
+ * Use PF_MEMALLOC as this saves us from propagating the allocation
+ * context down to all allocation sites.
+ */
+ if (skb_pfmemalloc(skb))
+ current->flags |= PF_MEMALLOC;
+
/* if we've gotten here through NAPI, check netpoll */
if (netpoll_receive_skb(skb))
- return NET_RX_DROP;
+ goto out;
if (!skb->skb_iif)
skb->skb_iif = skb->dev->ifindex;
@@ -3199,7 +3229,7 @@ another_round:
if (skb->protocol == cpu_to_be16(ETH_P_8021Q)) {
skb = vlan_untag(skb);
if (unlikely(!skb))
- goto out;
+ goto unlock;
}
#ifdef CONFIG_NET_CLS_ACT
@@ -3209,6 +3239,9 @@ another_round:
}
#endif
+ if (skb_pfmemalloc(skb))
+ goto skip_taps;
+
list_for_each_entry_rcu(ptype, &ptype_all, list) {
if (!ptype->dev || ptype->dev == skb->dev) {
if (pt_prev)
@@ -3217,13 +3250,17 @@ another_round:
}
}
+skip_taps:
#ifdef CONFIG_NET_CLS_ACT
skb = handle_ing(skb, &pt_prev, &ret, orig_dev);
if (!skb)
- goto out;
+ goto unlock;
ncls:
#endif
+ if (skb_pfmemalloc(skb) && !skb_pfmemalloc_protocol(skb))
+ goto drop;
+
rx_handler = rcu_dereference(skb->dev->rx_handler);
if (vlan_tx_tag_present(skb)) {
if (pt_prev) {
@@ -3233,7 +3270,7 @@ ncls:
if (vlan_do_receive(&skb, !rx_handler))
goto another_round;
else if (unlikely(!skb))
- goto out;
+ goto unlock;
}
if (rx_handler) {
@@ -3243,7 +3280,7 @@ ncls:
}
switch (rx_handler(&skb)) {
case RX_HANDLER_CONSUMED:
- goto out;
+ goto unlock;
case RX_HANDLER_ANOTHER:
goto another_round;
case RX_HANDLER_EXACT:
@@ -3273,6 +3310,7 @@ ncls:
if (pt_prev) {
ret = pt_prev->func(skb, skb->dev, pt_prev, orig_dev);
} else {
+drop:
atomic_long_inc(&skb->dev->rx_dropped);
kfree_skb(skb);
/* Jamal, now you will not able to escape explaining
@@ -3281,8 +3319,10 @@ ncls:
ret = NET_RX_DROP;
}
-out:
+unlock:
rcu_read_unlock();
+out:
+ tsk_restore_flags(current, pflags, PF_MEMALLOC);
return ret;
}
diff --git a/net/core/sock.c b/net/core/sock.c
index 85ba27b..0aebbde 100644
--- a/net/core/sock.c
+++ b/net/core/sock.c
@@ -293,6 +293,22 @@ void sk_clear_memalloc(struct sock *sk)
}
EXPORT_SYMBOL_GPL(sk_clear_memalloc);
+int __sk_backlog_rcv(struct sock *sk, struct sk_buff *skb)
+{
+ int ret;
+ unsigned long pflags = current->flags;
+
+ /* these should have been dropped before queueing */
+ BUG_ON(!sock_flag(sk, SOCK_MEMALLOC));
+
+ current->flags |= PF_MEMALLOC;
+ ret = sk->sk_backlog_rcv(sk, skb);
+ tsk_restore_flags(current, pflags, PF_MEMALLOC);
+
+ return ret;
+}
+EXPORT_SYMBOL(__sk_backlog_rcv);
+
#if defined(CONFIG_CGROUPS)
#if !defined(CONFIG_NET_CLS_CGROUP)
int net_cls_subsys_id = -1;
--
1.7.3.4
The skb->pfmemalloc flag gets set to true only if the PFMEMALLOC
reserves were used during the slab allocation of data in __alloc_skb.
If the packet is fragmented, it is possible that pages will be
allocated from the PFMEMALLOC reserve without propagating this
information to the skb. This patch propagates page->pfmemalloc
from pages allocated for fragments to the skb.
Signed-off-by: Mel Gorman <[email protected]>
---
include/linux/skbuff.h | 11 +++++++++++
1 files changed, 11 insertions(+), 0 deletions(-)
diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index ca05362..17ed022 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -1199,6 +1199,17 @@ static inline void __skb_fill_page_desc(struct sk_buff *skb, int i,
{
skb_frag_t *frag = &skb_shinfo(skb)->frags[i];
+ /*
+ * Propagate page->pfmemalloc to the skb if we can. The problem is
+ * that not all callers have unique ownership of the page. If
+ * pfmemalloc is set, we check the mapping as a mapping implies
+ * page->index is set (index and pfmemalloc share space).
+ * If it's a valid mapping, we cannot use page->pfmemalloc but we
+ * do not lose pfmemalloc information as the pages would not be
+ * allocated using __GFP_MEMALLOC.
+ */
+ if (page->pfmemalloc && !page->mapping)
+ skb->pfmemalloc = true;
frag->page.p = page;
frag->page_offset = off;
skb_frag_size_set(frag, size);
--
1.7.3.4
Change the skb allocation API to indicate RX usage and use this to fall
back to the PFMEMALLOC reserve when needed. SKBs allocated from the
reserve are tagged in skb->pfmemalloc. If an SKB is allocated from
the reserve and the socket is later found to be unrelated to page
reclaim, the packet is dropped so that the memory remains available
for page reclaim. Network protocols are expected to recover from this
packet loss.
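Receive-side allocators request the fallback explicitly, and later
checks decide whether the data may be kept, roughly as follows (a
condensed sketch of the changes below):

	/* RX allocation: may fall back to the PFMEMALLOC reserve */
	skb = __alloc_skb(length + NET_SKB_PAD, gfp_mask,
			  SKB_ALLOC_RX, NUMA_NO_NODE);

	/* Later: a socket that is not SOCK_MEMALLOC must not hold on to it */
	if (skb_pfmemalloc(skb) && !sock_flag(sk, SOCK_MEMALLOC))
		return -ENOMEM;	/* dropped; the protocol is expected to recover */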
[[email protected]: Ideas taken from various patches]
Signed-off-by: Mel Gorman <[email protected]>
---
include/linux/gfp.h | 3 ++
include/linux/skbuff.h | 17 +++++++--
include/net/sock.h | 6 +++
mm/internal.h | 3 --
net/core/filter.c | 8 ++++
net/core/skbuff.c | 93 ++++++++++++++++++++++++++++++++++++++++-------
net/core/sock.c | 4 ++
7 files changed, 114 insertions(+), 20 deletions(-)
diff --git a/include/linux/gfp.h b/include/linux/gfp.h
index 94af4a2..83cd7b6 100644
--- a/include/linux/gfp.h
+++ b/include/linux/gfp.h
@@ -385,6 +385,9 @@ void drain_local_pages(void *dummy);
*/
extern gfp_t gfp_allowed_mask;
+/* Returns true if the gfp_mask allows use of ALLOC_NO_WATERMARK */
+bool gfp_pfmemalloc_allowed(gfp_t gfp_mask);
+
extern void pm_restrict_gfp_mask(void);
extern void pm_restore_gfp_mask(void);
diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index 50db9b0..ca05362 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -452,6 +452,7 @@ struct sk_buff {
#ifdef CONFIG_IPV6_NDISC_NODETYPE
__u8 ndisc_nodetype:2;
#endif
+ __u8 pfmemalloc:1;
__u8 ooo_okay:1;
__u8 l4_rxhash:1;
__u8 wifi_acked_valid:1;
@@ -492,6 +493,15 @@ struct sk_buff {
#include <asm/system.h>
+#define SKB_ALLOC_FCLONE 0x01
+#define SKB_ALLOC_RX 0x02
+
+/* Returns true if the skb was allocated from PFMEMALLOC reserves */
+static inline bool skb_pfmemalloc(struct sk_buff *skb)
+{
+ return unlikely(skb->pfmemalloc);
+}
+
/*
* skb might have a dst pointer attached, refcounted or not.
* _skb_refdst low order bit is set if refcount was _not_ taken
@@ -549,7 +559,7 @@ extern void kfree_skb(struct sk_buff *skb);
extern void consume_skb(struct sk_buff *skb);
extern void __kfree_skb(struct sk_buff *skb);
extern struct sk_buff *__alloc_skb(unsigned int size,
- gfp_t priority, int fclone, int node);
+ gfp_t priority, int flags, int node);
extern struct sk_buff *build_skb(void *data);
static inline struct sk_buff *alloc_skb(unsigned int size,
gfp_t priority)
@@ -560,7 +570,7 @@ static inline struct sk_buff *alloc_skb(unsigned int size,
static inline struct sk_buff *alloc_skb_fclone(unsigned int size,
gfp_t priority)
{
- return __alloc_skb(size, priority, 1, NUMA_NO_NODE);
+ return __alloc_skb(size, priority, SKB_ALLOC_FCLONE, NUMA_NO_NODE);
}
extern void skb_recycle(struct sk_buff *skb);
@@ -1627,7 +1637,8 @@ static inline void __skb_queue_purge(struct sk_buff_head *list)
static inline struct sk_buff *__dev_alloc_skb(unsigned int length,
gfp_t gfp_mask)
{
- struct sk_buff *skb = alloc_skb(length + NET_SKB_PAD, gfp_mask);
+ struct sk_buff *skb = __alloc_skb(length + NET_SKB_PAD, gfp_mask,
+ SKB_ALLOC_RX, NUMA_NO_NODE);
if (likely(skb))
skb_reserve(skb, NET_SKB_PAD);
return skb;
diff --git a/include/net/sock.h b/include/net/sock.h
index 82b2148..81178fa 100644
--- a/include/net/sock.h
+++ b/include/net/sock.h
@@ -613,6 +613,12 @@ static inline int sock_flag(struct sock *sk, enum sock_flags flag)
return test_bit(flag, &sk->sk_flags);
}
+extern atomic_t memalloc_socks;
+static inline int sk_memalloc_socks(void)
+{
+ return atomic_read(&memalloc_socks);
+}
+
static inline gfp_t sk_allocation(struct sock *sk, gfp_t gfp_mask)
{
return gfp_mask | (sk->sk_allocation & __GFP_MEMALLOC);
diff --git a/mm/internal.h b/mm/internal.h
index bff60d8..2189af4 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -239,9 +239,6 @@ static inline struct page *mem_map_next(struct page *iter,
#define __paginginit __init
#endif
-/* Returns true if the gfp_mask allows use of ALLOC_NO_WATERMARK */
-bool gfp_pfmemalloc_allowed(gfp_t gfp_mask);
-
/* Memory initialisation debug and verification */
enum mminit_level {
MMINIT_WARNING,
diff --git a/net/core/filter.c b/net/core/filter.c
index 5dea452..92f18fc 100644
--- a/net/core/filter.c
+++ b/net/core/filter.c
@@ -80,6 +80,14 @@ int sk_filter(struct sock *sk, struct sk_buff *skb)
int err;
struct sk_filter *filter;
+ /*
+ * If the skb was allocated from pfmemalloc reserves, only
+ * allow SOCK_MEMALLOC sockets to use it as this socket is
+ * helping free memory
+ */
+ if (skb_pfmemalloc(skb) && !sock_flag(sk, SOCK_MEMALLOC))
+ return -ENOMEM;
+
err = security_sock_rcv_skb(sk, skb);
if (err)
return err;
diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index da0c97f..074052f 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -147,6 +147,43 @@ static void skb_under_panic(struct sk_buff *skb, int sz, void *here)
BUG();
}
+
+/*
+ * kmalloc_reserve is a wrapper around kmalloc_node_track_caller that tells
+ * the caller if emergency pfmemalloc reserves are being used. If it is and
+ * the socket is later found to be SOCK_MEMALLOC then PFMEMALLOC reserves
+ * may be used. Otherwise, the packet data may be discarded until enough
+ * memory is free
+ */
+#define kmalloc_reserve(size, gfp, node, pfmemalloc) \
+ __kmalloc_reserve(size, gfp, node, _RET_IP_, pfmemalloc)
+void *__kmalloc_reserve(size_t size, gfp_t flags, int node, unsigned long ip,
+ bool *pfmemalloc)
+{
+ void *obj;
+ bool ret_pfmemalloc = false;
+
+ /*
+ * Try a regular allocation, when that fails and we're not entitled
+ * to the reserves, fail.
+ */
+ obj = kmalloc_node_track_caller(size,
+ flags | __GFP_NOMEMALLOC | __GFP_NOWARN,
+ node);
+ if (obj || !(gfp_pfmemalloc_allowed(flags)))
+ goto out;
+
+ /* Try again but now we are using pfmemalloc reserves */
+ ret_pfmemalloc = true;
+ obj = kmalloc_node_track_caller(size, flags, node);
+
+out:
+ if (pfmemalloc)
+ *pfmemalloc = ret_pfmemalloc;
+
+ return obj;
+}
+
/* Allocate a new skbuff. We do this ourselves so we can fill in a few
* 'private' fields and also do memory statistics to find all the
* [BEEP] leaks.
@@ -157,8 +194,10 @@ static void skb_under_panic(struct sk_buff *skb, int sz, void *here)
* __alloc_skb - allocate a network buffer
* @size: size to allocate
* @gfp_mask: allocation mask
- * @fclone: allocate from fclone cache instead of head cache
- * and allocate a cloned (child) skb
+ * @flags: If SKB_ALLOC_FCLONE is set, allocate from fclone cache
+ * instead of head cache and allocate a cloned (child) skb.
+ * If SKB_ALLOC_RX is set, __GFP_MEMALLOC will be used for
+ * allocations in case the data is required for writeback
* @node: numa node to allocate memory on
*
* Allocate a new &sk_buff. The returned buffer has no headroom and a
@@ -169,14 +208,19 @@ static void skb_under_panic(struct sk_buff *skb, int sz, void *here)
* %GFP_ATOMIC.
*/
struct sk_buff *__alloc_skb(unsigned int size, gfp_t gfp_mask,
- int fclone, int node)
+ int flags, int node)
{
struct kmem_cache *cache;
struct skb_shared_info *shinfo;
struct sk_buff *skb;
u8 *data;
+ bool pfmemalloc;
+
+ cache = (flags & SKB_ALLOC_FCLONE)
+ ? skbuff_fclone_cache : skbuff_head_cache;
- cache = fclone ? skbuff_fclone_cache : skbuff_head_cache;
+ if (sk_memalloc_socks() && (flags & SKB_ALLOC_RX))
+ gfp_mask |= __GFP_MEMALLOC;
/* Get the HEAD */
skb = kmem_cache_alloc_node(cache, gfp_mask & ~__GFP_DMA, node);
@@ -191,7 +235,7 @@ struct sk_buff *__alloc_skb(unsigned int size, gfp_t gfp_mask,
*/
size = SKB_DATA_ALIGN(size);
size += SKB_DATA_ALIGN(sizeof(struct skb_shared_info));
- data = kmalloc_node_track_caller(size, gfp_mask, node);
+ data = kmalloc_reserve(size, gfp_mask, node, &pfmemalloc);
if (!data)
goto nodata;
/* kmalloc(size) might give us more room than requested.
@@ -209,6 +253,7 @@ struct sk_buff *__alloc_skb(unsigned int size, gfp_t gfp_mask,
memset(skb, 0, offsetof(struct sk_buff, tail));
/* Account for allocated memory : skb + skb->head */
skb->truesize = SKB_TRUESIZE(size);
+ skb->pfmemalloc = pfmemalloc;
atomic_set(&skb->users, 1);
skb->head = data;
skb->data = data;
@@ -224,7 +269,7 @@ struct sk_buff *__alloc_skb(unsigned int size, gfp_t gfp_mask,
atomic_set(&shinfo->dataref, 1);
kmemcheck_annotate_variable(shinfo->destructor_arg);
- if (fclone) {
+ if (flags & SKB_ALLOC_FCLONE) {
struct sk_buff *child = skb + 1;
atomic_t *fclone_ref = (atomic_t *) (child + 1);
@@ -234,6 +279,7 @@ struct sk_buff *__alloc_skb(unsigned int size, gfp_t gfp_mask,
atomic_set(fclone_ref, 1);
child->fclone = SKB_FCLONE_UNAVAILABLE;
+ child->pfmemalloc = pfmemalloc;
}
out:
return skb;
@@ -311,7 +357,8 @@ struct sk_buff *__netdev_alloc_skb(struct net_device *dev,
{
struct sk_buff *skb;
- skb = __alloc_skb(length + NET_SKB_PAD, gfp_mask, 0, NUMA_NO_NODE);
+ skb = __alloc_skb(length + NET_SKB_PAD, gfp_mask,
+ SKB_ALLOC_RX, NUMA_NO_NODE);
if (likely(skb)) {
skb_reserve(skb, NET_SKB_PAD);
skb->dev = dev;
@@ -605,6 +652,7 @@ static void __copy_skb_header(struct sk_buff *new, const struct sk_buff *old)
#if IS_ENABLED(CONFIG_IP_VS)
new->ipvs_property = old->ipvs_property;
#endif
+ new->pfmemalloc = old->pfmemalloc;
new->protocol = old->protocol;
new->mark = old->mark;
new->skb_iif = old->skb_iif;
@@ -763,6 +811,9 @@ struct sk_buff *skb_clone(struct sk_buff *skb, gfp_t gfp_mask)
n->fclone = SKB_FCLONE_CLONE;
atomic_inc(fclone_ref);
} else {
+ if (skb_pfmemalloc(skb))
+ gfp_mask |= __GFP_MEMALLOC;
+
n = kmem_cache_alloc(skbuff_head_cache, gfp_mask);
if (!n)
return NULL;
@@ -799,6 +850,13 @@ static void copy_skb_header(struct sk_buff *new, const struct sk_buff *old)
skb_shinfo(new)->gso_type = skb_shinfo(old)->gso_type;
}
+static inline int skb_alloc_rx_flag(const struct sk_buff *skb)
+{
+ if (skb_pfmemalloc((struct sk_buff *)skb))
+ return SKB_ALLOC_RX;
+ return 0;
+}
+
/**
* skb_copy - create private copy of an sk_buff
* @skb: buffer to copy
@@ -820,7 +878,8 @@ struct sk_buff *skb_copy(const struct sk_buff *skb, gfp_t gfp_mask)
{
int headerlen = skb_headroom(skb);
unsigned int size = (skb_end_pointer(skb) - skb->head) + skb->data_len;
- struct sk_buff *n = alloc_skb(size, gfp_mask);
+ struct sk_buff *n = __alloc_skb(size, gfp_mask,
+ skb_alloc_rx_flag(skb), NUMA_NO_NODE);
if (!n)
return NULL;
@@ -855,7 +914,8 @@ EXPORT_SYMBOL(skb_copy);
struct sk_buff *__pskb_copy(struct sk_buff *skb, int headroom, gfp_t gfp_mask)
{
unsigned int size = skb_headlen(skb) + headroom;
- struct sk_buff *n = alloc_skb(size, gfp_mask);
+ struct sk_buff *n = __alloc_skb(size, gfp_mask,
+ skb_alloc_rx_flag(skb), NUMA_NO_NODE);
if (!n)
goto out;
@@ -952,7 +1012,10 @@ int pskb_expand_head(struct sk_buff *skb, int nhead, int ntail,
goto adjust_others;
}
- data = kmalloc(size + sizeof(struct skb_shared_info), gfp_mask);
+ if (skb_pfmemalloc(skb))
+ gfp_mask |= __GFP_MEMALLOC;
+ data = kmalloc_reserve(size + sizeof(struct skb_shared_info), gfp_mask,
+ NUMA_NO_NODE, NULL);
if (!data)
goto nodata;
@@ -1060,8 +1123,9 @@ struct sk_buff *skb_copy_expand(const struct sk_buff *skb,
/*
* Allocate the copy buffer
*/
- struct sk_buff *n = alloc_skb(newheadroom + skb->len + newtailroom,
- gfp_mask);
+ struct sk_buff *n = __alloc_skb(newheadroom + skb->len + newtailroom,
+ gfp_mask, skb_alloc_rx_flag(skb),
+ NUMA_NO_NODE);
int oldheadroom = skb_headroom(skb);
int head_copy_len, head_copy_off;
int off;
@@ -2727,8 +2791,9 @@ struct sk_buff *skb_segment(struct sk_buff *skb, netdev_features_t features)
skb_release_head_state(nskb);
__skb_push(nskb, doffset);
} else {
- nskb = alloc_skb(hsize + doffset + headroom,
- GFP_ATOMIC);
+ nskb = __alloc_skb(hsize + doffset + headroom,
+ GFP_ATOMIC, skb_alloc_rx_flag(skb),
+ NUMA_NO_NODE);
if (unlikely(!nskb))
goto err;
diff --git a/net/core/sock.c b/net/core/sock.c
index 03069e0..85ba27b 100644
--- a/net/core/sock.c
+++ b/net/core/sock.c
@@ -267,6 +267,8 @@ __u32 sysctl_rmem_default __read_mostly = SK_RMEM_MAX;
int sysctl_optmem_max __read_mostly = sizeof(unsigned long)*(2*UIO_MAXIOV+512);
EXPORT_SYMBOL(sysctl_optmem_max);
+atomic_t memalloc_socks __read_mostly;
+
/**
* sk_set_memalloc - sets %SOCK_MEMALLOC
* @sk: socket to set it on
@@ -279,6 +281,7 @@ void sk_set_memalloc(struct sock *sk)
{
sock_set_flag(sk, SOCK_MEMALLOC);
sk->sk_allocation |= __GFP_MEMALLOC;
+ atomic_inc(&memalloc_socks);
}
EXPORT_SYMBOL_GPL(sk_set_memalloc);
@@ -286,6 +289,7 @@ void sk_clear_memalloc(struct sock *sk)
{
sock_reset_flag(sk, SOCK_MEMALLOC);
sk->sk_allocation &= ~__GFP_MEMALLOC;
+ atomic_dec(&memalloc_socks);
}
EXPORT_SYMBOL_GPL(sk_clear_memalloc);
--
1.7.3.4
The reserve is proportionally distributed over all !highmem zones
in the system. So we need to allow an emergency allocation access to
all zones. In order to do that we need to break out of any mempolicy
boundaries we might have.
In my opinion that does not break mempolicies as those are user
oriented and not system oriented. That is, system allocations are
not guaranteed to be within mempolicy boundaries. For instance IRQs
do not even have a mempolicy.
So breaking out of mempolicy boundaries for 'rare' emergency
allocations, which are always system allocations (as opposed to user),
is ok.
Signed-off-by: Peter Zijlstra <[email protected]>
Signed-off-by: Mel Gorman <[email protected]>
---
mm/page_alloc.c | 7 +++++++
1 files changed, 7 insertions(+), 0 deletions(-)
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index b462585..7c26962 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -2256,6 +2256,13 @@ rebalance:
/* Allocate without watermarks if the context allows */
if (alloc_flags & ALLOC_NO_WATERMARKS) {
+ /*
+ * Ignore mempolicies if ALLOC_NO_WATERMARKS on the grounds
+ * the allocation is high priority and these type of
+ * allocations are system rather than user orientated
+ */
+ zonelist = node_zonelist(numa_node_id(), gfp_mask);
+
page = __alloc_pages_high_priority(gfp_mask, order,
zonelist, high_zoneidx, nodemask,
preferred_zone, migratetype);
--
1.7.3.4
There is a race between the min_free_kbytes sysctl, memory hotplug
and transparent hugepage support enablement. Memory hotplug uses a
zonelists_mutex to avoid a race when building zonelists. Reuse it to
serialise watermark updates.
[[email protected]: Older patch fixed the race with spinlock]
Signed-off-by: Mel Gorman <[email protected]>
---
mm/page_alloc.c | 23 +++++++++++++++--------
1 files changed, 15 insertions(+), 8 deletions(-)
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index d2186ec..8b3b8cf 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -4932,14 +4932,7 @@ static void setup_per_zone_lowmem_reserve(void)
calculate_totalreserve_pages();
}
-/**
- * setup_per_zone_wmarks - called when min_free_kbytes changes
- * or when memory is hot-{added|removed}
- *
- * Ensures that the watermark[min,low,high] values for each zone are set
- * correctly with respect to min_free_kbytes.
- */
-void setup_per_zone_wmarks(void)
+static void __setup_per_zone_wmarks(void)
{
unsigned long pages_min = min_free_kbytes >> (PAGE_SHIFT - 10);
unsigned long lowmem_pages = 0;
@@ -4994,6 +4987,20 @@ void setup_per_zone_wmarks(void)
calculate_totalreserve_pages();
}
+/**
+ * setup_per_zone_wmarks - called when min_free_kbytes changes
+ * or when memory is hot-{added|removed}
+ *
+ * Ensures that the watermark[min,low,high] values for each zone are set
+ * correctly with respect to min_free_kbytes.
+ */
+void setup_per_zone_wmarks(void)
+{
+ mutex_lock(&zonelists_mutex);
+ __setup_per_zone_wmarks();
+ mutex_unlock(&zonelists_mutex);
+}
+
/*
* The inactive anon list should be small enough that the VM never has to
* do too much work, but large enough that each inactive page has a chance
--
1.7.3.4
__GFP_MEMALLOC will allow the allocation to disregard the watermarks,
much like PF_MEMALLOC. It allows one to pass along the memalloc state
in object-related allocation flags as opposed to task-related flags,
such as sk->sk_allocation. This removes the need for ALLOC_PFMEMALLOC
as callers using __GFP_MEMALLOC can get the ALLOC_NO_WATERMARKS flag,
which is now enough to identify allocations related to page reclaim.
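A caller cleaning pages can then request reserve access on a
per-allocation basis; note that __GFP_NOMEMALLOC takes precedence if
both flags are set (sketch):

	/* May go below the watermarks while servicing the VM */
	skb = alloc_skb(size, GFP_ATOMIC | __GFP_MEMALLOC);

	/* __GFP_NOMEMALLOC overrides __GFP_MEMALLOC, so this allocation
	 * cannot use the emergency reserves
	 */
	page = alloc_page(GFP_ATOMIC | __GFP_MEMALLOC | __GFP_NOMEMALLOC);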
Signed-off-by: Peter Zijlstra <[email protected]>
Signed-off-by: Mel Gorman <[email protected]>
---
include/linux/gfp.h | 10 ++++++++--
include/linux/mm_types.h | 2 +-
include/trace/events/gfpflags.h | 1 +
mm/page_alloc.c | 14 ++++++--------
mm/slab.c | 2 +-
5 files changed, 17 insertions(+), 12 deletions(-)
diff --git a/include/linux/gfp.h b/include/linux/gfp.h
index 581e74b..94af4a2 100644
--- a/include/linux/gfp.h
+++ b/include/linux/gfp.h
@@ -23,6 +23,7 @@ struct vm_area_struct;
#define ___GFP_REPEAT 0x400u
#define ___GFP_NOFAIL 0x800u
#define ___GFP_NORETRY 0x1000u
+#define ___GFP_MEMALLOC 0x2000u
#define ___GFP_COMP 0x4000u
#define ___GFP_ZERO 0x8000u
#define ___GFP_NOMEMALLOC 0x10000u
@@ -76,9 +77,14 @@ struct vm_area_struct;
#define __GFP_REPEAT ((__force gfp_t)___GFP_REPEAT) /* See above */
#define __GFP_NOFAIL ((__force gfp_t)___GFP_NOFAIL) /* See above */
#define __GFP_NORETRY ((__force gfp_t)___GFP_NORETRY) /* See above */
+#define __GFP_MEMALLOC ((__force gfp_t)___GFP_MEMALLOC)/* Allow access to emergency reserves */
#define __GFP_COMP ((__force gfp_t)___GFP_COMP) /* Add compound page metadata */
#define __GFP_ZERO ((__force gfp_t)___GFP_ZERO) /* Return zeroed page on success */
-#define __GFP_NOMEMALLOC ((__force gfp_t)___GFP_NOMEMALLOC) /* Don't use emergency reserves */
+#define __GFP_NOMEMALLOC ((__force gfp_t)___GFP_NOMEMALLOC) /* Don't use emergency reserves.
+ * This takes precedence over the
+ * __GFP_MEMALLOC flag if both are
+ * set
+ */
#define __GFP_HARDWALL ((__force gfp_t)___GFP_HARDWALL) /* Enforce hardwall cpuset memory allocs */
#define __GFP_THISNODE ((__force gfp_t)___GFP_THISNODE)/* No fallback, no policies */
#define __GFP_RECLAIMABLE ((__force gfp_t)___GFP_RECLAIMABLE) /* Page is reclaimable */
@@ -129,7 +135,7 @@ struct vm_area_struct;
/* Control page allocator reclaim behavior */
#define GFP_RECLAIM_MASK (__GFP_WAIT|__GFP_HIGH|__GFP_IO|__GFP_FS|\
__GFP_NOWARN|__GFP_REPEAT|__GFP_NOFAIL|\
- __GFP_NORETRY|__GFP_NOMEMALLOC)
+ __GFP_NORETRY|__GFP_MEMALLOC|__GFP_NOMEMALLOC)
/* Control slab gfp mask during early boot */
#define GFP_BOOT_MASK (__GFP_BITS_MASK & ~(__GFP_WAIT|__GFP_IO|__GFP_FS))
diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index 56a465f..7718903 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -54,7 +54,7 @@ struct page {
pgoff_t index; /* Our offset within mapping. */
void *freelist; /* slub first free object */
bool pfmemalloc; /* If set by the page allocator,
- * ALLOC_PFMEMALLOC was set
+ * ALLOC_NO_WATERMARKS was set
* and the low watermark was not
* met implying that the system
* is under some pressure. The
diff --git a/include/trace/events/gfpflags.h b/include/trace/events/gfpflags.h
index 9fe3a366..d6fd8e5 100644
--- a/include/trace/events/gfpflags.h
+++ b/include/trace/events/gfpflags.h
@@ -30,6 +30,7 @@
{(unsigned long)__GFP_COMP, "GFP_COMP"}, \
{(unsigned long)__GFP_ZERO, "GFP_ZERO"}, \
{(unsigned long)__GFP_NOMEMALLOC, "GFP_NOMEMALLOC"}, \
+ {(unsigned long)__GFP_MEMALLOC, "GFP_MEMALLOC"}, \
{(unsigned long)__GFP_HARDWALL, "GFP_HARDWALL"}, \
{(unsigned long)__GFP_THISNODE, "GFP_THISNODE"}, \
{(unsigned long)__GFP_RECLAIMABLE, "GFP_RECLAIMABLE"}, \
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 6a3fa1c..91a762d 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1429,7 +1429,6 @@ failed:
#define ALLOC_HARDER 0x10 /* try to alloc harder */
#define ALLOC_HIGH 0x20 /* __GFP_HIGH set */
#define ALLOC_CPUSET 0x40 /* check for correct cpuset */
-#define ALLOC_PFMEMALLOC 0x80 /* Caller has PF_MEMALLOC set */
#ifdef CONFIG_FAIL_PAGE_ALLOC
@@ -2173,11 +2172,10 @@ gfp_to_alloc_flags(gfp_t gfp_mask)
} else if (unlikely(rt_task(current)) && !in_interrupt())
alloc_flags |= ALLOC_HARDER;
- if ((current->flags & PF_MEMALLOC) ||
- unlikely(test_thread_flag(TIF_MEMDIE))) {
- alloc_flags |= ALLOC_PFMEMALLOC;
-
- if (likely(!(gfp_mask & __GFP_NOMEMALLOC)) && !in_interrupt())
+ if (likely(!(gfp_mask & __GFP_NOMEMALLOC))) {
+ if (gfp_mask & __GFP_MEMALLOC)
+ alloc_flags |= ALLOC_NO_WATERMARKS;
+ else if (likely(!(gfp_mask & __GFP_NOMEMALLOC)) && !in_interrupt())
alloc_flags |= ALLOC_NO_WATERMARKS;
}
@@ -2186,7 +2184,7 @@ gfp_to_alloc_flags(gfp_t gfp_mask)
bool gfp_pfmemalloc_allowed(gfp_t gfp_mask)
{
- return !!(gfp_to_alloc_flags(gfp_mask) & ALLOC_PFMEMALLOC);
+ return !!(gfp_to_alloc_flags(gfp_mask) & ALLOC_NO_WATERMARKS);
}
static inline struct page *
@@ -2381,7 +2379,7 @@ got_pg:
* steps that will free more memory. The caller should avoid the
* page being used for !PFMEMALLOC purposes.
*/
- page->pfmemalloc = !!(alloc_flags & ALLOC_PFMEMALLOC);
+ page->pfmemalloc = !!(alloc_flags & ALLOC_NO_WATERMARKS);
return page;
}
diff --git a/mm/slab.c b/mm/slab.c
index e6a1e3d..95c96b9 100644
--- a/mm/slab.c
+++ b/mm/slab.c
@@ -3050,7 +3050,7 @@ static int cache_grow(struct kmem_cache *cachep,
if (!slabp)
goto opps1;
- /* Record if ALLOC_PFMEMALLOC was set when allocating the slab */
+ /* Record if ALLOC_NO_WATERMARKS was set when allocating the slab */
if (pfmemalloc) {
struct array_cache *ac = cpu_cache_get(cachep);
slabp->pfmemalloc = true;
--
1.7.3.4
This is needed to allow network softirq packet processing to make
use of PF_MEMALLOC.
Currently softirq context cannot use PF_MEMALLOC due to it not being
associated with a task, and therefore not having task flags to fiddle
with - thus the gfp to alloc flag mapping ignores the task flags when
in interrupt (hard or soft) context.
Allowing softirqs to make use of PF_MEMALLOC therefore requires some
trickery. We basically borrow the task flags from whatever process
happens to be preempted by the softirq.
So we modify the gfp to alloc flags mapping to not exclude task flags
in softirq context, and modify the softirq code to save, clear and
restore the PF_MEMALLOC flag.
The save and clear ensure that the preempted task's PF_MEMALLOC flag
doesn't leak into the softirq. The restore ensures a softirq's
PF_MEMALLOC flag cannot leak back into the preempted process.
Signed-off-by: Peter Zijlstra <[email protected]>
Signed-off-by: Mel Gorman <[email protected]>
---
include/linux/sched.h | 7 +++++++
kernel/softirq.c | 3 +++
mm/page_alloc.c | 5 ++++-
3 files changed, 14 insertions(+), 1 deletions(-)
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 2234985..f000bd4 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1888,6 +1888,13 @@ static inline void rcu_copy_process(struct task_struct *p)
#endif
+static inline void tsk_restore_flags(struct task_struct *p,
+ unsigned long pflags, unsigned long mask)
+{
+ p->flags &= ~mask;
+ p->flags |= pflags & mask;
+}
+
#ifdef CONFIG_SMP
extern void do_set_cpus_allowed(struct task_struct *p,
const struct cpumask *new_mask);
diff --git a/kernel/softirq.c b/kernel/softirq.c
index 4eb3a0f..70abb53 100644
--- a/kernel/softirq.c
+++ b/kernel/softirq.c
@@ -210,6 +210,8 @@ asmlinkage void __do_softirq(void)
__u32 pending;
int max_restart = MAX_SOFTIRQ_RESTART;
int cpu;
+ unsigned long pflags = current->flags;
+ current->flags &= ~PF_MEMALLOC;
pending = local_softirq_pending();
account_system_vtime(current);
@@ -265,6 +267,7 @@ restart:
account_system_vtime(current);
__local_bh_enable(SOFTIRQ_OFFSET);
+ tsk_restore_flags(current, pflags, PF_MEMALLOC);
}
#ifndef __ARCH_HAS_DO_SOFTIRQ
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 91a762d..b462585 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -2175,7 +2175,10 @@ gfp_to_alloc_flags(gfp_t gfp_mask)
if (likely(!(gfp_mask & __GFP_NOMEMALLOC))) {
if (gfp_mask & __GFP_MEMALLOC)
alloc_flags |= ALLOC_NO_WATERMARKS;
- else if (likely(!(gfp_mask & __GFP_NOMEMALLOC)) && !in_interrupt())
+ else if (!in_irq() && (current->flags & PF_MEMALLOC))
+ alloc_flags |= ALLOC_NO_WATERMARKS;
+ else if (!in_interrupt() &&
+ unlikely(test_thread_flag(TIF_MEMDIE)))
alloc_flags |= ALLOC_NO_WATERMARKS;
}
--
1.7.3.4
Allocations of pages below the min watermark run a risk of the
machine hanging due to a lack of memory. To prevent this, only
callers who have PF_MEMALLOC or TIF_MEMDIE set and are not processing
an interrupt are allowed to allocate with ALLOC_NO_WATERMARKS. Once
they are allocated to a slab though, nothing prevents other callers
consuming free objects within those slabs. This patch limits access
to slab pages that were allocated from the PFMEMALLOC reserves.
Pages allocated from the reserve are returned with page->pfmemalloc
set and it is up to the caller to determine how the page should be
protected. SLAB restricts access to any page with page->pfmemalloc set
to callers which are known to be able to access the PFMEMALLOC reserve. If
one is not available, an attempt is made to allocate a new page rather
than use a reserve. SLUB is a bit more relaxed in that it only records
if the current per-CPU page was allocated from PFMEMALLOC reserve and
uses another partial slab if the caller does not have the necessary
GFP or process flags. This was found to be sufficient in tests to
avoid hangs due to SLUB generally maintaining smaller lists than SLAB.
In low-memory conditions it does mean that !PFMEMALLOC allocators
can fail a slab allocation even though free objects are available
because they are being preserved for callers that are freeing pages.
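As a rough illustration of the rule being enforced (this is not the code
added by the patch; may_use_obj() is a made-up helper name and the real
logic lives in ac_get_obj() for SLAB and pfmemalloc_match() for SLUB below):

	static bool may_use_obj(struct page *slab_page, gfp_t flags)
	{
		/* Objects from ordinary slab pages are unrestricted */
		if (!slab_page->pfmemalloc)
			return true;

		/* Reserve-backed objects only go to reserve-entitled callers */
		return gfp_pfmemalloc_allowed(flags);
	}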
[[email protected]: Original implementation]
Signed-off-by: Mel Gorman <[email protected]>
---
include/linux/mm_types.h | 9 ++
include/linux/slub_def.h | 1 +
mm/internal.h | 3 +
mm/page_alloc.c | 27 +++++-
mm/slab.c | 211 +++++++++++++++++++++++++++++++++++++++-------
mm/slub.c | 36 +++++++--
6 files changed, 244 insertions(+), 43 deletions(-)
diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index 3cc3062..56a465f 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -53,6 +53,15 @@ struct page {
union {
pgoff_t index; /* Our offset within mapping. */
void *freelist; /* slub first free object */
+ bool pfmemalloc; /* If set by the page allocator,
+ * ALLOC_PFMEMALLOC was set
+ * and the low watermark was not
+ * met implying that the system
+ * is under some pressure. The
+ * caller should try ensure
+ * this page is only used to
+ * free other pages.
+ */
};
union {
diff --git a/include/linux/slub_def.h b/include/linux/slub_def.h
index a32bcfd..1d9ae40 100644
--- a/include/linux/slub_def.h
+++ b/include/linux/slub_def.h
@@ -46,6 +46,7 @@ struct kmem_cache_cpu {
struct page *page; /* The slab from which we are allocating */
struct page *partial; /* Partially allocated frozen slabs */
int node; /* The node of the page (or -1 for debug) */
+ bool pfmemalloc; /* Slab page had pfmemalloc set */
#ifdef CONFIG_SLUB_STATS
unsigned stat[NR_SLUB_STAT_ITEMS];
#endif
diff --git a/mm/internal.h b/mm/internal.h
index 2189af4..bff60d8 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -239,6 +239,9 @@ static inline struct page *mem_map_next(struct page *iter,
#define __paginginit __init
#endif
+/* Returns true if the gfp_mask allows use of ALLOC_NO_WATERMARK */
+bool gfp_pfmemalloc_allowed(gfp_t gfp_mask);
+
/* Memory initialisation debug and verification */
enum mminit_level {
MMINIT_WARNING,
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 8b3b8cf..6a3fa1c 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -695,6 +695,7 @@ static bool free_pages_prepare(struct page *page, unsigned int order)
trace_mm_page_free(page, order);
kmemcheck_free_shadow(page, order);
+ page->pfmemalloc = false;
if (PageAnon(page))
page->mapping = NULL;
for (i = 0; i < (1 << order); i++)
@@ -1221,6 +1222,7 @@ void free_hot_cold_page(struct page *page, int cold)
migratetype = get_pageblock_migratetype(page);
set_page_private(page, migratetype);
+ page->pfmemalloc = false;
local_irq_save(flags);
if (unlikely(wasMlocked))
free_page_mlock(page);
@@ -1427,6 +1429,7 @@ failed:
#define ALLOC_HARDER 0x10 /* try to alloc harder */
#define ALLOC_HIGH 0x20 /* __GFP_HIGH set */
#define ALLOC_CPUSET 0x40 /* check for correct cpuset */
+#define ALLOC_PFMEMALLOC 0x80 /* Caller has PF_MEMALLOC set */
#ifdef CONFIG_FAIL_PAGE_ALLOC
@@ -2170,16 +2173,22 @@ gfp_to_alloc_flags(gfp_t gfp_mask)
} else if (unlikely(rt_task(current)) && !in_interrupt())
alloc_flags |= ALLOC_HARDER;
- if (likely(!(gfp_mask & __GFP_NOMEMALLOC))) {
- if (!in_interrupt() &&
- ((current->flags & PF_MEMALLOC) ||
- unlikely(test_thread_flag(TIF_MEMDIE))))
+ if ((current->flags & PF_MEMALLOC) ||
+ unlikely(test_thread_flag(TIF_MEMDIE))) {
+ alloc_flags |= ALLOC_PFMEMALLOC;
+
+ if (likely(!(gfp_mask & __GFP_NOMEMALLOC)) && !in_interrupt())
alloc_flags |= ALLOC_NO_WATERMARKS;
}
return alloc_flags;
}
+bool gfp_pfmemalloc_allowed(gfp_t gfp_mask)
+{
+ return !!(gfp_to_alloc_flags(gfp_mask) & ALLOC_PFMEMALLOC);
+}
+
static inline struct page *
__alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
struct zonelist *zonelist, enum zone_type high_zoneidx,
@@ -2365,8 +2374,16 @@ nopage:
got_pg:
if (kmemcheck_enabled)
kmemcheck_pagealloc_alloc(page, order, gfp_mask);
- return page;
+ /*
+ * page->pfmemalloc is set when the caller had PFMEMALLOC set or is
+ * been OOM killed. The expectation is that the caller is taking
+ * steps that will free more memory. The caller should avoid the
+ * page being used for !PFMEMALLOC purposes.
+ */
+ page->pfmemalloc = !!(alloc_flags & ALLOC_PFMEMALLOC);
+
+ return page;
}
/*
diff --git a/mm/slab.c b/mm/slab.c
index f0bd785..e6a1e3d 100644
--- a/mm/slab.c
+++ b/mm/slab.c
@@ -123,6 +123,8 @@
#include <trace/events/kmem.h>
+#include "internal.h"
+
/*
* DEBUG - 1 for kmem_cache_create() to honour; SLAB_RED_ZONE & SLAB_POISON.
* 0 for faster, smaller code (especially in the critical paths).
@@ -229,6 +231,7 @@ struct slab {
unsigned int inuse; /* num of objs active in slab */
kmem_bufctl_t free;
unsigned short nodeid;
+ bool pfmemalloc; /* Slab had pfmemalloc set */
};
struct slab_rcu __slab_cover_slab_rcu;
};
@@ -250,15 +253,37 @@ struct array_cache {
unsigned int avail;
unsigned int limit;
unsigned int batchcount;
- unsigned int touched;
+ bool touched;
+ bool pfmemalloc;
spinlock_t lock;
void *entry[]; /*
* Must have this definition in here for the proper
* alignment of array_cache. Also simplifies accessing
* the entries.
+ *
+ * Entries should not be directly dereferenced as
+ * entries belonging to slabs marked pfmemalloc will
+ * have the lower bits set SLAB_OBJ_PFMEMALLOC
*/
};
+#define SLAB_OBJ_PFMEMALLOC 1
+static inline bool is_obj_pfmemalloc(void *objp)
+{
+ return (unsigned long)objp & SLAB_OBJ_PFMEMALLOC;
+}
+
+static inline void set_obj_pfmemalloc(void **objp)
+{
+ *objp = (void *)((unsigned long)*objp | SLAB_OBJ_PFMEMALLOC);
+ return;
+}
+
+static inline void clear_obj_pfmemalloc(void **objp)
+{
+ *objp = (void *)((unsigned long)*objp & ~SLAB_OBJ_PFMEMALLOC);
+}
+
/*
* bootstrap: The caches do not work without cpuarrays anymore, but the
* cpuarrays are allocated from the generic caches...
@@ -945,12 +970,100 @@ static struct array_cache *alloc_arraycache(int node, int entries,
nc->avail = 0;
nc->limit = entries;
nc->batchcount = batchcount;
- nc->touched = 0;
+ nc->touched = false;
spin_lock_init(&nc->lock);
}
return nc;
}
+/* Clears ac->pfmemalloc if no slabs have pfmalloc set */
+static void check_ac_pfmemalloc(struct kmem_cache *cachep,
+ struct array_cache *ac)
+{
+ struct kmem_list3 *l3 = cachep->nodelists[numa_mem_id()];
+ struct slab *slabp;
+
+ if (!ac->pfmemalloc)
+ return;
+
+ list_for_each_entry(slabp, &l3->slabs_full, list)
+ if (slabp->pfmemalloc)
+ return;
+
+ list_for_each_entry(slabp, &l3->slabs_partial, list)
+ if (slabp->pfmemalloc)
+ return;
+
+ list_for_each_entry(slabp, &l3->slabs_free, list)
+ if (slabp->pfmemalloc)
+ return;
+
+ ac->pfmemalloc = false;
+}
+
+static void *ac_get_obj(struct kmem_cache *cachep, struct array_cache *ac,
+ gfp_t flags, bool force_refill)
+{
+ int i;
+ void *objp = ac->entry[--ac->avail];
+
+ /* Ensure the caller is allowed to use objects from PFMEMALLOC slab */
+ if (unlikely(is_obj_pfmemalloc(objp))) {
+ struct kmem_list3 *l3;
+
+ if (gfp_pfmemalloc_allowed(flags)) {
+ clear_obj_pfmemalloc(&objp);
+ return objp;
+ }
+
+ /* The caller cannot use PFMEMALLOC objects, find another one */
+ for (i = 1; i < ac->avail; i++) {
+ /* If a !PFMEMALLOC object is found, swap them */
+ if (!is_obj_pfmemalloc(ac->entry[i])) {
+ objp = ac->entry[i];
+ ac->entry[i] = ac->entry[ac->avail];
+ ac->entry[ac->avail] = objp;
+ return objp;
+ }
+ }
+
+ /*
+ * If there are empty slabs on the slabs_free list and we are
+ * being forced to refill the cache, mark this one !pfmemalloc.
+ */
+ l3 = cachep->nodelists[numa_mem_id()];
+ if (!list_empty(&l3->slabs_free) && force_refill) {
+ struct slab *slabp = virt_to_slab(objp);
+ slabp->pfmemalloc = false;
+ clear_obj_pfmemalloc(&objp);
+ check_ac_pfmemalloc(cachep, ac);
+ return objp;
+ }
+
+ /* No !PFMEMALLOC objects available */
+ ac->avail++;
+ objp = NULL;
+ }
+
+ return objp;
+}
+
+static void ac_put_obj(struct kmem_cache *cachep, struct array_cache *ac,
+ void *objp)
+{
+ struct slab *slabp;
+
+ /* If there are pfmemalloc slabs, check if the object is part of one */
+ if (unlikely(ac->pfmemalloc)) {
+ slabp = virt_to_slab(objp);
+
+ if (slabp->pfmemalloc)
+ set_obj_pfmemalloc(&objp);
+ }
+
+ ac->entry[ac->avail++] = objp;
+}
+
/*
* Transfer objects in one arraycache to another.
* Locking must be handled by the caller.
@@ -1127,7 +1240,7 @@ static inline int cache_free_alien(struct kmem_cache *cachep, void *objp)
STATS_INC_ACOVERFLOW(cachep);
__drain_alien_cache(cachep, alien, nodeid);
}
- alien->entry[alien->avail++] = objp;
+ ac_put_obj(cachep, alien, objp);
spin_unlock(&alien->lock);
} else {
spin_lock(&(cachep->nodelists[nodeid])->list_lock);
@@ -1738,7 +1851,8 @@ __initcall(cpucache_init);
* did not request dmaable memory, we might get it, but that
* would be relatively rare and ignorable.
*/
-static void *kmem_getpages(struct kmem_cache *cachep, gfp_t flags, int nodeid)
+static void *kmem_getpages(struct kmem_cache *cachep, gfp_t flags, int nodeid,
+ bool *pfmemalloc)
{
struct page *page;
int nr_pages;
@@ -1759,6 +1873,7 @@ static void *kmem_getpages(struct kmem_cache *cachep, gfp_t flags, int nodeid)
page = alloc_pages_exact_node(nodeid, flags | __GFP_NOTRACK, cachep->gfporder);
if (!page)
return NULL;
+ *pfmemalloc = page->pfmemalloc;
nr_pages = (1 << cachep->gfporder);
if (cachep->flags & SLAB_RECLAIM_ACCOUNT)
@@ -2191,7 +2306,7 @@ static int __init_refok setup_cpu_cache(struct kmem_cache *cachep, gfp_t gfp)
cpu_cache_get(cachep)->avail = 0;
cpu_cache_get(cachep)->limit = BOOT_CPUCACHE_ENTRIES;
cpu_cache_get(cachep)->batchcount = 1;
- cpu_cache_get(cachep)->touched = 0;
+ cpu_cache_get(cachep)->touched = false;
cachep->batchcount = 1;
cachep->limit = BOOT_CPUCACHE_ENTRIES;
return 0;
@@ -2749,6 +2864,7 @@ static struct slab *alloc_slabmgmt(struct kmem_cache *cachep, void *objp,
slabp->s_mem = objp + colour_off;
slabp->nodeid = nodeid;
slabp->free = 0;
+ slabp->pfmemalloc = false;
return slabp;
}
@@ -2880,7 +2996,7 @@ static void slab_map_pages(struct kmem_cache *cache, struct slab *slab,
* kmem_cache_alloc() when there are no active objs left in a cache.
*/
static int cache_grow(struct kmem_cache *cachep,
- gfp_t flags, int nodeid, void *objp)
+ gfp_t flags, int nodeid, void *objp, bool pfmemalloc)
{
struct slab *slabp;
size_t offset;
@@ -2924,7 +3040,7 @@ static int cache_grow(struct kmem_cache *cachep,
* 'nodeid'.
*/
if (!objp)
- objp = kmem_getpages(cachep, local_flags, nodeid);
+ objp = kmem_getpages(cachep, local_flags, nodeid, &pfmemalloc);
if (!objp)
goto failed;
@@ -2934,6 +3050,13 @@ static int cache_grow(struct kmem_cache *cachep,
if (!slabp)
goto opps1;
+ /* Record if ALLOC_PFMEMALLOC was set when allocating the slab */
+ if (pfmemalloc) {
+ struct array_cache *ac = cpu_cache_get(cachep);
+ slabp->pfmemalloc = true;
+ ac->pfmemalloc = true;
+ }
+
slab_map_pages(cachep, slabp, objp);
cache_init_objs(cachep, slabp);
@@ -3071,16 +3194,19 @@ bad:
#define check_slabp(x,y) do { } while(0)
#endif
-static void *cache_alloc_refill(struct kmem_cache *cachep, gfp_t flags)
+static void *cache_alloc_refill(struct kmem_cache *cachep, gfp_t flags,
+ bool force_refill)
{
int batchcount;
struct kmem_list3 *l3;
struct array_cache *ac;
int node;
-retry:
check_irq_off();
node = numa_mem_id();
+ if (unlikely(force_refill))
+ goto force_grow;
+retry:
ac = cpu_cache_get(cachep);
batchcount = ac->batchcount;
if (!ac->touched && batchcount > BATCHREFILL_LIMIT) {
@@ -3098,7 +3224,7 @@ retry:
/* See if we can refill from the shared array */
if (l3->shared && transfer_objects(ac, l3->shared, batchcount)) {
- l3->shared->touched = 1;
+ l3->shared->touched = true;
goto alloc_done;
}
@@ -3130,8 +3256,8 @@ retry:
STATS_INC_ACTIVE(cachep);
STATS_SET_HIGH(cachep);
- ac->entry[ac->avail++] = slab_get_obj(cachep, slabp,
- node);
+ ac_put_obj(cachep, ac, slab_get_obj(cachep, slabp,
+ node));
}
check_slabp(cachep, slabp);
@@ -3150,18 +3276,22 @@ alloc_done:
if (unlikely(!ac->avail)) {
int x;
- x = cache_grow(cachep, flags | GFP_THISNODE, node, NULL);
+force_grow:
+ x = cache_grow(cachep, flags | GFP_THISNODE, node, NULL, false);
/* cache_grow can reenable interrupts, then ac could change. */
ac = cpu_cache_get(cachep);
- if (!x && ac->avail == 0) /* no objects in sight? abort */
+
+ /* no objects in sight? abort */
+ if (!x && (ac->avail == 0 || force_refill))
return NULL;
if (!ac->avail) /* objects refilled by interrupt? */
goto retry;
}
- ac->touched = 1;
- return ac->entry[--ac->avail];
+ ac->touched = true;
+
+ return ac_get_obj(cachep, ac, flags, force_refill);
}
static inline void cache_alloc_debugcheck_before(struct kmem_cache *cachep,
@@ -3243,23 +3373,35 @@ static inline void *____cache_alloc(struct kmem_cache *cachep, gfp_t flags)
{
void *objp;
struct array_cache *ac;
+ bool force_refill = false;
check_irq_off();
ac = cpu_cache_get(cachep);
if (likely(ac->avail)) {
- STATS_INC_ALLOCHIT(cachep);
- ac->touched = 1;
- objp = ac->entry[--ac->avail];
- } else {
- STATS_INC_ALLOCMISS(cachep);
- objp = cache_alloc_refill(cachep, flags);
+ ac->touched = true;
+ objp = ac_get_obj(cachep, ac, flags, false);
+
/*
- * the 'ac' may be updated by cache_alloc_refill(),
- * and kmemleak_erase() requires its correct value.
+ * Allow for the possibility all avail objects are not allowed
+ * by the current flags
*/
- ac = cpu_cache_get(cachep);
+ if (objp) {
+ STATS_INC_ALLOCHIT(cachep);
+ goto out;
+ }
+ force_refill = true;
}
+
+ STATS_INC_ALLOCMISS(cachep);
+ objp = cache_alloc_refill(cachep, flags, force_refill);
+ /*
+ * the 'ac' may be updated by cache_alloc_refill(),
+ * and kmemleak_erase() requires its correct value.
+ */
+ ac = cpu_cache_get(cachep);
+
+out:
/*
* To avoid a false negative, if an object that is in one of the
* per-CPU caches is leaked, we need to make sure kmemleak doesn't
@@ -3312,6 +3454,7 @@ static void *fallback_alloc(struct kmem_cache *cache, gfp_t flags)
enum zone_type high_zoneidx = gfp_zone(flags);
void *obj = NULL;
int nid;
+ bool pfmemalloc;
if (flags & __GFP_THISNODE)
return NULL;
@@ -3348,7 +3491,8 @@ retry:
if (local_flags & __GFP_WAIT)
local_irq_enable();
kmem_flagcheck(cache, flags);
- obj = kmem_getpages(cache, local_flags, numa_mem_id());
+ obj = kmem_getpages(cache, local_flags, numa_mem_id(),
+ &pfmemalloc);
if (local_flags & __GFP_WAIT)
local_irq_disable();
if (obj) {
@@ -3356,7 +3500,7 @@ retry:
* Insert into the appropriate per node queues
*/
nid = page_to_nid(virt_to_page(obj));
- if (cache_grow(cache, flags, nid, obj)) {
+ if (cache_grow(cache, flags, nid, obj, pfmemalloc)) {
obj = ____cache_alloc_node(cache,
flags | GFP_THISNODE, nid);
if (!obj)
@@ -3428,7 +3572,7 @@ retry:
must_grow:
spin_unlock(&l3->list_lock);
- x = cache_grow(cachep, flags | GFP_THISNODE, nodeid, NULL);
+ x = cache_grow(cachep, flags | GFP_THISNODE, nodeid, NULL, false);
if (x)
goto retry;
@@ -3578,9 +3722,12 @@ static void free_block(struct kmem_cache *cachep, void **objpp, int nr_objects,
struct kmem_list3 *l3;
for (i = 0; i < nr_objects; i++) {
- void *objp = objpp[i];
+ void *objp;
struct slab *slabp;
+ clear_obj_pfmemalloc(&objpp[i]);
+ objp = objpp[i];
+
slabp = virt_to_slab(objp);
l3 = cachep->nodelists[node];
list_del(&slabp->list);
@@ -3693,12 +3840,12 @@ static inline void __cache_free(struct kmem_cache *cachep, void *objp,
if (likely(ac->avail < ac->limit)) {
STATS_INC_FREEHIT(cachep);
- ac->entry[ac->avail++] = objp;
+ ac_put_obj(cachep, ac, objp);
return;
} else {
STATS_INC_FREEMISS(cachep);
cache_flusharray(cachep, ac);
- ac->entry[ac->avail++] = objp;
+ ac_put_obj(cachep, ac, objp);
}
}
@@ -4125,7 +4272,7 @@ static void drain_array(struct kmem_cache *cachep, struct kmem_list3 *l3,
if (!ac || !ac->avail)
return;
if (ac->touched && !force) {
- ac->touched = 0;
+ ac->touched = false;
} else {
spin_lock_irq(&l3->list_lock);
if (ac->avail) {
diff --git a/mm/slub.c b/mm/slub.c
index 4907563..ea04994 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -32,6 +32,8 @@
#include <trace/events/kmem.h>
+#include "internal.h"
+
/*
* Lock order:
* 1. slub_lock (Global Semaphore)
@@ -1347,7 +1349,8 @@ static void setup_object(struct kmem_cache *s, struct page *page,
s->ctor(object);
}
-static struct page *new_slab(struct kmem_cache *s, gfp_t flags, int node)
+static struct page *new_slab(struct kmem_cache *s, gfp_t flags, int node,
+ bool *pfmemalloc)
{
struct page *page;
void *start;
@@ -1362,6 +1365,7 @@ static struct page *new_slab(struct kmem_cache *s, gfp_t flags, int node)
goto out;
inc_slabs_node(s, page_to_nid(page), page->objects);
+ *pfmemalloc = page->pfmemalloc;
page->slab = s;
page->flags |= 1 << PG_slab;
@@ -2104,7 +2108,8 @@ static inline void *new_slab_objects(struct kmem_cache *s, gfp_t flags,
{
void *object;
struct kmem_cache_cpu *c;
- struct page *page = new_slab(s, flags, node);
+ bool pfmemalloc;
+ struct page *page = new_slab(s, flags, node, &pfmemalloc);
if (page) {
c = __this_cpu_ptr(s->cpu_slab);
@@ -2121,6 +2126,7 @@ static inline void *new_slab_objects(struct kmem_cache *s, gfp_t flags,
stat(s, ALLOC_SLAB);
c->node = page_to_nid(page);
c->page = page;
+ c->pfmemalloc = pfmemalloc;
*pc = c;
} else
object = NULL;
@@ -2128,6 +2134,14 @@ static inline void *new_slab_objects(struct kmem_cache *s, gfp_t flags,
return object;
}
+static inline bool pfmemalloc_match(struct kmem_cache_cpu *c, gfp_t gfpflags)
+{
+ if (unlikely(c->pfmemalloc))
+ return gfp_pfmemalloc_allowed(gfpflags);
+
+ return true;
+}
+
/*
* Check the page->freelist of a page and either transfer the freelist to the per cpu freelist
* or deactivate the page.
@@ -2200,6 +2214,16 @@ redo:
goto new_slab;
}
+ /*
+ * By rights, we should be searching for a slab page that was
+ * PFMEMALLOC but right now, we are losing the pfmemalloc
+ * information when the page leaves the per-cpu allocator
+ */
+ if (unlikely(!pfmemalloc_match(c, gfpflags))) {
+ deactivate_slab(s, c);
+ goto new_slab;
+ }
+
/* must check again c->freelist in case of cpu migration or IRQ */
object = c->freelist;
if (object)
@@ -2236,7 +2260,6 @@ new_slab:
/* Then do expensive stuff like retrieving pages from the partial lists */
object = get_partial(s, gfpflags, node, c);
-
if (unlikely(!object)) {
object = new_slab_objects(s, gfpflags, node, &c);
@@ -2304,8 +2327,8 @@ redo:
barrier();
object = c->freelist;
- if (unlikely(!object || !node_match(c, node)))
-
+ if (unlikely(!object || !node_match(c, node) ||
+ !pfmemalloc_match(c, gfpflags)))
object = __slab_alloc(s, gfpflags, node, addr, c);
else {
@@ -2784,10 +2807,11 @@ static void early_kmem_cache_node_alloc(int node)
{
struct page *page;
struct kmem_cache_node *n;
+ bool pfmemalloc; /* Ignore this early in boot */
BUG_ON(kmem_cache_node->size < sizeof(struct kmem_cache_node));
- page = new_slab(kmem_cache_node, GFP_NOWAIT, node);
+ page = new_slab(kmem_cache_node, GFP_NOWAIT, node, &pfmemalloc);
BUG_ON(!page);
if (page_to_nid(page) != node) {
--
1.7.3.4
On Tue, Feb 7, 2012 at 6:56 AM, Mel Gorman <[email protected]> wrote:
>
> The core issue is that network block devices do not use mempools like normal
> block devices do. As the host cannot control where they receive packets from,
> they cannot reliably work out in advance how much memory they might need.
>
>
> Patch 1 serialises access to min_free_kbytes. It's not strictly needed
>        by this series but as the series cares about watermarks in
>        general, it's a harmless fix. It could be merged independently.
>
>
Any light shed on tuning min_free_kbytes for every day work?
> Patch 2 adds knowledge of the PFMEMALLOC reserves to SLAB and SLUB to
>        preserve access to pages allocated under low memory situations
>        to callers that are freeing memory.
>
> Patch 3 introduces __GFP_MEMALLOC to allow access to the PFMEMALLOC
>        reserves without setting PFMEMALLOC.
>
> Patch 4 opens the possibility for softirqs to use PFMEMALLOC reserves
>        for later use by network packet processing.
>
> Patch 5 ignores memory policies when ALLOC_NO_WATERMARKS is set.
>
> Patches 6-11 allows network processing to use PFMEMALLOC reserves when
>        the socket has been marked as being used by the VM to clean
>        pages. If packets are received and stored in pages that were
>        allocated under low-memory situations and are unrelated to
>        the VM, the packets are dropped.
>
> Patch 12 is a micro-optimisation to avoid a function call in the
>        common case.
>
> Patch 13 tags NBD sockets as being SOCK_MEMALLOC so they can use
>        PFMEMALLOC if necessary.
>
If it is feasible to bypass hang by tuning min_mem_kbytes, things may
become simpler if NICs are also tagged. Sock buffers, pre-allocated if
necessary just after NICs are turned on, are not handed back to kmem
cache but queued on local lists which are maintained by the NIC driver, based
on the info of min_mem_kbytes or similar, for tagged NICs.
Upside is no changes in VM core. Downsides?
> Patch 14 notes that it is still possible for the PFMEMALLOC reserve
>        to be depleted. To prevent this, direct reclaimers get
>        throttled on a waitqueue if 50% of the PFMEMALLOC reserves are
>        depleted. It is expected that kswapd and the direct reclaimers
>        already running will clean enough pages for the low watermark
>        to be reached and the throttled processes are woken up.
>
> Patch 15 adds a statistic to track how often processes get throttled
>
>
> For testing swap-over-NBD, a machine was booted with 2G of RAM with a
> swapfile backed by NBD. 8*NUM_CPU processes were started that create
> anonymous memory mappings and read them linearly in a loop. The total
> size of the mappings were 4*PHYSICAL_MEMORY to use swap heavily under
> memory pressure. Without the patches, the machine locks up within
> minutes and runs to completion with them applied.
>
>
While testing, what happens if the network cable is unplugged for over
three minutes?
Thanks
Hillf
On Tue, Feb 07, 2012 at 08:45:18PM +0800, Hillf Danton wrote:
> On Tue, Feb 7, 2012 at 6:56 AM, Mel Gorman <[email protected]> wrote:
> >
> > The core issue is that network block devices do not use mempools like normal
> > block devices do. As the host cannot control where they receive packets from,
> > they cannot reliably work out in advance how much memory they might need.
> >
> >
> > Patch 1 serialises access to min_free_kbytes. It's not strictly needed
> >        by this series but as the series cares about watermarks in
> >        general, it's a harmless fix. It could be merged independently.
> >
> >
> Any light shed on tuning min_free_kbytes for every day work?
>
For every day work, leave min_free_kbytes as the default.
>
> > Patch 2 adds knowledge of the PFMEMALLOC reserves to SLAB and SLUB to
> >        preserve access to pages allocated under low memory situations
> >        to callers that are freeing memory.
> >
> > Patch 3 introduces __GFP_MEMALLOC to allow access to the PFMEMALLOC
> >        reserves without setting PFMEMALLOC.
> >
> > Patch 4 opens the possibility for softirqs to use PFMEMALLOC reserves
> >        for later use by network packet processing.
> >
> > Patch 5 ignores memory policies when ALLOC_NO_WATERMARKS is set.
> >
> > Patches 6-11 allows network processing to use PFMEMALLOC reserves when
> >        the socket has been marked as being used by the VM to clean
> >        pages. If packets are received and stored in pages that were
> >        allocated under low-memory situations and are unrelated to
> >        the VM, the packets are dropped.
> >
> > Patch 12 is a micro-optimisation to avoid a function call in the
> >        common case.
> >
> > Patch 13 tags NBD sockets as being SOCK_MEMALLOC so they can use
> >        PFMEMALLOC if necessary.
> >
>
> If it is feasible to bypass hang by tuning min_mem_kbytes,
No. Increasing or decreasing min_free_kbytes changes the timing but it
will still hang.
> things may
> become simpler if NICs are also tagged.
That would mean making changes to every driver and they do not necessarily
know what higher level protocol like TCP they are transmitting. How is
that simpler? What is the benefit?
> Sock buffers, pre-allocated if
> necessary just after NICs are turned on, are not handed back to kmem
> cache but queued on local lists which are maintained by the NIC driver, based
> on the info of min_mem_kbytes or similar, for tagged NICs.
I think you are referring to doing something like SKB recycling within
the driver.
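For the record, what I understand that to mean is roughly the following
driver-local pattern (all names here are hypothetical, sketched purely to
illustrate the proposal, not code from this series or any driver):

	/* Hypothetical per-ring recycle list, illustration only */
	struct sk_buff *skb = __skb_dequeue(&ring->recycle_list);
	if (!skb)
		skb = netdev_alloc_skb(netdev, bufsz);	/* normal allocation */

	/* ... and on completion, instead of freeing unconditionally ... */
	if (skb_queue_len(&ring->recycle_list) < RECYCLE_DEPTH)
		__skb_queue_head(&ring->recycle_list, skb);
	else
		dev_kfree_skb(skb);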
> Upside is no changes in VM core. Downsides?
>
That would indeed require driver-specific changes and new core
infrastructure to deal with SKB recycling, spreading the complexity over a
wider range of code. If all the SKBs are in use for SOCK_MEMALLOC purposes
for whatever reason and more cannot be allocated, it will still hang. So
the downsides are that it would be equally complex as this approach, if not
more so, and that it may still hang.
> > Patch 14 notes that it is still possible for the PFMEMALLOC reserve
> >        to be depleted. To prevent this, direct reclaimers get
> >        throttled on a waitqueue if 50% of the PFMEMALLOC reserves are
> >        depleted. It is expected that kswapd and the direct reclaimers
> >        already running will clean enough pages for the low watermark
> >        to be reached and the throttled processes are woken up.
> >
> > Patch 15 adds a statistic to track how often processes get throttled
> >
> >
> > For testing swap-over-NBD, a machine was booted with 2G of RAM with a
> > swapfile backed by NBD. 8*NUM_CPU processes were started that create
> > anonymous memory mappings and read them linearly in a loop. The total
> > size of the mappings were 4*PHYSICAL_MEMORY to use swap heavily under
> > memory pressure. Without the patches, the machine locks up within
> > minutes and runs to completion with them applied.
> >
> >
>
> While testing, what happens if the network cable is unplugged for over
> three minutes?
>
I didn't test the scenario and I don't have a test machine available
right now to try but it is up to the userspace NBD client to manage the
reconnection. It is also up to the admin to prevent the NBD client being
killed by something like the OOM killer and to have it mlocked to avoid
the NBD client itself being swapped. NFS is able to handle this in-kernel
but NBD may be more fragile.
--
Mel Gorman
SUSE Labs
On Mon, 6 Feb 2012, Mel Gorman wrote:
> Pages allocated from the reserve are returned with page->pfmemalloc
> set and it is up to the caller to determine how the page should be
> protected. SLAB restricts access to any page with page->pfmemalloc set
pfmemalloc sounds like a page flag. If you would use one then the
preservation of the flag by copying it elsewhere may not be necessary and
the patches would be less invasive. Also you would not need to extend
and modify many of the structures.
On 02/06/2012 02:56 PM, Mel Gorman wrote:
> diff --git a/drivers/net/ethernet/intel/igb/igb_main.c b/drivers/net/ethernet/intel/igb/igb_main.c
> index e91d73c..c062909 100644
> --- a/drivers/net/ethernet/intel/igb/igb_main.c
> +++ b/drivers/net/ethernet/intel/igb/igb_main.c
> @@ -6187,7 +6187,7 @@ static bool igb_alloc_mapped_page(struct igb_ring *rx_ring,
> return true;
>
> if (!page) {
> - page = alloc_page(GFP_ATOMIC | __GFP_COLD);
> + page = __netdev_alloc_page(GFP_ATOMIC, bi->skb);
> bi->page = page;
> if (unlikely(!page)) {
> rx_ring->rx_stats.alloc_failed++;
This takes care of the case where we are allocating the page, but what
about if we are reusing the page? For this driver it might work better
to hold off on doing the association between the page and skb either
somewhere after the skb and the page have both been allocated, or in the
igb_clean_rx_irq path where we will have both the page and the data
accessible.
> diff --git a/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c b/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
> index 1ee5d0f..7a011c3 100644
> --- a/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
> +++ b/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
> @@ -1143,7 +1143,7 @@ void ixgbe_alloc_rx_buffers(struct ixgbe_ring *rx_ring, u16 cleaned_count)
>
> if (ring_is_ps_enabled(rx_ring)) {
> if (!bi->page) {
> - bi->page = alloc_page(GFP_ATOMIC | __GFP_COLD);
> + bi->page = __netdev_alloc_page(GFP_ATOMIC, skb);
> if (!bi->page) {
> rx_ring->rx_stats.alloc_rx_page_failed++;
> goto no_buffers;
Same thing for this driver.
> diff --git a/drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c b/drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c
> index bed411b..f6ea14a 100644
> --- a/drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c
> +++ b/drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c
> @@ -366,7 +366,7 @@ static void ixgbevf_alloc_rx_buffers(struct ixgbevf_adapter *adapter,
> if (!bi->page_dma &&
> (adapter->flags & IXGBE_FLAG_RX_PS_ENABLED)) {
> if (!bi->page) {
> - bi->page = alloc_page(GFP_ATOMIC | __GFP_COLD);
> + bi->page = __netdev_alloc_page(GFP_ATOMIC, NULL);
> if (!bi->page) {
> adapter->alloc_rx_page_failed++;
> goto no_buffers;
> @@ -400,6 +400,7 @@ static void ixgbevf_alloc_rx_buffers(struct ixgbevf_adapter *adapter,
> */
> skb_reserve(skb, NET_IP_ALIGN);
>
> + propagate_pfmemalloc_skb(bi->page_dma, skb);
> bi->skb = skb;
> }
> if (!bi->dma) {
I am pretty sure this is incorrect. I believe you want bi->page, not
bi->page_dma. This one is closer though to what I had in mind for igb
and ixgbe in terms of making it so there is only one location that
generates the association.
Also, similar changes would be needed for the igbvf, e1000, and e1000e
drivers in the Intel tree.
> diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
> index 17ed022..8da4ca0 100644
> --- a/include/linux/skbuff.h
> +++ b/include/linux/skbuff.h
> @@ -1696,6 +1696,44 @@ static inline struct sk_buff *netdev_alloc_skb_ip_align(struct net_device *dev,
> }
>
> /**
> + * __netdev_alloc_page - allocate a page for ps-rx on a specific device
> + * @gfp_mask: alloc_pages_node mask. Set __GFP_NOMEMALLOC if not for network packet RX
> + * @skb: skb to set pfmemalloc on if __GFP_MEMALLOC is used
> + *
> + * Allocate a new page. dev currently unused.
> + *
> + * %NULL is returned if there is no free memory.
> + */
> +static inline struct page *__netdev_alloc_page(gfp_t gfp_mask,
> + struct sk_buff *skb)
> +{
> + struct page *page;
> +
> + gfp_mask |= __GFP_COLD;
> +
> + if (!(gfp_mask & __GFP_NOMEMALLOC))
> + gfp_mask |= __GFP_MEMALLOC;
> +
> + page = alloc_pages_node(NUMA_NO_NODE, gfp_mask, 0);
> + if (skb && page && page->pfmemalloc)
> + skb->pfmemalloc = true;
> +
> + return page;
> +}
> +
> +/**
> + * propagate_pfmemalloc_skb - Propagate pfmemalloc if skb is allocated after RX page
> + * @page: The page that was allocated from netdev_alloc_page
> + * @skb: The skb that may need pfmemalloc set
> + */
> +static inline void propagate_pfmemalloc_skb(struct page *page,
> + struct sk_buff *skb)
> +{
> + if (page && page->pfmemalloc)
> + skb->pfmemalloc = true;
> +}
> +
> +/**
> * skb_frag_page - retrieve the page refered to by a paged fragment
> * @frag: the paged fragment
> *
Is this function even really needed? It seems like you already have
this covered in your earlier patches, specifically 9/15, which takes
care of associating the skb and the page pfmemalloc flags when you use
skb_fill_page_desc. It would be useful to narrow things down so that we
are associating this either at the allocation time or at the
fill_page_desc call instead of doing it at both.
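To make that concrete, a single association point could look roughly like
this (purely illustrative; the helpers are the ones from the quoted patch
and the surrounding driver flow is simplified):

	/* allocation path: no skb exists yet, so do not pass one */
	bi->page = __netdev_alloc_page(GFP_ATOMIC, NULL);

	/* later, e.g. in the clean_rx_irq path once the skb is known */
	propagate_pfmemalloc_skb(bi->page, skb);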
Thanks,
Alex
On Tue, Feb 7, 2012 at 9:27 PM, Mel Gorman <[email protected]> wrote:
> On Tue, Feb 07, 2012 at 08:45:18PM +0800, Hillf Danton wrote:
>> If it is feasible to bypass hang by tuning min_mem_kbytes,
>
> No. Increasing or decreasing min_free_kbytes changes the timing but it
> will still hang.
>
>> things may
>> become simpler if NICs are also tagged.
>
> That would mean making changes to every driver and they do not necessarily
> know what higher level protocol like TCP they are transmitting. How is
> that simpler? What is the benefit?
>
The benefit is to avoid allocating sock buffers in softirq by recycling;
the changes in the VM core may then be smaller.
Thanks
Hillf
On Tue, Feb 07, 2012 at 10:27:56AM -0600, Christoph Lameter wrote:
> On Mon, 6 Feb 2012, Mel Gorman wrote:
>
> > Pages allocated from the reserve are returned with page->pfmemalloc
> > set and it is up to the caller to determine how the page should be
> > protected. SLAB restricts access to any page with page->pfmemalloc set
>
> pfmemalloc sounds like a page flag. If you would use one then the
> preservation of the flag by copying it elsewhere may not be necessary and
> the patches would be less invasive.
Using a page flag would simplify parts of the patch. The catch of course
is that it requires a page flag, and page flags are in tight supply; I do not
want to tie this to being 32-bit unnecessarily.
> Also you would not need to extend
> and modify many of the structures.
>
Let's see;
o struct page size would be unaffected
o struct kmem_cache_cpu could be left alone even though it's a small saving
o struct slab also be left alone
o struct array_cache could be left alone although I would point out that
it would make no difference in size as touched is changed to a bool to
fit pfmemalloc in
o It would still be necessary to do the object pointer tricks in slab.c
to avoid doing an excessive number of page lookups which is where much
of the complexity is
o The virt_to_slab could be replaced by looking up the page flag instead
and avoiding a level of indirection that would be pleasing
to an int and placed with struct kmem_cache
I agree that parts of the patch would be simpler although the
complexity of storing pfmemalloc within the obj pointer would probably
remain. However, the downside of requiring a page flag is very high. In
the event we increase the number of page flags - great, I'll use one but
right now I do not think the use of page flag is justified.
--
Mel Gorman
SUSE Labs
On Wed, 8 Feb 2012, Mel Gorman wrote:
> o struct kmem_cache_cpu could be left alone even though it's a small saving
Its multiplied by the number of caches and by the number of
processors.
> o struct slab also be left alone
> o struct array_cache could be left alone although I would point out that
> it would make no difference in size as touched is changed to a bool to
> fit pfmemalloc in
Both of these are performance critical structures in slab.
> o It would still be necessary to do the object pointer tricks in slab.c
These tricks are not done for slub. It seems that they are not necessary?
> remain. However, the downside of requiring a page flag is very high. In
> the event we increase the number of page flags - great, I'll use one but
> right now I do not think the use of page flag is justified.
On 64 bit I think there is not much of an issue with another page flag.
Also consider that the slab allocators do not make full use of the other
page flags. We could overload one of the existing flags. I removed
slubs use of them last year. PG_active could be overloaded I think.
On Tue, Feb 07, 2012 at 03:38:44PM -0800, Alexander Duyck wrote:
> On 02/06/2012 02:56 PM, Mel Gorman wrote:
> > diff --git a/drivers/net/ethernet/intel/igb/igb_main.c b/drivers/net/ethernet/intel/igb/igb_main.c
> > index e91d73c..c062909 100644
> > --- a/drivers/net/ethernet/intel/igb/igb_main.c
> > +++ b/drivers/net/ethernet/intel/igb/igb_main.c
> > @@ -6187,7 +6187,7 @@ static bool igb_alloc_mapped_page(struct igb_ring *rx_ring,
> > return true;
> >
> > if (!page) {
> > - page = alloc_page(GFP_ATOMIC | __GFP_COLD);
> > + page = __netdev_alloc_page(GFP_ATOMIC, bi->skb);
> > bi->page = page;
> > if (unlikely(!page)) {
> > rx_ring->rx_stats.alloc_failed++;
>
> This takes care of the case where we are allocating the page, but what
> about if we are reusing the page?
Then nothing... You're right, I did not consider that case.
> For this driver it might work better
> to hold off on doing the association between the page and skb either
> somewhere after the skb and the page have both been allocated, or in the
> igb_clean_rx_irq path where we will have both the page and the data
> accessible.
Again, from looking through the code you appear to be right. Thanks for
the suggestion!
> > diff --git a/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c b/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
> > index 1ee5d0f..7a011c3 100644
> > --- a/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
> > +++ b/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
> > @@ -1143,7 +1143,7 @@ void ixgbe_alloc_rx_buffers(struct ixgbe_ring *rx_ring, u16 cleaned_count)
> >
> > if (ring_is_ps_enabled(rx_ring)) {
> > if (!bi->page) {
> > - bi->page = alloc_page(GFP_ATOMIC | __GFP_COLD);
> > + bi->page = __netdev_alloc_page(GFP_ATOMIC, skb);
> > if (!bi->page) {
> > rx_ring->rx_stats.alloc_rx_page_failed++;
> > goto no_buffers;
>
> Same thing for this driver.
> > diff --git a/drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c b/drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c
> > index bed411b..f6ea14a 100644
> > --- a/drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c
> > +++ b/drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c
> > @@ -366,7 +366,7 @@ static void ixgbevf_alloc_rx_buffers(struct ixgbevf_adapter *adapter,
> > if (!bi->page_dma &&
> > (adapter->flags & IXGBE_FLAG_RX_PS_ENABLED)) {
> > if (!bi->page) {
> > - bi->page = alloc_page(GFP_ATOMIC | __GFP_COLD);
> > + bi->page = __netdev_alloc_page(GFP_ATOMIC, NULL);
> > if (!bi->page) {
> > adapter->alloc_rx_page_failed++;
> > goto no_buffers;
> > @@ -400,6 +400,7 @@ static void ixgbevf_alloc_rx_buffers(struct ixgbevf_adapter *adapter,
> > */
> > skb_reserve(skb, NET_IP_ALIGN);
> >
> > + propagate_pfmemalloc_skb(bi->page_dma, skb);
> > bi->skb = skb;
> > }
> > if (!bi->dma) {
>
> I am pretty sure this is incorrect. I believe you want bi->page, not
> bi->page_dma. This one is closer though to what I had in mind for igb
> and ixgbe in terms of making it so there is only one location that
> generates the association.
>
You are on a roll of being right.
> Also, similar changes would be needed for the igbvf, e1000, and e1000e
> drivers in the Intel tree.
>
> > diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
> > index 17ed022..8da4ca0 100644
> > --- a/include/linux/skbuff.h
> > +++ b/include/linux/skbuff.h
> > @@ -1696,6 +1696,44 @@ static inline struct sk_buff *netdev_alloc_skb_ip_align(struct net_device *dev,
> > }
> >
> > /**
> > + * __netdev_alloc_page - allocate a page for ps-rx on a specific device
> > + * @gfp_mask: alloc_pages_node mask. Set __GFP_NOMEMALLOC if not for network packet RX
> > + * @skb: skb to set pfmemalloc on if __GFP_MEMALLOC is used
> > + *
> > + * Allocate a new page. dev currently unused.
> > + *
> > + * %NULL is returned if there is no free memory.
> > + */
> > +static inline struct page *__netdev_alloc_page(gfp_t gfp_mask,
> > + struct sk_buff *skb)
> > +{
> > + struct page *page;
> > +
> > + gfp_mask |= __GFP_COLD;
> > +
> > + if (!(gfp_mask & __GFP_NOMEMALLOC))
> > + gfp_mask |= __GFP_MEMALLOC;
> > +
> > + page = alloc_pages_node(NUMA_NO_NODE, gfp_mask, 0);
> > + if (skb && page && page->pfmemalloc)
> > + skb->pfmemalloc = true;
> > +
> > + return page;
> > +}
> > +
> > +/**
> > + * propagate_pfmemalloc_skb - Propagate pfmemalloc if skb is allocated after RX page
> > + * @page: The page that was allocated from netdev_alloc_page
> > + * @skb: The skb that may need pfmemalloc set
> > + */
> > +static inline void propagate_pfmemalloc_skb(struct page *page,
> > + struct sk_buff *skb)
> > +{
> > + if (page && page->pfmemalloc)
> > + skb->pfmemalloc = true;
> > +}
> > +
> > +/**
> > * skb_frag_page - retrieve the page refered to by a paged fragment
> > * @frag: the paged fragment
> > *
>
> Is this function even really needed?
It's not *really* needed. As noted in the changelog, getting this wrong
has minor consequences. At worst, swap becomes a little slower but it
should not result in hangs.
> It seems like you already have
> this covered in your earlier patches, specifically 9/15, which takes
> care of associating the skb and the page pfmemalloc flags when you use
> skb_fill_page_desc.
Yes, this patch was an attempt at being thorough but the actual impact
is moving a bunch of complexity into drivers where it is difficult to
test and of marginal benefit.
> It would be useful to narrow things down so that we
> are associating this either at the allocation time or at the
> fill_page_desc call instead of doing it at both.
>
I think you're right. I'm going to drop this patch entirely as the
benefit is marginal and not necessary for swap over network to work.
Thanks very much for the review.
--
Mel Gorman
SUSE Labs
On Wed, Feb 08, 2012 at 08:51:11PM +0800, Hillf Danton wrote:
> On Tue, Feb 7, 2012 at 9:27 PM, Mel Gorman <[email protected]> wrote:
> > On Tue, Feb 07, 2012 at 08:45:18PM +0800, Hillf Danton wrote:
> >> If it is feasible to bypass hang by tuning min_mem_kbytes,
> >
> > No. Increasing or decreasing min_free_kbytes changes the timing but it
> > will still hang.
> >
> >> things may
> >> become simpler if NICs are also tagged.
> >
> > That would mean making changes to every driver and they do not necessarily
> > know what higher level protocol like TCP they are transmitting. How is
> > that simpler? What is the benefit?
> >
> The benefit is to avoid allocating sock buffers in softirq by recycling;
> the changes in the VM core may then be smaller.
>
The VM is responsible for swapping. It's reasonable that the core
VM has responsibility for it without trying to shove complexity into
drivers or elsewhere unnecessarily. I see some benefit in following on
by recycling some skbs and only allocating from softirq if no recycled
skbs are available. That potentially improves performance but I do not
see recycling as a replacement.
--
Mel Gorman
SUSE Labs
On Wed, Feb 08, 2012 at 09:14:32AM -0600, Christoph Lameter wrote:
> On Wed, 8 Feb 2012, Mel Gorman wrote:
>
> > o struct kmem_cache_cpu could be left alone even though it's a small saving
>
> Its multiplied by the number of caches and by the number of
> processors.
>
> > o struct slab also be left alone
> > o struct array_cache could be left alone although I would point out that
> > it would make no difference in size as touched is changed to a bool to
> > fit pfmemalloc in
>
> Both of these are performance critical structures in slab.
>
Ok, I looked into what is necessary to replace these with checking a page
flag and the cost shifts quite a bit and ends up being more expensive.
Right now, I use array_cache to record if there are any pfmemalloc
objects in the free list at all. If there are not, no expensive checks
are made. For example, in __ac_put_obj(), I check ac->pfmemalloc to see
if an expensive check is required. Using a page flag, the same check
requires a lookup with virt_to_page(). This in turns uses a
pfn_to_page() which depending on the memory model can be very expensive.
No matter what, it's more expensive than a simple check and this is in
the slab free path.
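To recap the two forms side by side (both taken from the patches in this
thread), the free path goes from a flag test on the array_cache to a page
lookup on every object:

	/* current approach: cheap test on the array_cache */
	if (unlikely(ac->pfmemalloc)) {
		slabp = virt_to_slab(objp);
		if (slabp->pfmemalloc)
			set_obj_pfmemalloc(&objp);
	}

	/* page-flag approach: virt_to_page() on every free */
	if (PageSlabPfmemalloc(virt_to_page(objp)))
		set_obj_pfmemalloc(&objp);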
It is more complicated in check_ac_pfmemalloc() too although the performance
impact is less because it is a slow path. If ac->pfmemalloc is false,
the check of each slabp can be avoided. Without it, all the slabps must
be checked unconditionally and each slabp that is checked must call
virt_to_page().
Overall, the memory savings of moving to a page flag are minuscule but
the performance cost is far higher because of the use of virt_to_page().
> > o It would still be necessary to do the object pointer tricks in slab.c
>
> These trick are not done for slub. It seems that they are not necessary?
>
In slub, it's sufficient to check kmem_cache_cpu to know whether the
objects in the list are pfmemalloc or not.
> > remain. However, the downside of requiring a page flag is very high. In
> > the event we increase the number of page flags - great, I'll use one but
> > right now I do not think the use of page flag is justified.
>
> On 64 bit I think there is not much of an issue with another page flag.
>
There isn't, but on 32 bit there is.
> Also consider that the slab allocators do not make full use of the other
> page flags. We could overload one of the existing flags. I removed
> slub's use of them last year. PG_active could be overloaded I think.
>
Yeah, you're right on the button there. I did my checking assuming that
PG_active+PG_slab were safe to use. The following is an untested patch that
I probably got details wrong in but it illustrates where virt_to_page()
starts cropping up.
It was a good idea and thanks for thinking of it but unfortunately the
implementation would be more expensive than what I have currently.
diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h
index e90a673..108f3ce 100644
--- a/include/linux/page-flags.h
+++ b/include/linux/page-flags.h
@@ -432,6 +432,35 @@ static inline int PageTransCompound(struct page *page)
}
#endif
+#ifdef CONFIG_NFS_SWAP
+static inline int PageSlabPfmemalloc(struct page *page)
+{
+ VM_BUG_ON(!PageSlab(page));
+ return PageActive(page);
+}
+
+static inline void SetPageSlabPfmemalloc(struct page *page)
+{
+ VM_BUG_ON(!PageSlab(page));
+ SetPageActive(page);
+}
+
+static inline void ClearPageSlabPfmemalloc(struct page *page)
+{
+ VM_BUG_ON(!PageSlab(page));
+ ClearPageActive(page);
+}
+#else
+static inline int PageSlabPfmemalloc(struct page *page)
+{
+ return 0;
+}
+
+static inline void SetPageSlabPfmemalloc(struct page *page)
+{
+}
+#endif
+
#ifdef CONFIG_MMU
#define __PG_MLOCKED (1 << PG_mlocked)
#else
diff --git a/include/linux/slub_def.h b/include/linux/slub_def.h
index 1d9ae40..a32bcfd 100644
--- a/include/linux/slub_def.h
+++ b/include/linux/slub_def.h
@@ -46,7 +46,6 @@ struct kmem_cache_cpu {
struct page *page; /* The slab from which we are allocating */
struct page *partial; /* Partially allocated frozen slabs */
int node; /* The node of the page (or -1 for debug) */
- bool pfmemalloc; /* Slab page had pfmemalloc set */
#ifdef CONFIG_SLUB_STATS
unsigned stat[NR_SLUB_STAT_ITEMS];
#endif
diff --git a/mm/slab.c b/mm/slab.c
index 268cd96..3012186 100644
--- a/mm/slab.c
+++ b/mm/slab.c
@@ -233,7 +233,6 @@ struct slab {
unsigned int inuse; /* num of objs active in slab */
kmem_bufctl_t free;
unsigned short nodeid;
- bool pfmemalloc; /* Slab had pfmemalloc set */
};
struct slab_rcu __slab_cover_slab_rcu;
};
@@ -255,8 +254,7 @@ struct array_cache {
unsigned int avail;
unsigned int limit;
unsigned int batchcount;
- bool touched;
- bool pfmemalloc;
+ unsigned int touched;
spinlock_t lock;
void *entry[]; /*
* Must have this definition in here for the proper
@@ -978,6 +976,13 @@ static struct array_cache *alloc_arraycache(int node, int entries,
return nc;
}
+static inline bool is_slab_pfmemalloc(struct slab *slabp)
+{
+ struct page *page = virt_to_page(slabp->s_mem);
+
+ return PageSlabPfmemalloc(page);
+}
+
/* Clears ac->pfmemalloc if no slabs have pfmalloc set */
static void check_ac_pfmemalloc(struct kmem_cache *cachep,
struct array_cache *ac)
@@ -985,22 +990,18 @@ static void check_ac_pfmemalloc(struct kmem_cache *cachep,
struct kmem_list3 *l3 = cachep->nodelists[numa_mem_id()];
struct slab *slabp;
- if (!ac->pfmemalloc)
- return;
-
list_for_each_entry(slabp, &l3->slabs_full, list)
- if (slabp->pfmemalloc)
+ if (is_slab_pfmemalloc(slabp))
return;
list_for_each_entry(slabp, &l3->slabs_partial, list)
- if (slabp->pfmemalloc)
+ if (is_slab_pfmemalloc(slabp))
return;
list_for_each_entry(slabp, &l3->slabs_free, list)
- if (slabp->pfmemalloc)
+ if (is_slab_pfmemalloc(slabp))
return;
- ac->pfmemalloc = false;
}
static void *__ac_get_obj(struct kmem_cache *cachep, struct array_cache *ac,
@@ -1036,7 +1037,7 @@ static void *__ac_get_obj(struct kmem_cache *cachep, struct array_cache *ac,
l3 = cachep->nodelists[numa_mem_id()];
if (!list_empty(&l3->slabs_free) && force_refill) {
struct slab *slabp = virt_to_slab(objp);
- slabp->pfmemalloc = false;
+ ClearPageSlabPfmemalloc(virt_to_page(slabp->s_mem));
clear_obj_pfmemalloc(&objp);
check_ac_pfmemalloc(cachep, ac);
return objp;
@@ -1066,15 +1067,11 @@ static inline void *ac_get_obj(struct kmem_cache *cachep,
static void *__ac_put_obj(struct kmem_cache *cachep, struct array_cache *ac,
void *objp)
{
- struct slab *slabp;
+ struct page *page = virt_to_page(objp);
/* If there are pfmemalloc slabs, check if the object is part of one */
- if (unlikely(ac->pfmemalloc)) {
- slabp = virt_to_slab(objp);
-
- if (slabp->pfmemalloc)
- set_obj_pfmemalloc(&objp);
- }
+ if (PageSlabPfmemalloc(page))
+ set_obj_pfmemalloc(&objp);
return objp;
}
@@ -1906,9 +1903,13 @@ static void *kmem_getpages(struct kmem_cache *cachep, gfp_t flags, int nodeid,
else
add_zone_page_state(page_zone(page),
NR_SLAB_UNRECLAIMABLE, nr_pages);
- for (i = 0; i < nr_pages; i++)
+ for (i = 0; i < nr_pages; i++) {
__SetPageSlab(page + i);
+ if (*pfmemalloc)
+ SetPageSlabPfmemalloc(page);
+ }
+
if (kmemcheck_enabled && !(cachep->flags & SLAB_NOTRACK)) {
kmemcheck_alloc_shadow(page, cachep->gfporder, flags, nodeid);
@@ -2888,7 +2889,6 @@ static struct slab *alloc_slabmgmt(struct kmem_cache *cachep, void *objp,
slabp->s_mem = objp + colour_off;
slabp->nodeid = nodeid;
slabp->free = 0;
- slabp->pfmemalloc = false;
return slabp;
}
@@ -3074,13 +3074,6 @@ static int cache_grow(struct kmem_cache *cachep,
if (!slabp)
goto opps1;
- /* Record if ALLOC_NO_WATERMARKS was set when allocating the slab */
- if (pfmemalloc) {
- struct array_cache *ac = cpu_cache_get(cachep);
- slabp->pfmemalloc = true;
- ac->pfmemalloc = true;
- }
-
slab_map_pages(cachep, slabp, objp);
cache_init_objs(cachep, slabp);
On 02/06/2012 05:56 PM, Mel Gorman wrote:
> There is a race between the min_free_kbytes sysctl, memory hotplug
> and transparent hugepage support enablement. Memory hotplug uses a
> zonelists_mutex to avoid a race when building zonelists. Reuse it to
> serialise watermark updates.
>
> [[email protected]: Older patch fixed the race with spinlock]
> Signed-off-by: Mel Gorman<[email protected]>
Reviewed-by: Rik van Riel <[email protected]>
On Wed, 8 Feb 2012, Mel Gorman wrote:
> Ok, I looked into what is necessary to replace these with checking a page
> flag and the cost shifts quite a bit and ends up being more expensive.
That is only true if you go the slab route. Slab suffers from not having
the page struct pointer readily available. The changes are likely already
impacting slab performance without the virt_to_page patch.
> In slub, it's sufficient to check kmem_cache_cpu to know whether the
> objects in the list are pfmemalloc or not.
We try to minimize the size of kmem_cache_cpu. The page pointer is readily
available. We just removed the node field from kmem_cache_cpu because it
was less expensive to get the node number from the struct page field.
The same is certainly true for a PFMEMALLOC flag.
> Yeah, you're right on the button there. I did my checking assuming that
> PG_active+PG_slab were safe to use. The following is an untested patch that
> I probably got details wrong in but it illustrates where virt_to_page()
> starts cropping up.
Yes you need to come up with a way to not use virt_to_page otherwise slab
performance is significantly impacted. On NUMA we are already doing a page
struct lookup on free in slab. If you would save the page struct pointer
there and reuse it then you would not have an issue at least on free.
You still would need to determine which "struct slab" pointer is in use
which will also require similar lookups in various places.
Transfer of the pfmemalloc flags (guess you must have a pfmemalloc
field in struct slab then) in slab is best done when allocating and
freeing a slab page from the page allocator.
I think it's rather trivial to add the support you want in a non-intrusive
way to slub. Slab would require some more thought and discussion.
On Wed, Feb 08, 2012 at 01:49:05PM -0600, Christoph Lameter wrote:
> On Wed, 8 Feb 2012, Mel Gorman wrote:
>
> > Ok, I looked into what is necessary to replace these with checking a page
> > flag and the cost shifts quite a bit and ends up being more expensive.
>
> That is only true if you go the slab route.
Well, yes but both slab and slub have to be supported. I see no reason
why I would choose to make this a slab-only or slub-only feature. Slob is
not supported because it's not expected that a platform using slob is also
going to use network-based swap.
> Slab suffers from not having
> the page struct pointer readily available. The changes are likely already
> impacting slab performance without the virt_to_page patch.
>
The performance impact only comes into play when swap is on a network
device and pfmemalloc reserves are in use. The rest of the time the check
on ac avoids all the cost and there is a micro-optimisation later to avoid
calling a function (patch 12).
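For reference, the check in question currently has this shape (the exact
code is in the patch, this is just the outline); on a machine that never
allocates from the reserves the unlikely() branch is never taken:

static void *__ac_put_obj(struct kmem_cache *cachep, struct array_cache *ac,
			  void *objp)
{
	/* Only pay for virt_to_slab() if a pfmemalloc slab exists at all */
	if (unlikely(ac->pfmemalloc)) {
		struct slab *slabp = virt_to_slab(objp);

		if (slabp->pfmemalloc)
			set_obj_pfmemalloc(&objp);
	}
	return objp;
}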
> > In slub, it's sufficient to check kmem_cache_cpu to know whether the
> > objects in the list are pfmemalloc or not.
>
> We try to minimize the size of kmem_cache_cpu. The page pointer is readily
> available. We just removed the node field from kmem_cache_cpu because it
> was less expensive to get the node number from the struct page field.
>
> The same is certainly true for a PFMEMALLOC flag.
>
Ok, are you asking that I use the page flag for slub and leave kmem_cache_cpu
alone in the slub case? I can certainly check it out if that's what you
are asking for.
> > Yeah, you're right on the button there. I did my checking assuming that
> > PG_active+PG_slab were safe to use. The following is an untested patch that
> > I probably got details wrong in but it illustrates where virt_to_page()
> > starts cropping up.
>
> Yes you need to come up with a way to not use virt_to_page otherwise slab
> performance is significantly impacted.
I did come up with a way: the necessary information is in ac and slabp
on slab :/ . There are not exactly many ways the information can
be recorded.
> On NUMA we are already doing a page struct lookup on free in slab.
> If you would save the page struct pointer
> there and reuse it then you would not have an issue at least on free.
>
That information is only available on NUMA and only when there is more than
one node. Having cache_free_alien return the page for passing to ac_put_obj()
would also be ugly. The biggest downside by far is that single-node machines
incur the cost of virt_to_page() where they did not have to before. This
is not a solution and it is not better than the current simple check on
a struct field.
> You still would need to determine which "struct slab" pointer is in use,
> which will also require similar lookups in various places.
>
> Transfer of the pfmemalloc flags (guess you must have a pfmemalloc
> field in struct slab then) in slab is best done when allocating and
> freeing a slab page from the page allocator.
>
The page->pfmemalloc flag is already transferred to the slab in
cache_grow.
> I think it's rather trivial to add the support you want in a non-intrusive
> way to slub. Slab would require some more thought and discussion.
>
I'm slightly confused by this sentence. Support for slub is already in the
patch and as you say, it's fairly straight-forward. Supporting a page flag
and leaving kmem_cache_cpu alone may also be easier as kmem_cache_cpu->page
can be used instead of a kmem_cache_cpu->pfmemalloc field.
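For illustration, the check can then be derived from the per-cpu slab page
(this is roughly the shape the updated patch further down takes):

static inline bool pfmemalloc_match(struct kmem_cache_cpu *c, gfp_t gfpflags)
{
	/* The pfmemalloc state lives on the slab page itself */
	if (unlikely(PageSlabPfmemalloc(c->page)))
		return gfp_pfmemalloc_allowed(gfpflags);

	return true;
}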
--
Mel Gorman
SUSE Labs
On Wed, 8 Feb 2012, Mel Gorman wrote:
> On Wed, Feb 08, 2012 at 01:49:05PM -0600, Christoph Lameter wrote:
> > On Wed, 8 Feb 2012, Mel Gorman wrote:
> >
> > > Ok, I looked into what is necessary to replace these with checking a page
> > > flag and the cost shifts quite a bit and ends up being more expensive.
> >
> > That is only true if you go the slab route.
>
> Well, yes but both slab and slub have to be supported. I see no reason
> why I would choose to make this a slab-only or slub-only feature. Slob is
> not supported because it's not expected that a platform using slob is also
> going to use network-based swap.
I think the patches so far, in particular to slab.c, are pretty significant
in impact.
> > Slab suffers from not having
> > the page struct pointer readily available. The changes are likely already
> > impacting slab performance without the virt_to_page patch.
> >
>
> The performance impact only comes into play when swap is on a network
> device and pfmemalloc reserves are in use. The rest of the time the check
> on ac avoids all the cost and there is a micro-optimisation later to avoid
> calling a function (patch 12).
We have been down this road too many times. Logic is added to critical
paths and memory structures grow. This is not free. And for NBD swap
support? Pretty exotic use case.
> Ok, are you asking that I use the page flag for slub and leave kmem_cache_cpu
> alone in the slub case? I can certainly check it out if that's what you
> are asking for.
No I am not asking for something. Still thinking about the best way to
address the issues. I think we can easily come up with a minimally
invasive patch for slub. Not sure about slab at this point. I think we
could avoid most of the new fields but this requires some tinkering. I
have a day @ home tomorrow which hopefully gives me a chance to
put some focus on this issue.
> I did come up with a way: the necessary information is in ac and slabp
> on slab :/ . There are not exactly many ways that the information can
> be recorded.
Wish we had something that would not involve increasing the number of
fields in these slab structures.
On Wed, Feb 08, 2012 at 04:13:15PM -0600, Christoph Lameter wrote:
> On Wed, 8 Feb 2012, Mel Gorman wrote:
>
> > On Wed, Feb 08, 2012 at 01:49:05PM -0600, Christoph Lameter wrote:
> > > On Wed, 8 Feb 2012, Mel Gorman wrote:
> > >
> > > > Ok, I looked into what is necessary to replace these with checking a page
> > > > flag and the cost shifts quite a bit and ends up being more expensive.
> > >
> > > That is only true if you go the slab route.
> >
> > Well, yes but both slab and slub have to be supported. I see no reason
> > why I would choose to make this a slab-only or slub-only feature. Slob is
> > not supported because it's not expected that a platform using slob is also
> > going to use network-based swap.
>
> I think the patches so far, in particular to slab.c, are pretty significant
> in impact.
>
Ok, I am working on a solution that does not affect any of the existing
slab structures. Between that and the fact we check if there are any
memalloc_socks after patch 12, the impact for normal systems is an additional
branch in ac_get_obj() and ac_put_obj()
> > > Slab suffers from not having
> > > the page struct pointer readily available. The changes are likely already
> > > impacting slab performance without the virt_to_page patch.
> > >
> >
> > The performance impact only comes into play when swap is on a network
> > device and pfmemalloc reserves are in use. The rest of the time the check
> > on ac avoids all the cost and there is a micro-optimisation later to avoid
> > calling a function (patch 12).
>
> We have been down this road too many times. Logic is added to critical
> paths and memory structures grow. This is not free. And for NBD swap
> support? Pretty exotic use case.
>
NFS support is the real target. NBD is the logical starting point and
NFS needs the same support.
> > Ok, are you asking that I use the page flag for slub and leave kmem_cache_cpu
> > alone in the slub case? I can certainly check it out if that's what you
> > are asking for.
>
> No I am not asking for something. Still thinking about the best way to
> address the issues. I think we can easily come up with a minimally
> invasive patch for slub. Not sure about slab at this point. I think we
> could avoid most of the new fields but this requires some tinkering. I
> have a day @ home tomorrow which hopefully gives me a chance to
> put some focus on this issue.
>
I think we can avoid adding any additional fields, but a new read-mostly
global is needed in place of the array_cache field. Also, when network
storage is in use and under
memory pressure, it might be slower as we will have lost granularity on
what slabs are using pfmemalloc. That is an acceptable compromise as it
moves the cost to users of network-based swap instead of normal usage.
> > I did come up with a way: the necessary information is in ac and slabp
> > on slab :/ . There are not exactly many ways that the information can
> > be recorded.
>
> Wish we had something that would not involve increasing the number of
> fields in these slab structures.
>
This is what I currently have. It's untested but builds. It reverts the
structures back to the way they were and uses page flags instead.
diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h
index e90a673..f96fa87 100644
--- a/include/linux/page-flags.h
+++ b/include/linux/page-flags.h
@@ -432,6 +432,28 @@ static inline int PageTransCompound(struct page *page)
}
#endif
+/*
+ * If network-based swap is enabled, sl*b must keep track of whether pages
+ * were allocated from pfmemalloc reserves.
+ */
+static inline int PageSlabPfmemalloc(struct page *page)
+{
+ VM_BUG_ON(!PageSlab(page));
+ return PageActive(page);
+}
+
+static inline void SetPageSlabPfmemalloc(struct page *page)
+{
+ VM_BUG_ON(!PageSlab(page));
+ SetPageActive(page);
+}
+
+static inline void ClearPageSlabPfmemalloc(struct page *page)
+{
+ VM_BUG_ON(!PageSlab(page));
+ ClearPageActive(page);
+}
+
#ifdef CONFIG_MMU
#define __PG_MLOCKED (1 << PG_mlocked)
#else
diff --git a/include/linux/slub_def.h b/include/linux/slub_def.h
index 1d9ae40..a32bcfd 100644
--- a/include/linux/slub_def.h
+++ b/include/linux/slub_def.h
@@ -46,7 +46,6 @@ struct kmem_cache_cpu {
struct page *page; /* The slab from which we are allocating */
struct page *partial; /* Partially allocated frozen slabs */
int node; /* The node of the page (or -1 for debug) */
- bool pfmemalloc; /* Slab page had pfmemalloc set */
#ifdef CONFIG_SLUB_STATS
unsigned stat[NR_SLUB_STAT_ITEMS];
#endif
diff --git a/mm/slab.c b/mm/slab.c
index 268cd96..783a92e 100644
--- a/mm/slab.c
+++ b/mm/slab.c
@@ -155,6 +155,12 @@
#define ARCH_KMALLOC_FLAGS SLAB_HWCACHE_ALIGN
#endif
+/*
+ * true if a page was allocated from pfmemalloc reserves for network-based
+ * swap
+ */
+static bool pfmemalloc_active;
+
/* Legal flag mask for kmem_cache_create(). */
#if DEBUG
# define CREATE_MASK (SLAB_RED_ZONE | \
@@ -233,7 +239,6 @@ struct slab {
unsigned int inuse; /* num of objs active in slab */
kmem_bufctl_t free;
unsigned short nodeid;
- bool pfmemalloc; /* Slab had pfmemalloc set */
};
struct slab_rcu __slab_cover_slab_rcu;
};
@@ -255,8 +260,7 @@ struct array_cache {
unsigned int avail;
unsigned int limit;
unsigned int batchcount;
- bool touched;
- bool pfmemalloc;
+ unsigned int touched;
spinlock_t lock;
void *entry[]; /*
* Must have this definition in here for the proper
@@ -978,6 +982,13 @@ static struct array_cache *alloc_arraycache(int node, int entries,
return nc;
}
+static inline bool is_slab_pfmemalloc(struct slab *slabp)
+{
+ struct page *page = virt_to_page(slabp->s_mem);
+
+ return PageSlabPfmemalloc(page);
+}
+
/* Clears ac->pfmemalloc if no slabs have pfmalloc set */
static void check_ac_pfmemalloc(struct kmem_cache *cachep,
struct array_cache *ac)
@@ -985,22 +996,22 @@ static void check_ac_pfmemalloc(struct kmem_cache *cachep,
struct kmem_list3 *l3 = cachep->nodelists[numa_mem_id()];
struct slab *slabp;
- if (!ac->pfmemalloc)
+ if (!pfmemalloc_active)
return;
list_for_each_entry(slabp, &l3->slabs_full, list)
- if (slabp->pfmemalloc)
+ if (is_slab_pfmemalloc(slabp))
return;
list_for_each_entry(slabp, &l3->slabs_partial, list)
- if (slabp->pfmemalloc)
+ if (is_slab_pfmemalloc(slabp))
return;
list_for_each_entry(slabp, &l3->slabs_free, list)
- if (slabp->pfmemalloc)
+ if (is_slab_pfmemalloc(slabp))
return;
- ac->pfmemalloc = false;
+ pfmemalloc_active = false;
}
static void *__ac_get_obj(struct kmem_cache *cachep, struct array_cache *ac,
@@ -1036,7 +1047,7 @@ static void *__ac_get_obj(struct kmem_cache *cachep, struct array_cache *ac,
l3 = cachep->nodelists[numa_mem_id()];
if (!list_empty(&l3->slabs_free) && force_refill) {
struct slab *slabp = virt_to_slab(objp);
- slabp->pfmemalloc = false;
+ ClearPageSlabPfmemalloc(virt_to_page(slabp->s_mem));
clear_obj_pfmemalloc(&objp);
check_ac_pfmemalloc(cachep, ac);
return objp;
@@ -1066,13 +1077,10 @@ static inline void *ac_get_obj(struct kmem_cache *cachep,
static void *__ac_put_obj(struct kmem_cache *cachep, struct array_cache *ac,
void *objp)
{
- struct slab *slabp;
-
- /* If there are pfmemalloc slabs, check if the object is part of one */
- if (unlikely(ac->pfmemalloc)) {
- slabp = virt_to_slab(objp);
-
- if (slabp->pfmemalloc)
+ if (unlikely(pfmemalloc_active)) {
+ /* Some pfmemalloc slabs exist, check if this is one */
+ struct page *page = virt_to_page(objp);
+ if (PageSlabPfmemalloc(page))
set_obj_pfmemalloc(&objp);
}
@@ -1906,9 +1914,13 @@ static void *kmem_getpages(struct kmem_cache *cachep, gfp_t flags, int nodeid,
else
add_zone_page_state(page_zone(page),
NR_SLAB_UNRECLAIMABLE, nr_pages);
- for (i = 0; i < nr_pages; i++)
+ for (i = 0; i < nr_pages; i++) {
__SetPageSlab(page + i);
+ if (*pfmemalloc)
+ SetPageSlabPfmemalloc(page);
+ }
+
if (kmemcheck_enabled && !(cachep->flags & SLAB_NOTRACK)) {
kmemcheck_alloc_shadow(page, cachep->gfporder, flags, nodeid);
@@ -2888,7 +2900,6 @@ static struct slab *alloc_slabmgmt(struct kmem_cache *cachep, void *objp,
slabp->s_mem = objp + colour_off;
slabp->nodeid = nodeid;
slabp->free = 0;
- slabp->pfmemalloc = false;
return slabp;
}
@@ -3075,11 +3086,8 @@ static int cache_grow(struct kmem_cache *cachep,
goto opps1;
/* Record if ALLOC_NO_WATERMARKS was set when allocating the slab */
- if (pfmemalloc) {
- struct array_cache *ac = cpu_cache_get(cachep);
- slabp->pfmemalloc = true;
- ac->pfmemalloc = true;
- }
+ if (unlikely(pfmemalloc))
+ pfmemalloc_active = pfmemalloc;
slab_map_pages(cachep, slabp, objp);
diff --git a/mm/slub.c b/mm/slub.c
index ea04994..f9b0f35 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -2126,7 +2126,8 @@ static inline void *new_slab_objects(struct kmem_cache *s, gfp_t flags,
stat(s, ALLOC_SLAB);
c->node = page_to_nid(page);
c->page = page;
- c->pfmemalloc = pfmemalloc;
+ if (pfmemalloc)
+ SetPageSlabPfmemalloc(page);
*pc = c;
} else
object = NULL;
@@ -2136,7 +2137,7 @@ static inline void *new_slab_objects(struct kmem_cache *s, gfp_t flags,
static inline bool pfmemalloc_match(struct kmem_cache_cpu *c, gfp_t gfpflags)
{
- if (unlikely(c->pfmemalloc))
+ if (unlikely(PageSlabPfmemalloc(c->page)))
return gfp_pfmemalloc_allowed(gfpflags);
return true;
On Thu, 9 Feb 2012, Mel Gorman wrote:
> Ok, I am working on a solution that does not affect any of the existing
> slab structures. Between that and the fact we check if there are any
> memalloc_socks after patch 12, the impact for normal systems is an additional
> branch in ac_get_obj() and ac_put_obj()
That sounds good in particular since some other things came up again,
sigh. Have not had time to see if an alternate approach works.
> > We have been down this road too many times. Logic is added to critical
> > paths and memory structures grow. This is not free. And for NBD swap
> > support? Pretty exotic use case.
> >
>
> NFS support is the real target. NBD is the logical starting point and
> NFS needs the same support.
But this is already a pretty strange use case on multiple levels. Swap is
really detrimental to performance. It's a kind of emergency outlet that
gets worse with every new step that increases the differential in
performance between disk and memory. On top of that you want to add
special code in various subsystems to also do that over the network.
Sigh. I think we agreed a while back that we want to limit the amount of
I/O triggered from reclaim paths? AFAICT many filesystems do not support
writeout from reclaim anymore because of all the issues that arise at that
level.
We have numerous other mechanisms that can compress swap etc. and provide
ways to work around the problem without I/O, which has always been
troublesome. These fixes are likely only to work in a very limited
way, causing a lot of maintenance effort because (given the exotic
nature) it is highly likely that there are corner cases that will only be
triggered rarely.
On Thu, Feb 09, 2012 at 01:53:58PM -0600, Christoph Lameter wrote:
> On Thu, 9 Feb 2012, Mel Gorman wrote:
>
> > Ok, I am working on a solution that does not affect any of the existing
> > slab structures. Between that and the fact we check if there are any
> > memalloc_socks after patch 12, the impact for normal systems is an additional
> > branch in ac_get_obj() and ac_put_obj()
>
> That sounds good in particular since some other things came up again,
> sigh. Have not had time to see if an alternate approach works.
>
I have an updated version of this 02/15 patch below. It passed testing
and is a lot less invasive than the previous release. As you suggested,
it uses page flags and the bulk of the complexity is only executed if
someone is using network-backed storage.
> > > We have been down this road too many times. Logic is added to critical
> > > paths and memory structures grow. This is not free. And for NBD swap
> > > support? Pretty exotic use case.
> > >
> >
> > NFS support is the real target. NBD is the logical starting point and
> > NFS needs the same support.
>
> But this is already a pretty strange use case on multiple levels. Swap is
> really detrimental to performance. It's a kind of emergency outlet that
> gets worse with every new step that increases the differential in
> performance between disk and memory.
Performance is generally not the concern of the users of swap-over-N[FS|BD].
In the cases I am aware of, they just want an emergency overflow. One user
for example had an application with a sparse mapping larger than physical
memory. During the workload execution it would occasionally push small
parts out to swap and needed network-based swap due to the lack of a local
disk. The performance impact was not a concern because swapping was rare.
> On top of that you want to add
> special code in various subsystems to also do that over the network.
> Sigh. I think we agreed a while back that we want to limit the amount of
> I/O triggered from reclaim paths?
Specifically we wanted to reduce or stop page reclaim calling ->writepage()
for file-backed pages because it generated awful IO patterns and deep
call stacks. We still write anonymous pages from page reclaim because we
do not have a dedicated thread for writing to swap. It is expected that
the call stack for writing to network storage would be less than a
filesystem.
> AFAICT many filesystems do not support
> writeout from reclaim anymore because of all the issues that arise at that
> level.
>
NBD is a block device so filesystem restrictions like you mention do not
apply. In NFS, the direct_IO paths are used to write pages not
->writepage so again the restriction does not apply.
> We have numerous other mechanisms that can compress swap etc. and provide
> ways to work around the problem without I/O, which has always been
> troublesome. These fixes are likely only to work in a very limited
> way, causing a lot of maintenance effort because (given the exotic
> nature) it is highly likely that there are corner cases that will only be
> triggered rarely.
Compressing swap only gets you so far. For some workloads, at some point
the anonymous pages have to be written to swap somewhere. If there is a
local disk, great, use it. If there is no disk, then either
hardware-based solutions are needed (HBA that exposes the network as a
block device, works but is expensive), virtualisation is used (the host
OS exposes a network-based swapfile as a block device to the guest, but this
is only usable in virtualisation), or you need something like these patches.
Here is the revised 02/15 patch
=== CUT HERE ===
mm: sl[au]b: Add knowledge of PFMEMALLOC reserve pages
Allocations of pages below the min watermark run a risk of the
machine hanging due to a lack of memory. To prevent this, only
callers who have PF_MEMALLOC or TIF_MEMDIE set and are not processing
an interrupt are allowed to allocate with ALLOC_NO_WATERMARKS. Once
they are allocated to a slab though, nothing prevents other callers
consuming free objects within those slabs. This patch limits access
to slab pages that were allocated from the PFMEMALLOC reserves.
Pages allocated from the reserve are returned with page->pfmemalloc
set and it is up to the caller to determine how the page should be
protected. SLAB restricts access to any page with page->pfmemalloc set
to callers which are known to be able to access the PFMEMALLOC reserve. If
one is not available, an attempt is made to allocate a new page rather
than use a reserve. SLUB is a bit more relaxed in that it only records
if the current per-CPU page was allocated from PFMEMALLOC reserve and
uses another partial slab if the caller does not have the necessary
GFP or process flags. This was found to be sufficient in tests to
avoid hangs due to SLUB generally maintaining smaller lists than SLAB.
In low-memory conditions it does mean that !PFMEMALLOC allocators
can fail a slab allocation even though free objects are available
because they are being preserved for callers that are freeing pages.
[[email protected]: Original implementation]
Signed-off-by: Mel Gorman <[email protected]>
---
include/linux/mm_types.h | 9 ++
include/linux/page-flags.h | 28 +++++++
mm/internal.h | 3 +
mm/page_alloc.c | 27 +++++-
mm/slab.c | 190 +++++++++++++++++++++++++++++++++++++++-----
mm/slub.c | 27 ++++++-
6 files changed, 258 insertions(+), 26 deletions(-)
diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index 3cc3062..56a465f 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -53,6 +53,15 @@ struct page {
union {
pgoff_t index; /* Our offset within mapping. */
void *freelist; /* slub first free object */
+ bool pfmemalloc; /* If set by the page allocator,
+ * ALLOC_PFMEMALLOC was set
+ * and the low watermark was not
+ * met implying that the system
+ * is under some pressure. The
+ * caller should try to ensure
+ * this page is only used to
+ * free other pages.
+ */
};
union {
diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h
index e90a673..0c42973 100644
--- a/include/linux/page-flags.h
+++ b/include/linux/page-flags.h
@@ -432,6 +432,34 @@ static inline int PageTransCompound(struct page *page)
}
#endif
+/*
+ * If network-based swap is enabled, sl*b must keep track of whether pages
+ * were allocated from pfmemalloc reserves.
+ */
+static inline int PageSlabPfmemalloc(struct page *page)
+{
+ VM_BUG_ON(!PageSlab(page));
+ return PageActive(page);
+}
+
+static inline void SetPageSlabPfmemalloc(struct page *page)
+{
+ VM_BUG_ON(!PageSlab(page));
+ SetPageActive(page);
+}
+
+static inline void __ClearPageSlabPfmemalloc(struct page *page)
+{
+ VM_BUG_ON(!PageSlab(page));
+ __ClearPageActive(page);
+}
+
+static inline void ClearPageSlabPfmemalloc(struct page *page)
+{
+ VM_BUG_ON(!PageSlab(page));
+ ClearPageActive(page);
+}
+
#ifdef CONFIG_MMU
#define __PG_MLOCKED (1 << PG_mlocked)
#else
diff --git a/mm/internal.h b/mm/internal.h
index 2189af4..bff60d8 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -239,6 +239,9 @@ static inline struct page *mem_map_next(struct page *iter,
#define __paginginit __init
#endif
+/* Returns true if the gfp_mask allows use of ALLOC_NO_WATERMARK */
+bool gfp_pfmemalloc_allowed(gfp_t gfp_mask);
+
/* Memory initialisation debug and verification */
enum mminit_level {
MMINIT_WARNING,
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 8b3b8cf..6a3fa1c 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -695,6 +695,7 @@ static bool free_pages_prepare(struct page *page, unsigned int order)
trace_mm_page_free(page, order);
kmemcheck_free_shadow(page, order);
+ page->pfmemalloc = false;
if (PageAnon(page))
page->mapping = NULL;
for (i = 0; i < (1 << order); i++)
@@ -1221,6 +1222,7 @@ void free_hot_cold_page(struct page *page, int cold)
migratetype = get_pageblock_migratetype(page);
set_page_private(page, migratetype);
+ page->pfmemalloc = false;
local_irq_save(flags);
if (unlikely(wasMlocked))
free_page_mlock(page);
@@ -1427,6 +1429,7 @@ failed:
#define ALLOC_HARDER 0x10 /* try to alloc harder */
#define ALLOC_HIGH 0x20 /* __GFP_HIGH set */
#define ALLOC_CPUSET 0x40 /* check for correct cpuset */
+#define ALLOC_PFMEMALLOC 0x80 /* Caller has PF_MEMALLOC set */
#ifdef CONFIG_FAIL_PAGE_ALLOC
@@ -2170,16 +2173,22 @@ gfp_to_alloc_flags(gfp_t gfp_mask)
} else if (unlikely(rt_task(current)) && !in_interrupt())
alloc_flags |= ALLOC_HARDER;
- if (likely(!(gfp_mask & __GFP_NOMEMALLOC))) {
- if (!in_interrupt() &&
- ((current->flags & PF_MEMALLOC) ||
- unlikely(test_thread_flag(TIF_MEMDIE))))
+ if ((current->flags & PF_MEMALLOC) ||
+ unlikely(test_thread_flag(TIF_MEMDIE))) {
+ alloc_flags |= ALLOC_PFMEMALLOC;
+
+ if (likely(!(gfp_mask & __GFP_NOMEMALLOC)) && !in_interrupt())
alloc_flags |= ALLOC_NO_WATERMARKS;
}
return alloc_flags;
}
+bool gfp_pfmemalloc_allowed(gfp_t gfp_mask)
+{
+ return !!(gfp_to_alloc_flags(gfp_mask) & ALLOC_PFMEMALLOC);
+}
+
static inline struct page *
__alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
struct zonelist *zonelist, enum zone_type high_zoneidx,
@@ -2365,8 +2374,16 @@ nopage:
got_pg:
if (kmemcheck_enabled)
kmemcheck_pagealloc_alloc(page, order, gfp_mask);
- return page;
+ /*
+ * page->pfmemalloc is set when the caller had PFMEMALLOC set or has
+ * been OOM killed. The expectation is that the caller is taking
+ * steps that will free more memory. The caller should avoid the
+ * page being used for !PFMEMALLOC purposes.
+ */
+ page->pfmemalloc = !!(alloc_flags & ALLOC_PFMEMALLOC);
+
+ return page;
}
/*
diff --git a/mm/slab.c b/mm/slab.c
index f0bd785..f322dc2 100644
--- a/mm/slab.c
+++ b/mm/slab.c
@@ -123,6 +123,8 @@
#include <trace/events/kmem.h>
+#include "internal.h"
+
/*
* DEBUG - 1 for kmem_cache_create() to honour; SLAB_RED_ZONE & SLAB_POISON.
* 0 for faster, smaller code (especially in the critical paths).
@@ -151,6 +153,12 @@
#define ARCH_KMALLOC_FLAGS SLAB_HWCACHE_ALIGN
#endif
+/*
+ * true if a page was allocated from pfmemalloc reserves for network-based
+ * swap
+ */
+static bool pfmemalloc_active;
+
/* Legal flag mask for kmem_cache_create(). */
#if DEBUG
# define CREATE_MASK (SLAB_RED_ZONE | \
@@ -256,9 +264,30 @@ struct array_cache {
* Must have this definition in here for the proper
* alignment of array_cache. Also simplifies accessing
* the entries.
+ *
+ * Entries should not be directly dereferenced as
+ * entries belonging to slabs marked pfmemalloc will
+ * have the lower bits set SLAB_OBJ_PFMEMALLOC
*/
};
+#define SLAB_OBJ_PFMEMALLOC 1
+static inline bool is_obj_pfmemalloc(void *objp)
+{
+ return (unsigned long)objp & SLAB_OBJ_PFMEMALLOC;
+}
+
+static inline void set_obj_pfmemalloc(void **objp)
+{
+ *objp = (void *)((unsigned long)*objp | SLAB_OBJ_PFMEMALLOC);
+ return;
+}
+
+static inline void clear_obj_pfmemalloc(void **objp)
+{
+ *objp = (void *)((unsigned long)*objp & ~SLAB_OBJ_PFMEMALLOC);
+}
+
/*
* bootstrap: The caches do not work without cpuarrays anymore, but the
* cpuarrays are allocated from the generic caches...
@@ -951,6 +980,98 @@ static struct array_cache *alloc_arraycache(int node, int entries,
return nc;
}
+static inline bool is_slab_pfmemalloc(struct slab *slabp)
+{
+ struct page *page = virt_to_page(slabp->s_mem);
+
+ return PageSlabPfmemalloc(page);
+}
+
+/* Clears pfmemalloc_active if no slabs have pfmemalloc set */
+static void check_ac_pfmemalloc(struct kmem_cache *cachep,
+ struct array_cache *ac)
+{
+ struct kmem_list3 *l3 = cachep->nodelists[numa_mem_id()];
+ struct slab *slabp;
+
+ if (!pfmemalloc_active)
+ return;
+
+ list_for_each_entry(slabp, &l3->slabs_full, list)
+ if (is_slab_pfmemalloc(slabp))
+ return;
+
+ list_for_each_entry(slabp, &l3->slabs_partial, list)
+ if (is_slab_pfmemalloc(slabp))
+ return;
+
+ list_for_each_entry(slabp, &l3->slabs_free, list)
+ if (is_slab_pfmemalloc(slabp))
+ return;
+
+ pfmemalloc_active = false;
+}
+
+static void *ac_get_obj(struct kmem_cache *cachep, struct array_cache *ac,
+ gfp_t flags, bool force_refill)
+{
+ int i;
+ void *objp = ac->entry[--ac->avail];
+
+ /* Ensure the caller is allowed to use objects from PFMEMALLOC slab */
+ if (unlikely(is_obj_pfmemalloc(objp))) {
+ struct kmem_list3 *l3;
+
+ if (gfp_pfmemalloc_allowed(flags)) {
+ clear_obj_pfmemalloc(&objp);
+ return objp;
+ }
+
+ /* The caller cannot use PFMEMALLOC objects, find another one */
+ for (i = 1; i < ac->avail; i++) {
+ /* If a !PFMEMALLOC object is found, swap them */
+ if (!is_obj_pfmemalloc(ac->entry[i])) {
+ objp = ac->entry[i];
+ ac->entry[i] = ac->entry[ac->avail];
+ ac->entry[ac->avail] = objp;
+ return objp;
+ }
+ }
+
+ /*
+ * If there are empty slabs on the slabs_free list and we are
+ * being forced to refill the cache, mark this one !pfmemalloc.
+ */
+ l3 = cachep->nodelists[numa_mem_id()];
+ if (!list_empty(&l3->slabs_free) && force_refill) {
+ struct slab *slabp = virt_to_slab(objp);
+ ClearPageSlabPfmemalloc(virt_to_page(slabp->s_mem));
+ clear_obj_pfmemalloc(&objp);
+ check_ac_pfmemalloc(cachep, ac);
+ return objp;
+ }
+
+ /* No !PFMEMALLOC objects available */
+ ac->avail++;
+ objp = NULL;
+ }
+
+ return objp;
+}
+
+static void ac_put_obj(struct kmem_cache *cachep, struct array_cache *ac,
+ void *objp)
+{
+ if (unlikely(pfmemalloc_active)) {
+ /* Some pfmemalloc slabs exist, check if this is one */
+ struct page *page = virt_to_page(objp);
+ if (PageSlabPfmemalloc(page))
+ set_obj_pfmemalloc(&objp);
+ }
+
+ ac->entry[ac->avail++] = objp;
+}
+
/*
* Transfer objects in one arraycache to another.
* Locking must be handled by the caller.
@@ -1127,7 +1248,7 @@ static inline int cache_free_alien(struct kmem_cache *cachep, void *objp)
STATS_INC_ACOVERFLOW(cachep);
__drain_alien_cache(cachep, alien, nodeid);
}
- alien->entry[alien->avail++] = objp;
+ ac_put_obj(cachep, alien, objp);
spin_unlock(&alien->lock);
} else {
spin_lock(&(cachep->nodelists[nodeid])->list_lock);
@@ -1760,6 +1881,10 @@ static void *kmem_getpages(struct kmem_cache *cachep, gfp_t flags, int nodeid)
if (!page)
return NULL;
+ /* Record if ALLOC_PFMEMALLOC was set when allocating the slab */
+ if (unlikely(page->pfmemalloc))
+ pfmemalloc_active = true;
+
nr_pages = (1 << cachep->gfporder);
if (cachep->flags & SLAB_RECLAIM_ACCOUNT)
add_zone_page_state(page_zone(page),
@@ -1767,9 +1892,13 @@ static void *kmem_getpages(struct kmem_cache *cachep, gfp_t flags, int nodeid)
else
add_zone_page_state(page_zone(page),
NR_SLAB_UNRECLAIMABLE, nr_pages);
- for (i = 0; i < nr_pages; i++)
+ for (i = 0; i < nr_pages; i++) {
__SetPageSlab(page + i);
+ if (page->pfmemalloc)
+ SetPageSlabPfmemalloc(page + i);
+ }
+
if (kmemcheck_enabled && !(cachep->flags & SLAB_NOTRACK)) {
kmemcheck_alloc_shadow(page, cachep->gfporder, flags, nodeid);
@@ -1802,6 +1931,7 @@ static void kmem_freepages(struct kmem_cache *cachep, void *addr)
while (i--) {
BUG_ON(!PageSlab(page));
__ClearPageSlab(page);
+ __ClearPageSlabPfmemalloc(page);
page++;
}
if (current->reclaim_state)
@@ -3071,16 +3201,19 @@ bad:
#define check_slabp(x,y) do { } while(0)
#endif
-static void *cache_alloc_refill(struct kmem_cache *cachep, gfp_t flags)
+static void *cache_alloc_refill(struct kmem_cache *cachep, gfp_t flags,
+ bool force_refill)
{
int batchcount;
struct kmem_list3 *l3;
struct array_cache *ac;
int node;
-retry:
check_irq_off();
node = numa_mem_id();
+ if (unlikely(force_refill))
+ goto force_grow;
+retry:
ac = cpu_cache_get(cachep);
batchcount = ac->batchcount;
if (!ac->touched && batchcount > BATCHREFILL_LIMIT) {
@@ -3130,8 +3263,8 @@ retry:
STATS_INC_ACTIVE(cachep);
STATS_SET_HIGH(cachep);
- ac->entry[ac->avail++] = slab_get_obj(cachep, slabp,
- node);
+ ac_put_obj(cachep, ac, slab_get_obj(cachep, slabp,
+ node));
}
check_slabp(cachep, slabp);
@@ -3150,18 +3283,22 @@ alloc_done:
if (unlikely(!ac->avail)) {
int x;
+force_grow:
x = cache_grow(cachep, flags | GFP_THISNODE, node, NULL);
/* cache_grow can reenable interrupts, then ac could change. */
ac = cpu_cache_get(cachep);
- if (!x && ac->avail == 0) /* no objects in sight? abort */
+
+ /* no objects in sight? abort */
+ if (!x && (ac->avail == 0 || force_refill))
return NULL;
if (!ac->avail) /* objects refilled by interrupt? */
goto retry;
}
ac->touched = 1;
- return ac->entry[--ac->avail];
+
+ return ac_get_obj(cachep, ac, flags, force_refill);
}
static inline void cache_alloc_debugcheck_before(struct kmem_cache *cachep,
@@ -3243,23 +3380,35 @@ static inline void *____cache_alloc(struct kmem_cache *cachep, gfp_t flags)
{
void *objp;
struct array_cache *ac;
+ bool force_refill = false;
check_irq_off();
ac = cpu_cache_get(cachep);
if (likely(ac->avail)) {
- STATS_INC_ALLOCHIT(cachep);
ac->touched = 1;
- objp = ac->entry[--ac->avail];
- } else {
- STATS_INC_ALLOCMISS(cachep);
- objp = cache_alloc_refill(cachep, flags);
+ objp = ac_get_obj(cachep, ac, flags, false);
+
/*
- * the 'ac' may be updated by cache_alloc_refill(),
- * and kmemleak_erase() requires its correct value.
+ * Allow for the possibility all avail objects are not allowed
+ * by the current flags
*/
- ac = cpu_cache_get(cachep);
+ if (objp) {
+ STATS_INC_ALLOCHIT(cachep);
+ goto out;
+ }
+ force_refill = true;
}
+
+ STATS_INC_ALLOCMISS(cachep);
+ objp = cache_alloc_refill(cachep, flags, force_refill);
+ /*
+ * the 'ac' may be updated by cache_alloc_refill(),
+ * and kmemleak_erase() requires its correct value.
+ */
+ ac = cpu_cache_get(cachep);
+
+out:
/*
* To avoid a false negative, if an object that is in one of the
* per-CPU caches is leaked, we need to make sure kmemleak doesn't
@@ -3578,9 +3727,12 @@ static void free_block(struct kmem_cache *cachep, void **objpp, int nr_objects,
struct kmem_list3 *l3;
for (i = 0; i < nr_objects; i++) {
- void *objp = objpp[i];
+ void *objp;
struct slab *slabp;
+ clear_obj_pfmemalloc(&objpp[i]);
+ objp = objpp[i];
+
slabp = virt_to_slab(objp);
l3 = cachep->nodelists[node];
list_del(&slabp->list);
@@ -3693,12 +3845,12 @@ static inline void __cache_free(struct kmem_cache *cachep, void *objp,
if (likely(ac->avail < ac->limit)) {
STATS_INC_FREEHIT(cachep);
- ac->entry[ac->avail++] = objp;
+ ac_put_obj(cachep, ac, objp);
return;
} else {
STATS_INC_FREEMISS(cachep);
cache_flusharray(cachep, ac);
- ac->entry[ac->avail++] = objp;
+ ac_put_obj(cachep, ac, objp);
}
}
diff --git a/mm/slub.c b/mm/slub.c
index 4907563..8eed0de 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -32,6 +32,8 @@
#include <trace/events/kmem.h>
+#include "internal.h"
+
/*
* Lock order:
* 1. slub_lock (Global Semaphore)
@@ -1364,6 +1366,8 @@ static struct page *new_slab(struct kmem_cache *s, gfp_t flags, int node)
inc_slabs_node(s, page_to_nid(page), page->objects);
page->slab = s;
page->flags |= 1 << PG_slab;
+ if (page->pfmemalloc)
+ SetPageSlabPfmemalloc(page);
start = page_address(page);
@@ -1408,6 +1412,7 @@ static void __free_slab(struct kmem_cache *s, struct page *page)
-pages);
__ClearPageSlab(page);
+ __ClearPageSlabPfmemalloc(page);
reset_page_mapcount(page);
if (current->reclaim_state)
current->reclaim_state->reclaimed_slab += pages;
@@ -2128,6 +2133,14 @@ static inline void *new_slab_objects(struct kmem_cache *s, gfp_t flags,
return object;
}
+static inline bool pfmemalloc_match(struct kmem_cache_cpu *c, gfp_t gfpflags)
+{
+ if (unlikely(PageSlabPfmemalloc(c->page)))
+ return gfp_pfmemalloc_allowed(gfpflags);
+
+ return true;
+}
+
/*
* Check the page->freelist of a page and either transfer the freelist to the per cpu freelist
* or deactivate the page.
@@ -2200,6 +2213,16 @@ redo:
goto new_slab;
}
+ /*
+ * By rights, we should be searching for a slab page that was
+ * PFMEMALLOC but right now, we are losing the pfmemalloc
+ * information when the page leaves the per-cpu allocator
+ */
+ if (unlikely(!pfmemalloc_match(c, gfpflags))) {
+ deactivate_slab(s, c);
+ goto new_slab;
+ }
+
/* must check again c->freelist in case of cpu migration or IRQ */
object = c->freelist;
if (object)
@@ -2304,8 +2327,8 @@ redo:
barrier();
object = c->freelist;
- if (unlikely(!object || !node_match(c, node)))
-
+ if (unlikely(!object || !node_match(c, node) ||
+ !pfmemalloc_match(c, gfpflags)))
object = __slab_alloc(s, gfpflags, node, addr, c);
else {
On Fri, 10 Feb 2012, Mel Gorman wrote:
> I have an updated version of this 02/15 patch below. It passed testing
> and is a lot less invasive than the previous release. As you suggested,
> it uses page flags and the bulk of the complexity is only executed if
> someone is using network-backed storage.
Hmmm.. hmm... Still modifies the hotpaths of the allocators for a
pretty exotic feature.
> > On top of that you want to add
> > special code in various subsystems to also do that over the network.
> > Sigh. I think we agreed a while back that we want to limit the amount of
> > I/O triggered from reclaim paths?
>
> Specifically we wanted to reduce or stop page reclaim calling ->writepage()
> for file-backed pages because it generated awful IO patterns and deep
> call stacks. We still write anonymous pages from page reclaim because we
> do not have a dedicated thread for writing to swap. It is expected that
> the call stack for writing to network storage would be less than a
> filesystem.
>
> > AFAICT many filesystems do not support
> > writeout from reclaim anymore because of all the issues that arise at that
> > level.
> >
>
> NBD is a block device so filesystem restrictions like you mention do not
> apply. In NFS, the direct_IO paths are used to write pages not
> ->writepage so again the restriction does not apply.
Block devices are a little simpler ok. But it is still not a desirable
thing to do (just think about raid and other complex filesystems that may
also have to do allocations). I do not think that block device writers
code with the VM in mind. In the case of network devices as block devices
we have a pretty serious problem since the network subsystem is certainly
not designed to be called from VM reclaim code that may be triggered
arbitrarily from deeply nested other code in the kernel. Implementing
something like this invites breakage all over the place to show up.
> index 8b3b8cf..6a3fa1c 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -695,6 +695,7 @@ static bool free_pages_prepare(struct page *page, unsigned int order)
> trace_mm_page_free(page, order);
> kmemcheck_free_shadow(page, order);
>
> + page->pfmemalloc = false;
> if (PageAnon(page))
> page->mapping = NULL;
> for (i = 0; i < (1 << order); i++)
> @@ -1221,6 +1222,7 @@ void free_hot_cold_page(struct page *page, int cold)
>
> migratetype = get_pageblock_migratetype(page);
> set_page_private(page, migratetype);
> + page->pfmemalloc = false;
> local_irq_save(flags);
> if (unlikely(wasMlocked))
> free_page_mlock(page);
page allocator hotpaths affected.
> diff --git a/mm/slab.c b/mm/slab.c
> index f0bd785..f322dc2 100644
> --- a/mm/slab.c
> +++ b/mm/slab.c
> @@ -123,6 +123,8 @@
>
> #include <trace/events/kmem.h>
>
> +#include "internal.h"
> +
> /*
> * DEBUG - 1 for kmem_cache_create() to honour; SLAB_RED_ZONE & SLAB_POISON.
> * 0 for faster, smaller code (especially in the critical paths).
> @@ -151,6 +153,12 @@
> #define ARCH_KMALLOC_FLAGS SLAB_HWCACHE_ALIGN
> #endif
>
> +/*
> + * true if a page was allocated from pfmemalloc reserves for network-based
> + * swap
> + */
> +static bool pfmemalloc_active;
Implying an additional cacheline use in critical slab paths? Hopefully
grouped with other variables already in cache.
> @@ -3243,23 +3380,35 @@ static inline void *____cache_alloc(struct kmem_cache *cachep, gfp_t flags)
> {
> void *objp;
> struct array_cache *ac;
> + bool force_refill = false;
... hitting the hotpath here.
> @@ -3693,12 +3845,12 @@ static inline void __cache_free(struct kmem_cache *cachep, void *objp,
>
> if (likely(ac->avail < ac->limit)) {
> STATS_INC_FREEHIT(cachep);
> - ac->entry[ac->avail++] = objp;
> + ac_put_obj(cachep, ac, objp);
> return;
> } else {
> STATS_INC_FREEMISS(cachep);
> cache_flusharray(cachep, ac);
> - ac->entry[ac->avail++] = objp;
> + ac_put_obj(cachep, ac, objp);
> }
> }
and here.
> diff --git a/mm/slub.c b/mm/slub.c
> index 4907563..8eed0de 100644
> --- a/mm/slub.c
> +++ b/mm/slub.c
> @@ -2304,8 +2327,8 @@ redo:
> barrier();
>
> object = c->freelist;
> - if (unlikely(!object || !node_match(c, node)))
> -
> + if (unlikely(!object || !node_match(c, node) ||
> + !pfmemalloc_match(c, gfpflags)))
> object = __slab_alloc(s, gfpflags, node, addr, c);
>
> else {
Modification to hotpath. That could be fixed here by forcing pfmemalloc
(like debug allocs) to always go to the slow path and checking in there
instead. Just keep c->freelist == NULL.
Proposal for a patch for slub to move the pfmemalloc handling out of the
fastpath by simply not assigning a per cpu slab when pfmemalloc processing
is going on.
Subject: [slub] Fix so that no mods are required for the fast path
Remove the check for pfmemalloc from the alloc hotpath and put the logic after
the election of a new per cpu slab.
For a pfmemalloc page do not use the fast path but force use of the slow
path (which is also used for the debug case).
Signed-off-by: Christoph Lameter <[email protected]>
---
mm/slub.c | 8 ++++----
1 file changed, 4 insertions(+), 4 deletions(-)
Index: linux-2.6/mm/slub.c
===================================================================
--- linux-2.6.orig/mm/slub.c 2012-02-10 09:58:13.066125970 -0600
+++ linux-2.6/mm/slub.c 2012-02-10 10:06:07.114113000 -0600
@@ -2273,11 +2273,12 @@ new_slab:
}
}
- if (likely(!kmem_cache_debug(s)))
+ if (likely(!kmem_cache_debug(s) && pfmemalloc_match(c, gfpflags)))
goto load_freelist;
+
/* Only entered in the debug case */
- if (!alloc_debug_processing(s, c->page, object, addr))
+ if (kmem_cache_debug(s) && !alloc_debug_processing(s, c->page, object, addr))
goto new_slab; /* Slab failed checks. Next slab needed */
c->freelist = get_freepointer(s, object);
@@ -2327,8 +2328,7 @@ redo:
barrier();
object = c->freelist;
- if (unlikely(!object || !node_match(c, node) ||
- !pfmemalloc_match(c, gfpflags)))
+ if (unlikely(!object || !node_match(c, node)))
object = __slab_alloc(s, gfpflags, node, addr, c);
else {
On Fri, Feb 10, 2012 at 04:07:57PM -0600, Christoph Lameter wrote:
> Proposal for a patch for slub to move the pfmemalloc handling out of the
> fastpath by simply not assigning a per cpu slab when pfmemalloc processing
> is going on.
>
>
>
> Subject: [slub] Fix so that no mods are required for the fast path
>
> Remove the check for pfmemalloc from the alloc hotpath and put the logic after
> the election of a new per cpu slab.
>
> For a pfmemalloc page do not use the fast path but force use of the slow
> path (which is also used for the debug case).
>
> Signed-off-by: Christoph Lameter <[email protected]>
>
This weakens pfmemalloc processing in the following way
1. Process that is performing network swap calls __slab_alloc.
pfmemalloc_match is true so the freelist is loaded and
c->freelist is now pointing to a pfmemalloc page
2. Process that is attempting normal allocations calls slab_alloc,
finds the pfmemalloc page on the freelist and uses it because it
did not check pfmemalloc_match()
The patch allows non-pfmemalloc allocations to use pfmemalloc pages with
the kmalloc slabs being the most vulnerable caches on the grounds they are
most likely to have a mix of pfmemalloc and !pfmemalloc requests. Patch
14 will still protect the system as processes will get throttled if the
pfmemalloc reserves get depleted so performance will not degrade as smoothly.
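To make the window concrete, the allocation path after the change looks
roughly like this (a sketch of the hunk above with comments added, not the
exact code):

	object = c->freelist;
	if (unlikely(!object || !node_match(c, node)))
		/* slow path: pfmemalloc_match() is now only checked in here */
		object = __slab_alloc(s, gfpflags, node, addr, c);
	else {
		/*
		 * fast path: the object is taken straight off c->freelist
		 * with no pfmemalloc check, even for a !PFMEMALLOC caller,
		 * if a swapping process left a pfmemalloc page loaded in
		 * step 1 above
		 */
		...
	}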
Assuming this passes testing, I'll add the patch to the series with the
information above included in the changelog.
Thanks Christoph.
--
Mel Gorman
SUSE Labs
On Fri, Feb 10, 2012 at 03:01:37PM -0600, Christoph Lameter wrote:
> On Fri, 10 Feb 2012, Mel Gorman wrote:
>
> > I have an updated version of this 02/15 patch below. It passed testing
> > and is a lot less invasive than the previous release. As you suggested,
> > it uses page flags and the bulk of the complexity is only executed if
> > someone is using network-backed storage.
>
> Hmmm.. hmm... Still modifies the hotpaths of the allocators for a
> pretty exotic feature.
>
> > > On top of that you want to add
> > > special code in various subsystems to also do that over the network.
> > > Sigh. I think we agreed a while back that we want to limit the amount of
> > > I/O triggered from reclaim paths?
> >
> > Specifically we wanted to reduce or stop page reclaim calling ->writepage()
> > for file-backed pages because it generated awful IO patterns and deep
> > call stacks. We still write anonymous pages from page reclaim because we
> > do not have a dedicated thread for writing to swap. It is expected that
> > the call stack for writing to network storage would be less than a
> > filesystem.
> >
> > > AFAICT many filesystems do not support
> > > writeout from reclaim anymore because of all the issues that arise at that
> > > level.
> > >
> >
> > NBD is a block device so filesystem restrictions like you mention do not
> > apply. In NFS, the direct_IO paths are used to write pages not
> > ->writepage so again the restriction does not apply.
>
> Block devices are a little simpler ok. But it is still not a desirable
> thing to do (just think about raid and other complex filesystems that may
> also have to do allocations).
Swap IO is never desirable but it has to happen somehow and right now, we
only initiate swap IO from direct reclaim or kswapd. For complex filesystems,
it is mandatory, if they are using direct_IO like I do for NFS, that they
pin the necessary structures in advance to avoid any allocations in their
reclaim path. I do not expect RAID to be used over network-based swap files.
> I do not think that block device writers
> code with the VM in mind. In the case of network devices as block devices
> we have a pretty serious problem since the network subsystem is certainly
> not designed to be called from VM reclaim code that may be triggered
> arbitrarily from deeply nested other code in the kernel. Implementing
> something like this invites breakage all over the place to show up.
>
The whole point of the series is to allow the possibility of using
network-based swap devices starting with NBD and with NFS in the related
series. swap-over-NFS has been used for the last few years by enterprise
distros and while bugs do get reported, they are rare.
As the person that introduced this, I would support it in mainline for
NBD and NFS if it was merged.
> > index 8b3b8cf..6a3fa1c 100644
> > --- a/mm/page_alloc.c
> > +++ b/mm/page_alloc.c
> > @@ -695,6 +695,7 @@ static bool free_pages_prepare(struct page *page, unsigned int order)
> > trace_mm_page_free(page, order);
> > kmemcheck_free_shadow(page, order);
> >
> > + page->pfmemalloc = false;
> > if (PageAnon(page))
> > page->mapping = NULL;
> > for (i = 0; i < (1 << order); i++)
> > @@ -1221,6 +1222,7 @@ void free_hot_cold_page(struct page *page, int cold)
> >
> > migratetype = get_pageblock_migratetype(page);
> > set_page_private(page, migratetype);
> > + page->pfmemalloc = false;
> > local_irq_save(flags);
> > if (unlikely(wasMlocked))
> > free_page_mlock(page);
>
> page allocator hotpaths affected.
>
I can remove these but then page->pfmemalloc has to be set on the allocation
side. It's a single write to a dirty cache line that is already local to
the processor. It's not measurable although I accept that the page
allocator paths could do with a diet in general.
> > diff --git a/mm/slab.c b/mm/slab.c
> > index f0bd785..f322dc2 100644
> > --- a/mm/slab.c
> > +++ b/mm/slab.c
> > @@ -123,6 +123,8 @@
> >
> > #include <trace/events/kmem.h>
> >
> > +#include "internal.h"
> > +
> > /*
> > * DEBUG - 1 for kmem_cache_create() to honour; SLAB_RED_ZONE & SLAB_POISON.
> > * 0 for faster, smaller code (especially in the critical paths).
> > @@ -151,6 +153,12 @@
> > #define ARCH_KMALLOC_FLAGS SLAB_HWCACHE_ALIGN
> > #endif
> >
> > +/*
> > + * true if a page was allocated from pfmemalloc reserves for network-based
> > + * swap
> > + */
> > +static bool pfmemalloc_active;
>
> Implying an additional cacheline use in critical slab paths?
This was the alternative to altering the slub structures.
> Hopefully grouped with other variables already in cache.
>
This is the expectation. I considered tagging it read_mostly but didn't
at the time. I will now because it is genuinely expected to be
read-mostly and in the case where it is being written to, we're also
using network-based swap and the cost of a cache miss will be negligible
in comparison to swapping under memory pressure to a network.
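i.e. something along the lines of

-static bool pfmemalloc_active;
+static bool pfmemalloc_active __read_mostly;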
> > @@ -3243,23 +3380,35 @@ static inline void *____cache_alloc(struct kmem_cache *cachep, gfp_t flags)
> > {
> > void *objp;
> > struct array_cache *ac;
> > + bool force_refill = false;
>
> ... hitting the hotpath here.
>
> > @@ -3693,12 +3845,12 @@ static inline void __cache_free(struct kmem_cache *cachep, void *objp,
> >
> > if (likely(ac->avail < ac->limit)) {
> > STATS_INC_FREEHIT(cachep);
> > - ac->entry[ac->avail++] = objp;
> > + ac_put_obj(cachep, ac, objp);
> > return;
> > } else {
> > STATS_INC_FREEMISS(cachep);
> > cache_flusharray(cachep, ac);
> > - ac->entry[ac->avail++] = objp;
> > + ac_put_obj(cachep, ac, objp);
> > }
> > }
>
> and here.
>
The impact of ac_put_obj() is reduced in a later patch and becomes just
an additional read of a global variable. I did not see an obvious way to
ensure pfmemalloc pages were not used for !pfmemalloc allocations
without having some sort of impact.
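To sketch what that later reduction looks like (assuming the
sk_memalloc_socks() check that patch 12 adds; names approximate, not the
exact code):

static inline void ac_put_obj(struct kmem_cache *cachep,
			      struct array_cache *ac, void *objp)
{
	/* Only take the pfmemalloc path if network swap is in use at all */
	if (unlikely(sk_memalloc_socks()))
		objp = __ac_put_obj(cachep, ac, objp);

	ac->entry[ac->avail++] = objp;
}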
>
> > diff --git a/mm/slub.c b/mm/slub.c
> > index 4907563..8eed0de 100644
> > --- a/mm/slub.c
> > +++ b/mm/slub.c
>
> > @@ -2304,8 +2327,8 @@ redo:
> > barrier();
> >
> > object = c->freelist;
> > - if (unlikely(!object || !node_match(c, node)))
> > -
> > + if (unlikely(!object || !node_match(c, node) ||
> > + !pfmemalloc_match(c, gfpflags)))
> > object = __slab_alloc(s, gfpflags, node, addr, c);
> >
> > else {
>
>
> Modification to hotpath. That could be fixed here by forcing pfmemalloc
> (like debug allocs) to always go to the slow path and checking in there
> instead. Just keep c->freelist == NULL.
>
I picked up your patch for this, thanks.
--
Mel Gorman
SUSE Labs