2008-01-24 04:19:16

by KOSAKI Motohiro

Subject: [RFC][PATCH 0/8] mem_notify v5

Hi

/dev/mem_notify is a low memory notification device.
It can help avoid swap thrashing and OOM by cooperating with user processes.

You need not be annoyed by OOM any longer :)
Please send any comments!

patch list
[1/8] introduce poll_wait_exclusive() new API
[2/8] introduce wake_up_locked_nr() new API
[3/8] introduce /dev/mem_notify new device (the core of this patch series)
[4/8] memory_pressure_notify() caller
[5/8] add new mem_notify field to /proc/zoneinfo
[6/8] (optional) fixed incorrect shrink_zone
[7/8] ignore very small zone for prevent incorrect low mem notify.
[8/8] support fasync feature


related discussion:
--------------------------------------------------------------
LKML OOM notifications requirement discussion
http://www.gossamer-threads.com/lists/linux/kernel/832802?nohighlight=1#832802
OOM notifications patch [Marcelo Tosatti]
http://marc.info/?l=linux-kernel&m=119273914027743&w=2
mem notifications v3 [Marcelo Tosatti]
http://marc.info/?l=linux-mm&m=119852828327044&w=2
Thrashing notification patch [Daniel Spang]
http://marc.info/?l=linux-mm&m=119427416315676&w=2
mem notification v4 [kosaki]
http://marc.info/?l=linux-mm&m=120035840523718&w=2


Changelog
-------------------------------------------------
v4 -> v5 (by KOSAKI Motohiro)
o rebase to 2.6.24-rc8-mm1
o change display order of /proc/zoneinfo
o ignore very small zone
o support fcntl(F_SETFL, FASYNC)
o fix some trivial bugs.

v3 -> v4 (by KOSAKI Motohiro)
o rebase to 2.6.24-rc6-mm1
o avoid wake up all.
o add judgement point to __free_one_page().
o add zone awareness.

v2 -> v3 (by Marcelo Tosatti)
o change the notification point to happen whenever
the VM moves an anonymous page to the inactive list.
o implement notification rate limit.

v1(oom notify) -> v2 (by Marcelo Tosatti)
o name change
o notify timing changed from during swap thrashing to
just before thrashing.
o also works with swapless devices.





2008-01-24 04:20:27

by KOSAKI Motohiro

Subject: [RFC][PATCH 1/8] mem_notify v5: introduce poll_wait_exclusive() new API

There are two ways of adding an item to a wait queue:
1. add_wait_queue()
2. add_wait_queue_exclusive()
and add_wait_queue_exclusive() is a very useful API.

Unfortunately, there is no poll_wait_exclusive() counterpart to poll_wait().
That means there is no way to wake up just one of the polling processes:
wake_up() wakes every process sleeping via poll_wait(), not one.

This patch introduces a new API, poll_wait_exclusive(), to allow waking up only one process.

<example of usage>
unsigned int kosaki_poll(struct file *file,
			 struct poll_table_struct *wait)
{
	poll_wait_exclusive(file, &kosaki_wait_queue, wait);
	if (data_exist)
		return POLLIN | POLLRDNORM;
	return 0;
}
</example of usage>


Signed-off-by: Marcelo Tosatti <[email protected]>
Signed-off-by: KOSAKI Motohiro <[email protected]>

---
fs/eventpoll.c | 7 +++++--
fs/select.c | 9 ++++++---
include/linux/poll.h | 11 +++++++++--
3 files changed, 20 insertions(+), 7 deletions(-)



Index: linux-2.6.24-rc6-mm1-memnotify/fs/eventpoll.c
===================================================================
--- linux-2.6.24-rc6-mm1-memnotify.orig/fs/eventpoll.c 2008-01-17 18:28:15.000000000 +0900
+++ linux-2.6.24-rc6-mm1-memnotify/fs/eventpoll.c 2008-01-17 18:55:47.000000000 +0900
@@ -675,7 +675,7 @@ out_unlock:
* target file wakeup lists.
*/
static void ep_ptable_queue_proc(struct file *file, wait_queue_head_t *whead,
- poll_table *pt)
+ poll_table *pt, int exclusive)
{
struct epitem *epi = ep_item_from_epqueue(pt);
struct eppoll_entry *pwq;
@@ -684,7 +684,10 @@ static void ep_ptable_queue_proc(struct
init_waitqueue_func_entry(&pwq->wait, ep_poll_callback);
pwq->whead = whead;
pwq->base = epi;
- add_wait_queue(whead, &pwq->wait);
+ if (exclusive)
+ add_wait_queue_exclusive(whead, &pwq->wait);
+ else
+ add_wait_queue(whead, &pwq->wait);
list_add_tail(&pwq->llink, &epi->pwqlist);
epi->nwait++;
} else {
Index: linux-2.6.24-rc6-mm1-memnotify/fs/select.c
===================================================================
--- linux-2.6.24-rc6-mm1-memnotify.orig/fs/select.c 2008-01-17 18:28:23.000000000 +0900
+++ linux-2.6.24-rc6-mm1-memnotify/fs/select.c 2008-01-17 18:55:47.000000000 +0900
@@ -48,7 +48,7 @@ struct poll_table_page {
* poll table.
*/
static void __pollwait(struct file *filp, wait_queue_head_t *wait_address,
- poll_table *p);
+ poll_table *p, int exclusive);

void poll_initwait(struct poll_wqueues *pwq)
{
@@ -117,7 +117,7 @@ static struct poll_table_entry *poll_get

/* Add a new entry */
static void __pollwait(struct file *filp, wait_queue_head_t *wait_address,
- poll_table *p)
+ poll_table *p, int exclusive)
{
struct poll_table_entry *entry = poll_get_entry(p);
if (!entry)
@@ -126,7 +126,10 @@ static void __pollwait(struct file *filp
entry->filp = filp;
entry->wait_address = wait_address;
init_waitqueue_entry(&entry->wait, current);
- add_wait_queue(wait_address, &entry->wait);
+ if (exclusive)
+ add_wait_queue_exclusive(wait_address, &entry->wait);
+ else
+ add_wait_queue(wait_address, &entry->wait);
}

#define FDS_IN(fds, n) (fds->in + n)
Index: linux-2.6.24-rc6-mm1-memnotify/include/linux/poll.h
===================================================================
--- linux-2.6.24-rc6-mm1-memnotify.orig/include/linux/poll.h 2008-01-17 18:28:32.000000000 +0900
+++ linux-2.6.24-rc6-mm1-memnotify/include/linux/poll.h 2008-01-17 18:55:47.000000000 +0900
@@ -28,7 +28,8 @@ struct poll_table_struct;
/*
* structures and helpers for f_op->poll implementations
*/
-typedef void (*poll_queue_proc)(struct file *, wait_queue_head_t *, struct poll_table_struct *);
+typedef void (*poll_queue_proc)(struct file *, wait_queue_head_t *,
+ struct poll_table_struct *, int);

typedef struct poll_table_struct {
poll_queue_proc qproc;
@@ -37,7 +38,13 @@ typedef struct poll_table_struct {
static inline void poll_wait(struct file * filp, wait_queue_head_t * wait_address, poll_table *p)
{
if (p && wait_address)
- p->qproc(filp, wait_address, p);
+ p->qproc(filp, wait_address, p, 0);
+}
+
+static inline void poll_wait_exclusive(struct file *filp, wait_queue_head_t *wait_address, poll_table *p)
+{
+ if (p && wait_address)
+ p->qproc(filp, wait_address, p, 1);
}

static inline void init_poll_funcptr(poll_table *pt, poll_queue_proc qproc)

2008-01-24 04:20:56

by KOSAKI Motohiro

Subject: [RFC][PATCH 2/8] mem_notify v5: introduce wake_up_locked_nr() new API


Introduce two new APIs, wake_up_locked_nr() and wake_up_locked_all().
They are similar to wake_up_nr() and wake_up_all(), but are called with
the wait queue lock already held rather than taking it themselves.

Signed-off-by: Marcelo Tosatti <[email protected]>
Signed-off-by: KOSAKI Motohiro <[email protected]>

---
include/linux/wait.h | 7 +++++--
kernel/sched.c | 5 +++--
2 files changed, 8 insertions(+), 4 deletions(-)

Index: linux-2.6.24-rc6-mm1-memnotify/include/linux/wait.h
===================================================================
--- linux-2.6.24-rc6-mm1-memnotify.orig/include/linux/wait.h 2008-01-17 18:28:33.000000000 +0900
+++ linux-2.6.24-rc6-mm1-memnotify/include/linux/wait.h 2008-01-17 18:56:16.000000000 +0900
@@ -142,7 +142,7 @@ static inline void __remove_wait_queue(w
}

void FASTCALL(__wake_up(wait_queue_head_t *q, unsigned int mode, int nr, void *key));
-extern void FASTCALL(__wake_up_locked(wait_queue_head_t *q, unsigned int mode));
+void FASTCALL(__wake_up_locked(wait_queue_head_t *q, unsigned int mode, int nr, void *key));
extern void FASTCALL(__wake_up_sync(wait_queue_head_t *q, unsigned int mode, int nr));
void FASTCALL(__wake_up_bit(wait_queue_head_t *, void *, int));
int FASTCALL(__wait_on_bit(wait_queue_head_t *, struct wait_bit_queue *, int (*)(void *), unsigned));
@@ -155,7 +155,10 @@ wait_queue_head_t *FASTCALL(bit_waitqueu
#define wake_up(x) __wake_up(x, TASK_NORMAL, 1, NULL)
#define wake_up_nr(x, nr) __wake_up(x, TASK_NORMAL, nr, NULL)
#define wake_up_all(x) __wake_up(x, TASK_NORMAL, 0, NULL)
-#define wake_up_locked(x) __wake_up_locked((x), TASK_NORMAL)
+
+#define wake_up_locked(x) __wake_up_locked((x), TASK_NORMAL, 1, NULL)
+#define wake_up_locked_nr(x, nr) __wake_up_locked((x), TASK_NORMAL, nr, NULL)
+#define wake_up_locked_all(x) __wake_up_locked((x), TASK_NORMAL, 0, NULL)

#define wake_up_interruptible(x) __wake_up(x, TASK_INTERRUPTIBLE, 1, NULL)
#define wake_up_interruptible_nr(x, nr) __wake_up(x, TASK_INTERRUPTIBLE, nr, NULL)
Index: linux-2.6.24-rc6-mm1-memnotify/kernel/sched.c
===================================================================
--- linux-2.6.24-rc6-mm1-memnotify.orig/kernel/sched.c 2008-01-17 18:31:12.000000000 +0900
+++ linux-2.6.24-rc6-mm1-memnotify/kernel/sched.c 2008-01-17 18:56:16.000000000 +0900
@@ -3837,9 +3837,10 @@ EXPORT_SYMBOL(__wake_up);
/*
* Same as __wake_up but called with the spinlock in wait_queue_head_t held.
*/
-void __wake_up_locked(wait_queue_head_t *q, unsigned int mode)
+void __wake_up_locked(wait_queue_head_t *q, unsigned int mode,
+ int nr_exclusive, void *key)
{
- __wake_up_common(q, mode, 1, 0, NULL);
+ __wake_up_common(q, mode, nr_exclusive, 0, key);
}

/**

2008-01-24 04:21:36

by KOSAKI Motohiro

Subject: [RFC][PATCH 3/8] mem_notify v5: introduce /dev/mem_notify new device (the core of this patch series)


This is the core of the patch series.
It adds the /dev/mem_notify device, which notifies user processes of low memory.

<usage example>

	fd = open("/dev/mem_notify", O_RDONLY);
	if (fd < 0) {
		exit(1);
	}

	pollfds.fd = fd;
	pollfds.events = POLLIN;
	pollfds.revents = 0;

	err = poll(&pollfds, 1, -1); /* wake up at low memory */

	...
</usage example>


Signed-off-by: Marcelo Tosatti <[email protected]>
Signed-off-by: KOSAKI Motohiro <[email protected]>

---
Documentation/devices.txt | 1
drivers/char/mem.c | 6 ++
include/linux/mem_notify.h | 42 ++++++++++++++++
include/linux/mmzone.h | 1
mm/Makefile | 2
mm/mem_notify.c | 114 +++++++++++++++++++++++++++++++++++++++++++++
mm/page_alloc.c | 1
7 files changed, 166 insertions(+), 1 deletion(-)

Index: b/drivers/char/mem.c
===================================================================
--- a/drivers/char/mem.c 2008-01-23 19:21:34.000000000 +0900
+++ b/drivers/char/mem.c 2008-01-23 21:12:44.000000000 +0900
@@ -34,6 +34,8 @@
# include <linux/efi.h>
#endif

+extern struct file_operations mem_notify_fops;
+
/*
* Architectures vary in how they handle caching for addresses
* outside of main memory.
@@ -869,6 +871,9 @@ static int memory_open(struct inode * in
filp->f_op = &oldmem_fops;
break;
#endif
+ case 13:
+ filp->f_op = &mem_notify_fops;
+ break;
default:
return -ENXIO;
}
@@ -901,6 +906,7 @@ static const struct {
#ifdef CONFIG_CRASH_DUMP
{12,"oldmem", S_IRUSR | S_IWUSR | S_IRGRP, &oldmem_fops},
#endif
+ {13,"mem_notify", S_IRUGO, &mem_notify_fops},
};

static struct class *mem_class;
Index: b/include/linux/mem_notify.h
===================================================================
--- /dev/null 1970-01-01 00:00:00.000000000 +0000
+++ b/include/linux/mem_notify.h 2008-01-23 23:09:32.000000000 +0900
@@ -0,0 +1,42 @@
+/*
+ * Notify applications of memory pressure via /dev/mem_notify
+ *
+ * Copyright (C) 2008 Marcelo Tosatti <[email protected]>,
+ * KOSAKI Motohiro <[email protected]>
+ *
+ * Released under the GPL, see the file COPYING for details.
+ */
+
+#ifndef _LINUX_MEM_NOTIFY_H
+#define _LINUX_MEM_NOTIFY_H
+
+#define MEM_NOTIFY_FREQ (HZ/5)
+
+extern atomic_long_t last_mem_notify;
+
+extern void __memory_pressure_notify(struct zone *zone, int pressure);
+
+
+static inline void memory_pressure_notify(struct zone *zone, int pressure)
+{
+ unsigned long target;
+ unsigned long pages_high, pages_free, pages_reserve;
+
+ if (pressure) {
+ target = atomic_long_read(&last_mem_notify) + MEM_NOTIFY_FREQ;
+ if (likely(time_before(jiffies, target)))
+ return;
+
+ pages_high = zone->pages_high;
+ pages_free = zone_page_state(zone, NR_FREE_PAGES);
+ pages_reserve = zone->lowmem_reserve[MAX_NR_ZONES-1];
+ if (unlikely(pages_free > (pages_high+pages_reserve)*2))
+ return;
+
+ } else if (likely(!zone->mem_notify_status))
+ return;
+
+ __memory_pressure_notify(zone, pressure);
+}
+
+#endif /* _LINUX_MEM_NOTIFY_H */
Index: b/include/linux/mmzone.h
===================================================================
--- a/include/linux/mmzone.h 2008-01-23 19:22:56.000000000 +0900
+++ b/include/linux/mmzone.h 2008-01-23 21:12:44.000000000 +0900
@@ -283,6 +283,7 @@ struct zone {
*/
int prev_priority;

+ int mem_notify_status;

ZONE_PADDING(_pad2_)
/* Rarely used or read-mostly fields */
Index: b/mm/Makefile
===================================================================
--- a/mm/Makefile 2008-01-23 19:22:28.000000000 +0900
+++ b/mm/Makefile 2008-01-23 21:12:44.000000000 +0900
@@ -11,7 +11,7 @@ obj-y := bootmem.o filemap.o mempool.o
page_alloc.o page-writeback.o pdflush.o \
readahead.o swap.o truncate.o vmscan.o \
prio_tree.o util.o mmzone.o vmstat.o backing-dev.o \
- page_isolation.o $(mmu-y)
+ page_isolation.o mem_notify.o $(mmu-y)

obj-$(CONFIG_PROC_PAGE_MONITOR) += pagewalk.o
obj-$(CONFIG_BOUNCE) += bounce.o
Index: b/mm/mem_notify.c
===================================================================
--- /dev/null 1970-01-01 00:00:00.000000000 +0000
+++ b/mm/mem_notify.c 2008-01-23 23:09:31.000000000 +0900
@@ -0,0 +1,114 @@
+/*
+ * Notify applications of memory pressure via /dev/mem_notify
+ *
+ * Copyright (C) 2008 Marcelo Tosatti <[email protected]>,
+ * KOSAKI Motohiro <[email protected]>
+ *
+ * Released under the GPL, see the file COPYING for details.
+ */
+
+#include <linux/module.h>
+#include <linux/fs.h>
+#include <linux/wait.h>
+#include <linux/poll.h>
+#include <linux/timer.h>
+#include <linux/spinlock.h>
+#include <linux/mm.h>
+#include <linux/vmstat.h>
+#include <linux/percpu.h>
+#include <linux/timer.h>
+
+#include <asm/atomic.h>
+
+#define PROC_WAKEUP_GUARD (10*HZ)
+
+struct mem_notify_file_info {
+ unsigned long last_proc_notify;
+};
+
+static DECLARE_WAIT_QUEUE_HEAD(mem_wait);
+static atomic_long_t nr_under_memory_pressure_zones = ATOMIC_LONG_INIT(0);
+static atomic_t nr_watcher_task = ATOMIC_INIT(0);
+
+atomic_long_t last_mem_notify = ATOMIC_LONG_INIT(INITIAL_JIFFIES);
+
+void __memory_pressure_notify(struct zone* zone, int pressure)
+{
+ int nr_wakeup;
+ unsigned long flags;
+
+ spin_lock_irqsave(&mem_wait.lock, flags);
+
+ if (pressure != zone->mem_notify_status) {
+ long val = pressure ? 1 : -1;
+ atomic_long_add(val, &nr_under_memory_pressure_zones);
+ zone->mem_notify_status = pressure;
+ }
+
+ if (pressure) {
+ int nr_watcher = atomic_read(&nr_watcher_task);
+
+ nr_wakeup = (nr_watcher >> 4) + 1;
+ if (unlikely(nr_wakeup > 100))
+ nr_wakeup = 100;
+
+ atomic_long_set(&last_mem_notify, jiffies);
+ wake_up_locked_nr(&mem_wait, nr_wakeup);
+ }
+
+ spin_unlock_irqrestore(&mem_wait.lock, flags);
+}
+
+static int mem_notify_open(struct inode *inode, struct file *file)
+{
+ struct mem_notify_file_info *info;
+ int err = 0;
+
+ info = kmalloc(sizeof(*info), GFP_KERNEL);
+ if (!info) {
+ err = -ENOMEM;
+ goto out;
+ }
+
+ info->last_proc_notify = INITIAL_JIFFIES;
+ file->private_data = info;
+ atomic_inc(&nr_watcher_task);
+out:
+ return err;
+}
+
+static int mem_notify_release(struct inode *inode, struct file *file)
+{
+ kfree(file->private_data);
+ atomic_dec(&nr_watcher_task);
+ return 0;
+}
+
+static unsigned int mem_notify_poll(struct file *file, poll_table *wait)
+{
+ struct mem_notify_file_info *info = file->private_data;
+ unsigned long now = jiffies;
+ unsigned long timeout;
+ unsigned int retval = 0;
+
+ poll_wait_exclusive(file, &mem_wait, wait);
+
+ timeout = info->last_proc_notify + PROC_WAKEUP_GUARD;
+ if (time_before(now, timeout))
+ goto out;
+
+ if (atomic_long_read(&nr_under_memory_pressure_zones) != 0) {
+ info->last_proc_notify = now;
+ retval = POLLIN;
+ }
+
+out:
+ return retval;
+}
+
+struct file_operations mem_notify_fops = {
+ .open = mem_notify_open,
+ .release = mem_notify_release,
+ .poll = mem_notify_poll,
+};
+EXPORT_SYMBOL(mem_notify_fops);
Index: b/mm/page_alloc.c
===================================================================
--- a/mm/page_alloc.c 2008-01-23 19:22:28.000000000 +0900
+++ b/mm/page_alloc.c 2008-01-23 23:09:42.000000000 +0900
@@ -3458,6 +3458,7 @@ static void __meminit free_area_init_cor
zone->zone_pgdat = pgdat;

zone->prev_priority = DEF_PRIORITY;
+ zone->mem_notify_status = 0;

zone_pcp_init(zone);
INIT_LIST_HEAD(&zone->active_list);
Index: b/Documentation/devices.txt
===================================================================
--- a/Documentation/devices.txt 2008-01-23 19:22:33.000000000 +0900
+++ b/Documentation/devices.txt 2008-01-23 21:12:44.000000000 +0900
@@ -96,6 +96,7 @@ Your cooperation is appreciated.
11 = /dev/kmsg Writes to this come out as printk's
12 = /dev/oldmem Used by crashdump kernels to access
the memory of the kernel that crashed.
+ 13 = /dev/mem_notify Low memory notification.

1 block RAM disk
0 = /dev/ram0 First RAM disk

2008-01-24 04:23:05

by KOSAKI Motohiro

Subject: [RFC][PATCH 4/8] mem_notify v5: memory_pressure_notify() caller

The notification point happens whenever the VM moves an
anonymous page to the inactive list - this is a pretty good indication
that there are unused anonymous pages present which will very likely be
swapped out soon.

Conversely, a zone is judged to be out of trouble in the following situations:
o memory pressure decreases and the VM stops moving anonymous pages to the inactive list.
o free pages increase beyond (pages_high+lowmem_reserve)*2.


ChangeLog:
v5: add an out-of-trouble notification at the exit of balance_pgdat().


Signed-off-by: Marcelo Tosatti <[email protected]>
Signed-off-by: KOSAKI Motohiro <[email protected]>

---
mm/page_alloc.c | 12 ++++++++++++
mm/vmscan.c | 26 ++++++++++++++++++++++++++
2 files changed, 38 insertions(+)

Index: b/mm/vmscan.c
===================================================================
--- a/mm/vmscan.c 2008-01-23 22:06:08.000000000 +0900
+++ b/mm/vmscan.c 2008-01-23 22:07:57.000000000 +0900
@@ -39,6 +39,7 @@
#include <linux/kthread.h>
#include <linux/freezer.h>
#include <linux/memcontrol.h>
+#include <linux/mem_notify.h>

#include <asm/tlbflush.h>
#include <asm/div64.h>
@@ -1089,10 +1090,14 @@ static void shrink_active_list(unsigned
struct page *page;
struct pagevec pvec;
int reclaim_mapped = 0;
+ bool inactivated_anon = 0;

if (sc->may_swap)
reclaim_mapped = calc_reclaim_mapped(sc, zone, priority);

+ if (!reclaim_mapped)
+ memory_pressure_notify(zone, 0);
+
lru_add_drain();
spin_lock_irq(&zone->lru_lock);
pgmoved = sc->isolate_pages(nr_pages, &l_hold, &pgscanned, sc->order,
@@ -1116,6 +1121,13 @@ static void shrink_active_list(unsigned
if (!reclaim_mapped ||
(total_swap_pages == 0 && PageAnon(page)) ||
page_referenced(page, 0, sc->mem_cgroup)) {
+ /* deal with the case where there is no
+ * swap but an anonymous page would be
+ * moved to the inactive list.
+ */
+ if (!total_swap_pages && reclaim_mapped &&
+ PageAnon(page))
+ inactivated_anon = 1;
list_add(&page->lru, &l_active);
continue;
}
@@ -1123,8 +1135,12 @@ static void shrink_active_list(unsigned
list_add(&page->lru, &l_active);
continue;
}
+ if (PageAnon(page))
+ inactivated_anon = 1;
list_add(&page->lru, &l_inactive);
}
+ if (inactivated_anon)
+ memory_pressure_notify(zone, 1);

pagevec_init(&pvec, 1);
pgmoved = 0;
@@ -1158,6 +1174,8 @@ static void shrink_active_list(unsigned
pagevec_strip(&pvec);
spin_lock_irq(&zone->lru_lock);
}
+ if (!reclaim_mapped)
+ memory_pressure_notify(zone, 0);

pgmoved = 0;
while (!list_empty(&l_active)) {
@@ -1659,6 +1677,14 @@ out:
goto loop_again;
}

+ for (i = pgdat->nr_zones - 1; i >= 0; i--) {
+ struct zone *zone = pgdat->node_zones + i;
+
+ if (!populated_zone(zone))
+ continue;
+ memory_pressure_notify(zone, 0);
+ }
+
return nr_reclaimed;
}

Index: b/mm/page_alloc.c
===================================================================
--- a/mm/page_alloc.c 2008-01-23 22:06:08.000000000 +0900
+++ b/mm/page_alloc.c 2008-01-23 23:09:32.000000000 +0900
@@ -44,6 +44,7 @@
#include <linux/fault-inject.h>
#include <linux/page-isolation.h>
#include <linux/memcontrol.h>
+#include <linux/mem_notify.h>

#include <asm/tlbflush.h>
#include <asm/div64.h>
@@ -435,6 +436,8 @@ static inline void __free_one_page(struc
unsigned long page_idx;
int order_size = 1 << order;
int migratetype = get_pageblock_migratetype(page);
+ unsigned long prev_free;
+ unsigned long notify_threshold;

if (unlikely(PageCompound(page)))
destroy_compound_page(page, order);
@@ -444,6 +447,7 @@ static inline void __free_one_page(struc
VM_BUG_ON(page_idx & (order_size - 1));
VM_BUG_ON(bad_range(zone, page));

+ prev_free = zone_page_state(zone, NR_FREE_PAGES);
__mod_zone_page_state(zone, NR_FREE_PAGES, order_size);
while (order < MAX_ORDER-1) {
unsigned long combined_idx;
@@ -465,6 +469,14 @@ static inline void __free_one_page(struc
list_add(&page->lru,
&zone->free_area[order].free_list[migratetype]);
zone->free_area[order].nr_free++;
+
+ notify_threshold = (zone->pages_high +
+ zone->lowmem_reserve[MAX_NR_ZONES-1]) * 2;
+
+ if (unlikely((zone->mem_notify_status == 1) &&
+ (prev_free <= notify_threshold) &&
+ (zone_page_state(zone, NR_FREE_PAGES) > notify_threshold)))
+ memory_pressure_notify(zone, 0);
}

static inline int free_pages_check(struct page *page)

2008-01-24 04:23:33

by KOSAKI Motohiro

Subject: [RFC][PATCH 5/8] mem_notify v5: add new mem_notify field to /proc/zoneinfo

Show the new member of struct zone via /proc/zoneinfo.

ChangeLog:
v5: move the new field to the end of the display order.


Signed-off-by: Marcelo Tosatti <[email protected]>
Signed-off-by: KOSAKI Motohiro <[email protected]>

---
mm/vmstat.c | 8 +++++---
1 file changed, 5 insertions(+), 3 deletions(-)

Index: b/mm/vmstat.c
===================================================================
--- a/mm/vmstat.c 2008-01-23 22:06:05.000000000 +0900
+++ b/mm/vmstat.c 2008-01-23 22:08:00.000000000 +0900
@@ -795,10 +795,12 @@ static void zoneinfo_show_print(struct s
seq_printf(m,
"\n all_unreclaimable: %u"
"\n prev_priority: %i"
- "\n start_pfn: %lu",
- zone_is_all_unreclaimable(zone),
+ "\n start_pfn: %lu"
+ "\n mem_notify_status: %i",
+ zone_is_all_unreclaimable(zone),
zone->prev_priority,
- zone->zone_start_pfn);
+ zone->zone_start_pfn,
+ zone->mem_notify_status);
seq_putc(m, '\n');
}



2008-01-24 04:24:17

by KOSAKI Motohiro

Subject: [RFC][PATCH 6/8] mem_notify v5: (optional) fixed incorrect shrink_zone


On x86, ZONE_DMA is very small.
It is often not used at all.

Unfortunately, even when NR_ACTIVE==0 and NR_INACTIVE==0, shrink_zone()
still tries to reclaim 1 page, because of the unconditional "+ 1" here:

	zone->nr_scan_active +=
		(zone_page_state(zone, NR_ACTIVE) >> priority) + 1;
		                                               ^^^^^

This causes unnecessary low memory notifications ;-)


ChangeLog
v5: new

---
mm/vmscan.c | 21 ++++++++++++++++-----
1 file changed, 16 insertions(+), 5 deletions(-)

Index: b/mm/vmscan.c
===================================================================
--- a/mm/vmscan.c 2008-01-18 14:18:27.000000000 +0900
+++ b/mm/vmscan.c 2008-01-18 14:49:06.000000000 +0900
@@ -948,7 +948,7 @@ static inline void note_zone_scanning_pr

static inline int zone_is_near_oom(struct zone *zone)
{
- return zone->pages_scanned >= (zone_page_state(zone, NR_ACTIVE)
+ return zone->pages_scanned > (zone_page_state(zone, NR_ACTIVE)
+ zone_page_state(zone, NR_INACTIVE))*3;
}

@@ -1214,18 +1214,29 @@ static unsigned long shrink_zone(int pri
unsigned long nr_inactive;
unsigned long nr_to_scan;
unsigned long nr_reclaimed = 0;
+ unsigned long tmp;
+ unsigned long zone_active;
+ unsigned long zone_inactive;

if (scan_global_lru(sc)) {
/*
* Add one to nr_to_scan just to make sure that the kernel
* will slowly sift through the active list.
*/
- zone->nr_scan_active +=
- (zone_page_state(zone, NR_ACTIVE) >> priority) + 1;
+ zone_active = zone_page_state(zone, NR_ACTIVE);
+ tmp = (zone_active >> priority) + 1;
+ if (unlikely(tmp > zone_active))
+ tmp = zone_active;
+ zone->nr_scan_active += tmp;
nr_active = zone->nr_scan_active;
- zone->nr_scan_inactive +=
- (zone_page_state(zone, NR_INACTIVE) >> priority) + 1;
+
+ zone_inactive = zone_page_state(zone, NR_INACTIVE);
+ tmp = (zone_inactive >> priority) + 1;
+ if (unlikely(tmp > zone_inactive))
+ tmp = zone_inactive;
+ zone->nr_scan_inactive += tmp;
nr_inactive = zone->nr_scan_inactive;
+
if (nr_inactive >= sc->swap_cluster_max)
zone->nr_scan_inactive = 0;
else

2008-01-24 04:25:24

by KOSAKI Motohiro

Subject: [RFC][PATCH 7/8] mem_notify v5: ignore very small zone for prevent incorrect low mem notify.

On x86, ZONE_DMA is very small.
It causes undesirable low memory notifications and should be ignored.

But on some other architectures, ZONE_DMA spans 4GB,
which is far too large to ignore.

Therefore, whether to ignore a zone is decided by its size.

ChangeLog:
v5: new


Signed-off-by: KOSAKI Motohiro <[email protected]>

---
include/linux/mem_notify.h | 3 +++
mm/page_alloc.c | 6 +++++-
2 files changed, 8 insertions(+), 1 deletion(-)

Index: b/include/linux/mem_notify.h
===================================================================
--- a/include/linux/mem_notify.h 2008-01-23 22:06:04.000000000 +0900
+++ b/include/linux/mem_notify.h 2008-01-23 22:08:02.000000000 +0900
@@ -22,6 +22,9 @@ static inline void memory_pressure_notif
unsigned long target;
unsigned long pages_high, pages_free, pages_reserve;

+ if (unlikely(zone->mem_notify_status == -1))
+ return;
+
if (pressure) {
target = atomic_long_read(&last_mem_notify) + MEM_NOTIFY_FREQ;
if (likely(time_before(jiffies, target)))
Index: b/mm/page_alloc.c
===================================================================
--- a/mm/page_alloc.c 2008-01-23 22:07:57.000000000 +0900
+++ b/mm/page_alloc.c 2008-01-23 22:08:02.000000000 +0900
@@ -3470,7 +3470,11 @@ static void __meminit free_area_init_cor
zone->zone_pgdat = pgdat;

zone->prev_priority = DEF_PRIORITY;
- zone->mem_notify_status = 0;
+
+ if (zone->present_pages < (pgdat->node_present_pages / 10))
+ zone->mem_notify_status = -1;
+ else
+ zone->mem_notify_status = 0;

zone_pcp_init(zone);
INIT_LIST_HEAD(&zone->active_list);

2008-01-24 04:26:31

by KOSAKI Motohiro

Subject: [RFC][PATCH 8/8] mem_notify v5: support fasync feature


Implement FASYNC capability for /dev/mem_notify.

<usage example>
	fd = open("/dev/mem_notify", O_RDONLY);

	fcntl(fd, F_SETOWN, getpid());

	flags = fcntl(fd, F_GETFL);
	fcntl(fd, F_SETFL, flags | FASYNC); /* when low memory, receive SIGIO */
</usage example>


ChangeLog
v5: new



Signed-off-by: KOSAKI Motohiro <[email protected]>

---
mm/mem_notify.c | 95 +++++++++++++++++++++++++++++++++++++++++++++++++++++---
1 file changed, 90 insertions(+), 5 deletions(-)

Index: b/mm/mem_notify.c
===================================================================
--- a/mm/mem_notify.c 2008-01-23 23:09:08.000000000 +0900
+++ b/mm/mem_notify.c 2008-01-23 23:09:27.000000000 +0900
@@ -23,18 +23,58 @@
#define PROC_WAKEUP_GUARD (10*HZ)

struct mem_notify_file_info {
- unsigned long last_proc_notify;
+ unsigned long last_proc_notify;
+ struct file *file;
+
+ struct list_head fa_list;
+ int fa_fd;
};

static DECLARE_WAIT_QUEUE_HEAD(mem_wait);
static atomic_long_t nr_under_memory_pressure_zones = ATOMIC_LONG_INIT(0);
static atomic_t nr_watcher_task = ATOMIC_INIT(0);
+static LIST_HEAD(mem_notify_fasync_list);
+static DEFINE_SPINLOCK(mem_notify_fasync_lock);
+static atomic_t nr_fasync_task = ATOMIC_INIT(0);

atomic_long_t last_mem_notify = ATOMIC_LONG_INIT(INITIAL_JIFFIES);

+
+static void mem_notify_kill_fasync_nr(int sig, int band, int nr)
+{
+ struct mem_notify_file_info *iter, *saved_iter;
+ LIST_HEAD(l_fired);
+
+ if (!nr)
+ return;
+
+ spin_lock(&mem_notify_fasync_lock);
+
+ list_for_each_entry_safe_reverse(iter, saved_iter, &mem_notify_fasync_list, fa_list) {
+ struct fown_struct * fown;
+
+ fown = &iter->file->f_owner;
+ if (!(sig == SIGURG && fown->signum == 0))
+ send_sigio(fown, iter->fa_fd, band);
+
+ list_del(&iter->fa_list);
+ list_add(&iter->fa_list, &l_fired);
+ if(!--nr)
+ break;
+ }
+
+ /* rotate moving for FIFO wakeup */
+ list_splice(&l_fired, &mem_notify_fasync_list);
+
+ spin_unlock(&mem_notify_fasync_lock);
+}
+
+
void __memory_pressure_notify(struct zone* zone, int pressure)
{
int nr_wakeup;
+ int nr_poll_wakeup = 0;
+ int nr_fasync_wakeup = 0;
 unsigned long flags;

spin_lock_irqsave(&mem_wait.lock, flags);
@@ -47,13 +87,18 @@ void __memory_pressure_notify(struct zon

if (pressure) {
int nr_watcher = atomic_read(&nr_watcher_task);
+ int nr_fasync = atomic_read(&nr_fasync_task);

nr_wakeup = (nr_watcher >> 4) + 1;
if (unlikely(nr_wakeup > 100))
nr_wakeup = 100;

+ nr_fasync_wakeup = nr_wakeup * nr_fasync/nr_watcher;
+ nr_poll_wakeup = nr_wakeup - nr_fasync_wakeup;
+
atomic_long_set(&last_mem_notify, jiffies);
- wake_up_locked_nr(&mem_wait, nr_wakeup);
+ wake_up_locked_nr(&mem_wait, nr_poll_wakeup);
+ mem_notify_kill_fasync_nr(SIGIO, POLL_IN, nr_fasync_wakeup);
}

spin_unlock_irqrestore(&mem_wait.lock, flags);
@@ -71,6 +116,9 @@ static int mem_notify_open(struct inode
}

info->last_proc_notify = INITIAL_JIFFIES;
+ INIT_LIST_HEAD(&info->fa_list);
+ info->file = file;
+ info->fa_fd = -1;
file->private_data = info;
atomic_inc(&nr_watcher_task);
out:
@@ -79,7 +127,16 @@ out:

static int mem_notify_release(struct inode *inode, struct file *file)
{
- kfree(file->private_data);
+ struct mem_notify_file_info *info = file->private_data;
+
+ spin_lock(&mem_notify_fasync_lock);
+ if (!list_empty(&info->fa_list)) {
+ list_del(&info->fa_list);
+ atomic_dec(&nr_fasync_task);
+ }
+ spin_unlock(&mem_notify_fasync_lock);
+
+ kfree(info);
atomic_dec(&nr_watcher_task);
return 0;
}
@@ -106,9 +163,37 @@ out:
return retval;
}

+static int mem_notify_fasync(int fd, struct file *filp, int on)
+{
+ struct mem_notify_file_info *info = filp->private_data;
+ int result = 0;
+
+ spin_lock(&mem_notify_fasync_lock);
+ if (on) {
+ if (list_empty(&info->fa_list)) {
+ info->fa_fd = fd;
+ list_add(&info->fa_list, &mem_notify_fasync_list);
+ result = 1;
+ } else {
+ info->fa_fd = fd;
+ }
+ } else {
+ if (!list_empty(&info->fa_list)) {
+ list_del_init(&info->fa_list);
+ info->fa_fd = -1;
+ result = -1;
+ }
+ }
+ if (result != 0)
+ atomic_add(result, &nr_fasync_task);
+ spin_unlock(&mem_notify_fasync_lock);
+ return abs(result);
+}
+
struct file_operations mem_notify_fops = {
- .open = mem_notify_open,
+ .open = mem_notify_open,
.release = mem_notify_release,
- .poll = mem_notify_poll,
+ .poll = mem_notify_poll,
+ .fasync = mem_notify_fasync,
};
EXPORT_SYMBOL(mem_notify_fops);

2008-01-24 12:19:57

by Daniel Spång

Subject: Re: [RFC][PATCH 3/8] mem_notify v5: introduce /dev/mem_notify new device (the core of this patch series)

Hi KOSAKI,

On 1/24/08, KOSAKI Motohiro <[email protected]> wrote:
> +#define PROC_WAKEUP_GUARD (10*HZ)
[...]
> + timeout = info->last_proc_notify + PROC_WAKEUP_GUARD;

If only one or a few processes are using the system, I think 10 seconds
is a bit too long for them to wait before they get the notification again.
Can we decrease this value? Or make it configurable under /proc? Or
make it lower when there are fewer users? Something like:

timeout = info->last_proc_notify + min(mem_notify_users, PROC_WAKEUP_GUARD);

Cheers,
Daniel

2008-01-25 03:34:22

by KOSAKI Motohiro

Subject: Re: [RFC][PATCH 3/8] mem_notify v5: introduce /dev/mem_notify new device (the core of this patch series)

Hi Daniel

> > +#define PROC_WAKEUP_GUARD (10*HZ)
> [...]
> > + timeout = info->last_proc_notify + PROC_WAKEUP_GUARD;
>
> If only one or a few processes are using the system I think 10 seconds
> is a little long time to wait before they get the notification again.
> Can we decrease this value? Or make it configurable under /proc? Or
> make it lower with fewer users? Something like:

Oh, that is a very interesting issue.
Thank you for the good catch.

After thinking it over, I realize my current implementation is quite stupid.
The current worst cases are below.

1. low end
- many processes that each use only a little memory (sh, cp, etc.) exist.
- 1 memory-eater process (maybe a fat browser) exists,
and it watches /dev/mem_notify.

2. high end
- many processes that each use only a little memory (sh, cp, etc.) exist.
- 1 memory-eater process (maybe a DB) exists,
and it watches /dev/mem_notify.

The point is that only one process watches /dev/mem_notify, not several.
I will gladly fix it.


> timeout = info->last_proc_notify + min(mem_notify_users, PROC_WAKEUP_GUARD);

I like this formula.
The remaining problem is deciding the default value for the case where
only one process watches /dev/mem_notify.

What do you think?
And if my low-end worst case doesn't match your situation,
could you please explain your usage in more detail?


BTW:
we will probably add a /proc configuration eventually,
but I think it is too early now.
A configurable parameter sometimes prevents discussion of a nicer default value.
Instead, I hope to adjust the default value to fit your usage.


- kosaki