LinuxLists.cc - [PATCH 0/7] Kill PF

[permalink] [raw]

Subject: [PATCH 1/7] dm: use __GFP_HIGH instead PF_MEMALLOC

Non MM subsystem must not use PF_MEMALLOC. Memory reclaim need few
memory, anyone must not prevent it. Otherwise the system cause
mysterious hang-up and/or OOM Killer invokation.

Cc: Alasdair G Kergon <[email protected]>
Cc: [email protected]
Signed-off-by: KOSAKI Motohiro <[email protected]>
---
drivers/md/dm-ioctl.c | 18 +++++++-----------
1 files changed, 7 insertions(+), 11 deletions(-)

diff --git a/drivers/md/dm-ioctl.c b/drivers/md/dm-ioctl.c
index a679429..4d24b0a 100644
--- a/drivers/md/dm-ioctl.c
+++ b/drivers/md/dm-ioctl.c
@@ -1396,7 +1396,13 @@ static int copy_params(struct dm_ioctl __user *user, struct dm_ioctl **param)
if (tmp.data_size < (sizeof(tmp) - sizeof(tmp.data)))
return -EINVAL;

- dmi = vmalloc(tmp.data_size);
+
+ /*
+ * We use __vmalloc(__GFP_HIGH) instead vmalloc() because trying to
+ * avoid low memory issues when a device is suspended.
+ */
+ dmi = __vmalloc(tmp.data_size, GFP_KERNEL | __GFP_HIGHMEM | __GFP_HIGH,
+ PAGE_KERNEL);
if (!dmi)
return -ENOMEM;

@@ -1473,20 +1479,10 @@ static int ctl_ioctl(uint command, struct dm_ioctl __user *user)
DMWARN("dm_ctl_ioctl: unknown command 0x%x", command);
return -ENOTTY;
}
-
- /*
- * Trying to avoid low memory issues when a device is
- * suspended.
- */
- current->flags |= PF_MEMALLOC;
-
/*
* Copy the parameters into kernel space.
*/
r = copy_params(user, &param);
-
- current->flags &= ~PF_MEMALLOC;
-
if (r)
return r;

--
1.6.2.5

2009-11-17 07:17:51

[permalink] [raw]

Subject: [PATCH 2/7] mmc: Don't use PF_MEMALLOC

Non MM subsystem must not use PF_MEMALLOC. Memory reclaim need few
memory, anyone must not prevent it. Otherwise the system cause
mysterious hang-up and/or OOM Killer invokation.

Cc: [email protected]
Signed-off-by: KOSAKI Motohiro <[email protected]>
---
drivers/mmc/card/queue.c | 2 --
1 files changed, 0 insertions(+), 2 deletions(-)

diff --git a/drivers/mmc/card/queue.c b/drivers/mmc/card/queue.c
index 49e5823..5deb996 100644
--- a/drivers/mmc/card/queue.c
+++ b/drivers/mmc/card/queue.c
@@ -46,8 +46,6 @@ static int mmc_queue_thread(void *d)
struct mmc_queue *mq = d;
struct request_queue *q = mq->queue;

- current->flags |= PF_MEMALLOC;
-
down(&mq->thread_sem);
do {
struct request *req = NULL;
--
1.6.2.5

2009-11-17 07:18:39

[permalink] [raw]

Subject: [PATCH 3/7] mtd: Don't use PF_MEMALLOC

2009-11-17 07:19:31

[permalink] [raw]

Subject: [PATCH 4/7] nandsim: Don't use PF_MEMALLOC

Non MM subsystem must not use PF_MEMALLOC. Memory reclaim need few
memory, anyone must not prevent it. Otherwise the system cause
mysterious hang-up and/or OOM Killer invokation.

Cc: David Woodhouse <[email protected]>
Cc: Artem Bityutskiy <[email protected]>
Cc: [email protected]
Signed-off-by: KOSAKI Motohiro <[email protected]>
---
drivers/mtd/nand/nandsim.c | 22 ++--------------------
1 files changed, 2 insertions(+), 20 deletions(-)

diff --git a/drivers/mtd/nand/nandsim.c b/drivers/mtd/nand/nandsim.c
index cd0711b..97a8bbb 100644
--- a/drivers/mtd/nand/nandsim.c
+++ b/drivers/mtd/nand/nandsim.c
@@ -1322,34 +1322,18 @@ static int get_pages(struct nandsim *ns, struct file *file, size_t count, loff_t
return 0;
}

-static int set_memalloc(void)
-{
- if (current->flags & PF_MEMALLOC)
- return 0;
- current->flags |= PF_MEMALLOC;
- return 1;
-}
-
-static void clear_memalloc(int memalloc)
-{
- if (memalloc)
- current->flags &= ~PF_MEMALLOC;
-}
-
static ssize_t read_file(struct nandsim *ns, struct file *file, void *buf, size_t count, loff_t *pos)
{
mm_segment_t old_fs;
ssize_t tx;
- int err, memalloc;
+ int err;

err = get_pages(ns, file, count, *pos);
if (err)
return err;
old_fs = get_fs();
set_fs(get_ds());
- memalloc = set_memalloc();
tx = vfs_read(file, (char __user *)buf, count, pos);
- clear_memalloc(memalloc);
set_fs(old_fs);
put_pages(ns);
return tx;
@@ -1359,16 +1343,14 @@ static ssize_t write_file(struct nandsim *ns, struct file *file, void *buf, size
{
mm_segment_t old_fs;
ssize_t tx;
- int err, memalloc;
+ int err;

err = get_pages(ns, file, count, *pos);
if (err)
return err;
old_fs = get_fs();
set_fs(get_ds());
- memalloc = set_memalloc();
tx = vfs_write(file, (char __user *)buf, count, pos);
- clear_memalloc(memalloc);
set_fs(old_fs);
put_pages(ns);
return tx;
--
1.6.2.5

2009-11-17 07:21:09

[permalink] [raw]

Subject: [PATCH 5/7] Revert "Intel IOMMU: Avoid memory allocation failures in dma map api calls"

commit eb3fa7cb51 said Intel IOMMU

Intel IOMMU driver needs memory during DMA map calls to setup its
internal page tables and for other data structures. As we all know
that these DMA map calls are mostly called in the interrupt context
or with the spinlock held by the upper level drivers(network/storage
drivers), so in order to avoid any memory allocation failure due to
low memory issues, this patch makes memory allocation by temporarily
setting PF_MEMALLOC flags for the current task before making memory
allocation calls.

We evaluated mempools as a backup when kmem_cache_alloc() fails
and found that mempools are really not useful here because
1) We don't know for sure how much to reserve in advance
2) And mempools are not useful for GFP_ATOMIC case (as we call
memory alloc functions with GFP_ATOMIC)

(akpm: point 2 is wrong...)

The above description doesn't justify to waste system emergency memory
at all. Non MM subsystem must not use PF_MEMALLOC. Memory reclaim need
few memory, anyone must not prevent it. Otherwise the system cause
mysterious hang-up and/or OOM Killer invokation.

Plus, akpm already pointed out what we should do.

Then, this patch revert it.

Cc: Keshavamurthy Anil S <[email protected]>
Cc: David Woodhouse <[email protected]>
Cc: [email protected]
Signed-off-by: KOSAKI Motohiro <[email protected]>
---
drivers/pci/intel-iommu.c | 30 ++++--------------------------
1 files changed, 4 insertions(+), 26 deletions(-)

diff --git a/drivers/pci/intel-iommu.c b/drivers/pci/intel-iommu.c
index 1840a05..17d6f1e 100644
--- a/drivers/pci/intel-iommu.c
+++ b/drivers/pci/intel-iommu.c
@@ -386,31 +386,9 @@ static struct kmem_cache *iommu_domain_cache;
static struct kmem_cache *iommu_devinfo_cache;
static struct kmem_cache *iommu_iova_cache;

-static inline void *iommu_kmem_cache_alloc(struct kmem_cache *cachep)
-{
- unsigned int flags;
- void *vaddr;
-
- /* trying to avoid low memory issues */
- flags = current->flags & PF_MEMALLOC;
- current->flags |= PF_MEMALLOC;
- vaddr = kmem_cache_alloc(cachep, GFP_ATOMIC);
- current->flags &= (~PF_MEMALLOC | flags);
- return vaddr;
-}
-
-
static inline void *alloc_pgtable_page(void)
{
- unsigned int flags;
- void *vaddr;
-
- /* trying to avoid low memory issues */
- flags = current->flags & PF_MEMALLOC;
- current->flags |= PF_MEMALLOC;
- vaddr = (void *)get_zeroed_page(GFP_ATOMIC);
- current->flags &= (~PF_MEMALLOC | flags);
- return vaddr;
+ return (void *)get_zeroed_page(GFP_ATOMIC);
}

static inline void free_pgtable_page(void *vaddr)
@@ -420,7 +398,7 @@ static inline void free_pgtable_page(void *vaddr)

static inline void *alloc_domain_mem(void)
{
- return iommu_kmem_cache_alloc(iommu_domain_cache);
+ return kmem_cache_alloc(iommu_domain_cache, GFP_ATOMIC);
}

static void free_domain_mem(void *vaddr)
@@ -430,7 +408,7 @@ static void free_domain_mem(void *vaddr)

static inline void * alloc_devinfo_mem(void)
{
- return iommu_kmem_cache_alloc(iommu_devinfo_cache);
+ return kmem_cache_alloc(iommu_devinfo_cache, GFP_ATOMIC);
}

static inline void free_devinfo_mem(void *vaddr)
@@ -440,7 +418,7 @@ static inline void free_devinfo_mem(void *vaddr)

struct iova *alloc_iova_mem(void)
{
- return iommu_kmem_cache_alloc(iommu_iova_cache);
+ return kmem_cache_alloc(iommu_iova_cache, GFP_ATOMIC);
}

void free_iova_mem(struct iova *iova)
--
1.6.2.5

2009-11-17 07:22:30

[permalink] [raw]

Subject: [PATCH 6/7] cifs: Don't use PF_MEMALLOC

2009-11-17 07:23:46

[permalink] [raw]

Subject: [PATCH 7/7] xfs: Don't use PF_MEMALLOC

2009-11-17 07:32:25

[permalink] [raw]

Subject: [PATCH] Mark cifs mailing list as "moderated as non-subscribers"

if non-subscribers post bug report to CIFS mailing list, they will get
following messages.

Your mail to 'linux-cifs-client' with the subject

[PATCH x/x] cifs: xxxxxxxxxxxxx

Is being held until the list moderator can review it for approval.

The reason it is being held:

Post by non-member to a members-only list

Either the message will get posted to the list, or you will receive
notification of the moderator's decision. If you would like to cancel
this posting, please visit the following URL:

members-only list should be written as so in MAINTAINERS file.

Signed-off-by: KOSAKI Motohiro <[email protected]>
---
MAINTAINERS | 4 ++--
1 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/MAINTAINERS b/MAINTAINERS
index 81d68d5..5c44047 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -1418,8 +1418,8 @@ F: include/linux/coda*.h

COMMON INTERNET FILE SYSTEM (CIFS)
M: Steve French <[email protected]>
-L: [email protected]
-L: [email protected]
+L: [email protected] (moderated for non-subscribers)
+L: [email protected] (moderated for non-subscribers)
W: http://linux-cifs.samba.org/
T: git git://git.kernel.org/pub/scm/linux/kernel/git/sfrench/cifs-2.6.git
S: Supported
--
1.6.2.5

2009-11-17 08:07:49

by David Rientjes

[permalink] [raw]

Subject: Re: [PATCH 0/7] Kill PF_MEMALLOC abuse

On Tue, 17 Nov 2009, KOSAKI Motohiro wrote:

>
> PF_MEMALLOC have following effects.
> (1) Ignore zone watermark
> (2) Don't call reclaim although allocation failure, instead return ENOMEM
> (3) Don't invoke OOM Killer
> (4) Don't retry internally in page alloc
>
> Some subsystem paid attention (1) only, and start to use PF_MEMALLOC abuse.
> But, the fact is, PF_MEMALLOC is the promise of "I have lots freeable memory.
> if I allocate few memory, I can return more much meory to the system!".
> Non MM subsystem must not use PF_MEMALLOC. Memory reclaim
> need few memory, anyone must not prevent it. Otherwise the system cause
> mysterious hang-up and/or OOM Killer invokation.
>
> if many subsystem will be able to use emergency memory without any
> usage rule, it isn't for emergency. it can become empty easily.
>
> Plus, characteristics (2)-(4) mean PF_MEMALLOC don't fit to general
> high priority memory allocation.
>
> Thus, We kill all PF_MEMALLOC usage in no MM subsystem.
>

I agree in principle with removing non-VM users of PF_MEMALLOC, but I
think it should be left to the individual subsystem maintainers to apply
or ack since the allocations may depend on the __GFP_NORETRY | ~__GFP_WAIT
behavior of PF_MEMALLOC. This could be potentially dangerous for a
PF_MEMALLOC user if allocations made by the kthread, for example, should
never retry for orders smaller than PAGE_ALLOC_COSTLY_ORDER or block on
direct reclaim.

2009-11-17 08:33:47

[permalink] [raw]

Subject: Re: [PATCH 0/7] Kill PF_MEMALLOC abuse

> On Tue, 17 Nov 2009, KOSAKI Motohiro wrote:
>
> >
> > PF_MEMALLOC have following effects.
> > (1) Ignore zone watermark
> > (2) Don't call reclaim although allocation failure, instead return ENOMEM
> > (3) Don't invoke OOM Killer
> > (4) Don't retry internally in page alloc
> >
> > Some subsystem paid attention (1) only, and start to use PF_MEMALLOC abuse.
> > But, the fact is, PF_MEMALLOC is the promise of "I have lots freeable memory.
> > if I allocate few memory, I can return more much meory to the system!".
> > Non MM subsystem must not use PF_MEMALLOC. Memory reclaim
> > need few memory, anyone must not prevent it. Otherwise the system cause
> > mysterious hang-up and/or OOM Killer invokation.
> >
> > if many subsystem will be able to use emergency memory without any
> > usage rule, it isn't for emergency. it can become empty easily.
> >
> > Plus, characteristics (2)-(4) mean PF_MEMALLOC don't fit to general
> > high priority memory allocation.
> >
> > Thus, We kill all PF_MEMALLOC usage in no MM subsystem.
>
> I agree in principle with removing non-VM users of PF_MEMALLOC, but I
> think it should be left to the individual subsystem maintainers to apply
> or ack since the allocations may depend on the __GFP_NORETRY | ~__GFP_WAIT
> behavior of PF_MEMALLOC. This could be potentially dangerous for a
> PF_MEMALLOC user if allocations made by the kthread, for example, should
> never retry for orders smaller than PAGE_ALLOC_COSTLY_ORDER or block on
> direct reclaim.

if there is so such reason. we might need to implement another MM trick.
but keeping this strage usage is not a option. All memory freeing activity
(e.g. page out, task killing) need some memory. we need to protect its
emergency memory. otherwise linux reliability decrease dramatically when
the system face to memory stress.

2009-11-17 08:36:44

by David Rientjes

[permalink] [raw]

Subject: Re: [PATCH 0/7] Kill PF_MEMALLOC abuse

On Tue, 17 Nov 2009, KOSAKI Motohiro wrote:

> > I agree in principle with removing non-VM users of PF_MEMALLOC, but I
> > think it should be left to the individual subsystem maintainers to apply
> > or ack since the allocations may depend on the __GFP_NORETRY | ~__GFP_WAIT
> > behavior of PF_MEMALLOC. This could be potentially dangerous for a
> > PF_MEMALLOC user if allocations made by the kthread, for example, should
> > never retry for orders smaller than PAGE_ALLOC_COSTLY_ORDER or block on
> > direct reclaim.
>
> if there is so such reason. we might need to implement another MM trick.
> but keeping this strage usage is not a option. All memory freeing activity
> (e.g. page out, task killing) need some memory. we need to protect its
> emergency memory. otherwise linux reliability decrease dramatically when
> the system face to memory stress.
>

Right, that's why I agree with trying to remove non-VM use of PF_MEMALLOC,
but I think this patchset needs to go through the individual subsystem
maintainers so they can ensure the conversion doesn't cause undesirable
results if their kthreads' memory allocations depend on the __GFP_NORETRY
behavior that PF_MEMALLOC ensures. Otherwise it looks good.

2009-11-17 10:15:23

by Christoph Hellwig

[permalink] [raw]

Subject: Re: [PATCH 0/7] Kill PF_MEMALLOC abuse

On Tue, Nov 17, 2009 at 04:16:04PM +0900, KOSAKI Motohiro wrote:
> Some subsystem paid attention (1) only, and start to use PF_MEMALLOC abuse.
> But, the fact is, PF_MEMALLOC is the promise of "I have lots freeable memory.
> if I allocate few memory, I can return more much meory to the system!".
> Non MM subsystem must not use PF_MEMALLOC. Memory reclaim
> need few memory, anyone must not prevent it. Otherwise the system cause
> mysterious hang-up and/or OOM Killer invokation.

And that's exactly the promises xfsbufd gives. It writes out dirty
metadata buffers and will free lots of memory if you kick it.

2009-11-17 10:24:40

[permalink] [raw]

Subject: Re: [PATCH 0/7] Kill PF_MEMALLOC abuse

> On Tue, Nov 17, 2009 at 04:16:04PM +0900, KOSAKI Motohiro wrote:
> > Some subsystem paid attention (1) only, and start to use PF_MEMALLOC abuse.
> > But, the fact is, PF_MEMALLOC is the promise of "I have lots freeable memory.
> > if I allocate few memory, I can return more much meory to the system!".
> > Non MM subsystem must not use PF_MEMALLOC. Memory reclaim
> > need few memory, anyone must not prevent it. Otherwise the system cause
> > mysterious hang-up and/or OOM Killer invokation.
>
> And that's exactly the promises xfsbufd gives. It writes out dirty
> metadata buffers and will free lots of memory if you kick it.

if xfsbufd doesn't only write out dirty data but also drop page,
I agree you.

2009-11-17 10:26:59

by Christoph Hellwig

[permalink] [raw]

Subject: Re: [PATCH 0/7] Kill PF_MEMALLOC abuse

On Tue, Nov 17, 2009 at 07:24:42PM +0900, KOSAKI Motohiro wrote:
> if xfsbufd doesn't only write out dirty data but also drop page,
> I agree you.

It then drops the reference to the buffer which drops references to the
pages, which often are the last references, yes.

2009-11-17 10:27:21

by Alan

[permalink] [raw]

Subject: Re: [PATCH 2/7] mmc: Don't use PF_MEMALLOC

On Tue, 17 Nov 2009 16:17:50 +0900 (JST)
KOSAKI Motohiro <[email protected]> wrote:

> Non MM subsystem must not use PF_MEMALLOC. Memory reclaim need few
> memory, anyone must not prevent it. Otherwise the system cause
> mysterious hang-up and/or OOM Killer invokation.

So now what happens if we are paging and all our memory is tied up for
writeback to a device or CIFS etc which can no longer allocate the memory
to complete the write out so the MM can reclaim ?

Am I missing something or is this patch set not addressing the case where
the writeback thread needs to inherit PF_MEMALLOC somehow (at least for
the I/O in question and those blocking it)

Alan

2009-11-17 10:32:32

[permalink] [raw]

Subject: Re: [PATCH 2/7] mmc: Don't use PF_MEMALLOC

On Tue, Nov 17, 2009 at 7:29 PM, Alan Cox <[email protected]> wrote:
> On Tue, 17 Nov 2009 16:17:50 +0900 (JST)
> KOSAKI Motohiro <[email protected]> wrote:
>
>> Non MM subsystem must not use PF_MEMALLOC. Memory reclaim need few
>> memory, anyone must not prevent it. Otherwise the system cause
>> mysterious hang-up and/or OOM Killer invokation.
>
> So now what happens if we are paging and all our memory is tied up for
> writeback to a device or CIFS etc which can no longer allocate the memory
> to complete the write out so the MM can reclaim ?
>
> Am I missing something or is this patch set not addressing the case where
> the writeback thread needs to inherit PF_MEMALLOC somehow (at least for
> the I/O in question and those blocking it)
>

I agree.
At least, drivers for writeout is proper for using PF_MEMALLOC, I think.

> Alan
>
> --
> To unsubscribe, send a message with 'unsubscribe linux-mm' in
> the body to [email protected]. For more info on Linux MM,
> see: http://www.linux-mm.org/ .
> Don't email: <a href=mailto:"[email protected]"> [email protected] </a>
>

--
Kind regards,
Minchan Kim

2009-11-17 10:38:00

by Oliver Neukum

[permalink] [raw]

Subject: Re: [PATCH 2/7] mmc: Don't use PF_MEMALLOC

Am Dienstag, 17. November 2009 11:32:36 schrieb Minchan Kim:
> On Tue, Nov 17, 2009 at 7:29 PM, Alan Cox <[email protected]> wrote:
> > On Tue, 17 Nov 2009 16:17:50 +0900 (JST)
> >
> > KOSAKI Motohiro <[email protected]> wrote:
> >> Non MM subsystem must not use PF_MEMALLOC. Memory reclaim need few
> >> memory, anyone must not prevent it. Otherwise the system cause
> >> mysterious hang-up and/or OOM Killer invokation.
> >
> > So now what happens if we are paging and all our memory is tied up for
> > writeback to a device or CIFS etc which can no longer allocate the memory
> > to complete the write out so the MM can reclaim ?
> >
> > Am I missing something or is this patch set not addressing the case where
> > the writeback thread needs to inherit PF_MEMALLOC somehow (at least for
> > the I/O in question and those blocking it)
>
> I agree.
> At least, drivers for writeout is proper for using PF_MEMALLOC, I think.

For the same reason error handling should also use it, shouldn't it?

Regards
Oliver

2009-11-17 11:58:06

[permalink] [raw]

Subject: Re: [PATCH 2/7] mmc: Don't use PF_MEMALLOC

> On Tue, 17 Nov 2009 16:17:50 +0900 (JST)
> KOSAKI Motohiro <[email protected]> wrote:
>
> > Non MM subsystem must not use PF_MEMALLOC. Memory reclaim need few
> > memory, anyone must not prevent it. Otherwise the system cause
> > mysterious hang-up and/or OOM Killer invokation.
>
> So now what happens if we are paging and all our memory is tied up for
> writeback to a device or CIFS etc which can no longer allocate the memory
> to complete the write out so the MM can reclaim ?

Probably my answer is not so simple. sorry.

reason1: MM reclaim does both dropping clean memory and writing out dirty pages.
reason2: if all memory is exhausted, maybe we can't recover it. it is
fundamental limitation of Virtual Memory subsystem. and, min-watermark is
decided by number of system physcal memory, but # of I/O issue (i.e. # of
pages of used by writeback thread) is mainly decided # of devices.
then, we can't gurantee min-watermark is sufficient on any systems.
Only reasonable solution is mempool like reservation, I think.
IOW, any reservation memory shouldn't share unrelated subsystem. otherwise
we lost any gurantee.

So, I think we need to hear why many developer don't use mempool,
instead use PF_MEMALLOC.

> Am I missing something or is this patch set not addressing the case where
> the writeback thread needs to inherit PF_MEMALLOC somehow (at least for
> the I/O in question and those blocking it)

Yes, probably my patchset isn't perfect. honestly I haven't understand
why so many developer prefer to use PF_MEMALLOC.

2009-11-17 12:24:21

[permalink] [raw]

Subject: Re: [PATCH 0/7] Kill PF_MEMALLOC abuse

> On Tue, Nov 17, 2009 at 07:24:42PM +0900, KOSAKI Motohiro wrote:
> > if xfsbufd doesn't only write out dirty data but also drop page,
> > I agree you.
>
> It then drops the reference to the buffer which drops references to the
> pages, which often are the last references, yes.

I though it is not typical case. Am I wrong?
if so, I'm sorry. I'm not XFS expert.

2009-11-17 12:47:14

by Christoph Hellwig

[permalink] [raw]

Subject: Re: [PATCH 0/7] Kill PF_MEMALLOC abuse

On Tue, Nov 17, 2009 at 09:24:24PM +0900, KOSAKI Motohiro wrote:
> > On Tue, Nov 17, 2009 at 07:24:42PM +0900, KOSAKI Motohiro wrote:
> > > if xfsbufd doesn't only write out dirty data but also drop page,
> > > I agree you.
> >
> > It then drops the reference to the buffer which drops references to the
> > pages, which often are the last references, yes.
>
> I though it is not typical case. Am I wrong?
> if so, I'm sorry. I'm not XFS expert.

I think in the typical case it's the last reference. The are two
reasons why it might not be:

- we're on a filesystem with block size < page size in which case two
buffers can share a page and we'd need to write out and release both
buffers to free the page
- someone else might have a reference on the buffer. Offhand I can't
remember a place where we do this for delayed write buffers (which
is what xfsbufd writes out) as it would be a bit against the purpose
of those delayed write buffers.

2009-11-17 12:48:17

by Jeff Layton

[permalink] [raw]

Subject: Re: [PATCH 6/7] cifs: Don't use PF_MEMALLOC

On Tue, 17 Nov 2009 16:22:32 +0900 (JST)
KOSAKI Motohiro <[email protected]> wrote:

>
> Non MM subsystem must not use PF_MEMALLOC. Memory reclaim need few
> memory, anyone must not prevent it. Otherwise the system cause
> mysterious hang-up and/or OOM Killer invokation.
>
> Cc: Steve French <[email protected]>
> Cc: [email protected]
> Cc: [email protected]
> Signed-off-by: KOSAKI Motohiro <[email protected]>
> ---
> fs/cifs/connect.c | 1 -
> 1 files changed, 0 insertions(+), 1 deletions(-)
>
> diff --git a/fs/cifs/connect.c b/fs/cifs/connect.c
> index 63ea83f..f9b1553 100644
> --- a/fs/cifs/connect.c
> +++ b/fs/cifs/connect.c
> @@ -337,7 +337,6 @@ cifs_demultiplex_thread(struct TCP_Server_Info *server)
> bool isMultiRsp;
> int reconnect;
>
> - current->flags |= PF_MEMALLOC;
> cFYI(1, ("Demultiplex PID: %d", task_pid_nr(current)));
>
> length = atomic_inc_return(&tcpSesAllocCount);

This patch appears to be safe for CIFS. I believe that the demultiplex
thread only does mempool allocations currently. The only other case
where it did an allocation was recently changed with the conversion of
the oplock break code to use slow_work.

Barring anything I've missed...

Acked-by: Jeff Layton <[email protected]>

2009-11-17 12:51:15

[permalink] [raw]

Subject: Re: [PATCH 2/7] mmc: Don't use PF_MEMALLOC

Sorry for the noise.
While I am typing, my mail client already send the mail. :(.
This is genuine.

KOSAKI Motohiro wrote:
>> On Tue, 17 Nov 2009 16:17:50 +0900 (JST)
>> KOSAKI Motohiro <[email protected]> wrote:
>>
>>> Non MM subsystem must not use PF_MEMALLOC. Memory reclaim need few
>>> memory, anyone must not prevent it. Otherwise the system cause
>>> mysterious hang-up and/or OOM Killer invokation.
>> So now what happens if we are paging and all our memory is tied up for
>> writeback to a device or CIFS etc which can no longer allocate the memory
>> to complete the write out so the MM can reclaim ?
>
> Probably my answer is not so simple. sorry.
>
> reason1: MM reclaim does both dropping clean memory and writing out dirty pages.

Who write out dirty pages?
If block driver can't allocate pages for flushing, It means VM can't reclaim
dirty pages after all.

> reason2: if all memory is exhausted, maybe we can't recover it. it is
> fundamental limitation of Virtual Memory subsystem. and, min-watermark is
> decided by number of system physcal memory, but # of I/O issue (i.e. # of
> pages of used by writeback thread) is mainly decided # of devices.
> then, we can't gurantee min-watermark is sufficient on any systems.
> Only reasonable solution is mempool like reservation, I think.

I think it's because mempool reserves memory.
(# of I/O issue\0 is hard to be expected.
How do we determine mempool size of each block driver?
For example, maybe, server use few I/O for nand.
but embedded system uses a lot of I/O.

We need another knob for each block driver?

I understand your point. but it's not simple.
I think, for making sure VM's pages, block driver need to distinguish
normal flush path and flush patch for reclaiming.
So In case of flushing for reclaiming, block driver have to set PF_MEMALLOC.
otherwise, it shouldn't set PF_MEMALLOC.

> IOW, any reservation memory shouldn't share unrelated subsystem. otherwise
> we lost any gurantee.
>
> So, I think we need to hear why many developer don't use mempool,
> instead use PF_MEMALLOC.
>
>> Am I missing something or is this patch set not addressing the case where
>> the writeback thread needs to inherit PF_MEMALLOC somehow (at least for
>> the I/O in question and those blocking it)
>
> Yes, probably my patchset isn't perfect. honestly I haven't understand
> why so many developer prefer to use PF_MEMALLOC.
>
>
>
> --
> To unsubscribe, send a message with 'unsubscribe linux-mm' in
> the body to [email protected]. For more info on Linux MM,
> see: http://www.linux-mm.org/ .
> Don't email: <a href=mailto:"[email protected]"> [email protected] </a>

2009-11-17 13:15:34

by Alasdair G Kergon

[permalink] [raw]

Subject: Re: [PATCH 1/7] dm: use __GFP_HIGH instead PF_MEMALLOC

On Tue, Nov 17, 2009 at 04:17:07PM +0900, KOSAKI Motohiro wrote:
> Non MM subsystem must not use PF_MEMALLOC. Memory reclaim need few
> memory, anyone must not prevent it. Otherwise the system cause
> mysterious hang-up and/or OOM Killer invokation.

This code is also on the critical path, for example, if you are swapping
onto a dm device. (There are ways we could reduce its use further as
not every dm ioctl needs to be on the critical path and the buffer size
could be limited for the ioctls that do.)

But what situations have been causing you trouble? The OOM killer must
generally avoid killing userspace processes that suspend & resume dm
devices, and there are tight restrictions on what those processes
can do safely between suspending and resuming.

Alasdair

2009-11-17 16:41:50

by Steve French

[permalink] [raw]

Subject: Re: [PATCH 6/7] cifs: Don't use PF_MEMALLOC

It is hard to follow exactly what this flag does in /mm (other than
try harder on memory allocations) - I haven't found much about this
flag (e.g. http://lwn.net/Articles/246928/) but it does look like most
of the fs no longer set this (except xfs) e.g. ext3_ordered_writepage.
When running out of memory in the cifs_demultiplex_thread it will
retry 3 seconds later, but if memory allocations ever fail in this
path we could potentially be holding up (an already issued write in)
writepages for that period by not having memory to get the response to
see if the write succeeded.

We pass in few flags for these memory allocation requests: GFP_NOFS
(on the mempool_alloc) and SLAB_HWCACHE_ALIGN (on the
kmem_cache_create of the pool) should we be passing in other flags on
the allocations?

On Tue, Nov 17, 2009 at 6:47 AM, Jeff Layton <[email protected]> wrote:
>
> On Tue, 17 Nov 2009 16:22:32 +0900 (JST)
> KOSAKI Motohiro <[email protected]> wrote:
>
> >
> > Non MM subsystem must not use PF_MEMALLOC. Memory reclaim need few
> > memory, anyone must not prevent it. Otherwise the system cause
> > mysterious hang-up and/or OOM Killer invokation.
> >
> > Cc: Steve French <[email protected]>
> > Cc: [email protected]
> > Cc: [email protected]
> > Signed-off-by: KOSAKI Motohiro <[email protected]>
> > ---
> > ?fs/cifs/connect.c | ? ?1 -
> > ?1 files changed, 0 insertions(+), 1 deletions(-)
> >
> > diff --git a/fs/cifs/connect.c b/fs/cifs/connect.c
> > index 63ea83f..f9b1553 100644
> > --- a/fs/cifs/connect.c
> > +++ b/fs/cifs/connect.c
> > @@ -337,7 +337,6 @@ cifs_demultiplex_thread(struct TCP_Server_Info *server)
> > ? ? ? bool isMultiRsp;
> > ? ? ? int reconnect;
> >
> > - ? ? current->flags |= PF_MEMALLOC;
> > ? ? ? cFYI(1, ("Demultiplex PID: %d", task_pid_nr(current)));
> >
> > ? ? ? length = atomic_inc_return(&tcpSesAllocCount);
>
> This patch appears to be safe for CIFS. I believe that the demultiplex
> thread only does mempool allocations currently. The only other case
> where it did an allocation was recently changed with the conversion of
> the oplock break code to use slow_work.
>
> Barring anything I've missed...
>
> Acked-by: Jeff Layton <[email protected]>

--
Thanks,

Steve

2009-11-17 20:47:16

[permalink] [raw]

Subject: Re: [PATCH 2/7] mmc: Don't use PF_MEMALLOC

On Tue, 2009-11-17 at 21:51 +0900, Minchan Kim wrote:
> I think it's because mempool reserves memory.
> (# of I/O issue\0 is hard to be expected.
> How do we determine mempool size of each block driver?
> For example, maybe, server use few I/O for nand.
> but embedded system uses a lot of I/O.

No, you scale the mempool to the minimum amount required to make
progress -- this includes limiting the 'concurrency' when handing out
mempool objects.

If you run into such tight corners often enough to notice it, there's
something else wrong.

I fully agree with ripping out PF_MEMALLOC from pretty much everything,
including the VM, getting rid of the various abuse outside of the VM
seems like a very good start.

2009-11-17 20:56:24

[permalink] [raw]

Subject: Re: [PATCH 0/7] Kill PF_MEMALLOC abuse

On Tue, 2009-11-17 at 17:33 +0900, KOSAKI Motohiro wrote:
>
> if there is so such reason. we might need to implement another MM trick.
> but keeping this strage usage is not a option. All memory freeing activity
> (e.g. page out, task killing) need some memory. we need to protect its
> emergency memory. otherwise linux reliability decrease dramatically when
> the system face to memory stress.

In general PF_MEMALLOC is a particularly bad idea, even for the VM when
not coupled with limiting the consumption. That is one should make an
upper-bound estimation of the memory needed for a writeout-path per
page, and reserve a small multiple thereof, and limit the number of
pages written out so as to never exceed this estimate.

If the current mempool interface isn't sufficient (not hard to imagine),
look at the swap over NFS patch-set, that includes a much more able
reservation scheme, and accounting framework.

2009-11-17 22:11:54

by Dave Chinner

[permalink] [raw]

Subject: Re: [PATCH 7/7] xfs: Don't use PF_MEMALLOC

On Tue, Nov 17, 2009 at 04:23:43PM +0900, KOSAKI Motohiro wrote:
>
> Non MM subsystem must not use PF_MEMALLOC. Memory reclaim need few
> memory, anyone must not prevent it. Otherwise the system cause
> mysterious hang-up and/or OOM Killer invokation.

The xfsbufd is a woken run by a registered memory shaker. i.e. it
runs when the system needs to reclaim memory. It forceѕ the
delayed write metadata buffers (of which there can be a lot) to disk
so that they can be reclaimed on IO completion. This IO submission
may require ѕome memory to be allocated to be able to free that
memory.

Hence, AFAICT the use of PF_MEMALLOC is valid here.

Cheers,

Dave.
--
Dave Chinner
[email protected]

2009-11-18 00:01:08

[permalink] [raw]

Subject: Re: [PATCH 2/7] mmc: Don't use PF_MEMALLOC

Hi, Peter.

First of all, Thanks for the commenting.

On Wed, Nov 18, 2009 at 5:47 AM, Peter Zijlstra <[email protected]> wrote:
> On Tue, 2009-11-17 at 21:51 +0900, Minchan Kim wrote:
>> I think it's because mempool reserves memory.
>> (# of I/O issue\0 is hard to be expected.
>> How do we determine mempool size of each block driver?
>> For example, maybe, server use few I/O for nand.
>> but embedded system uses a lot of I/O.
>
> No, you scale the mempool to the minimum amount required to make
> progress -- this includes limiting the 'concurrency' when handing out
> mempool objects.
>
> If you run into such tight corners often enough to notice it, there's
> something else wrong.
>
> I fully agree with ripping out PF_MEMALLOC from pretty much everything,
> including the VM, getting rid of the various abuse outside of the VM
> seems like a very good start.
>

I am not against removing PF_MEMALLOC.
Totally, I agree to prevent abusing of PF_MEMALLOC.

What I have a concern is per-block mempool.
Although it's minimum amount of mempool, it can be increased
by adding new block driver. I am not sure how many we will have block driver.

And, person who develop new driver always have to use mempool and consider
what is minimum of mempool.
I think this is a problem of mempool, now.

How about this?
According to system memory, kernel have just one mempool for I/O which
is one shared by several block driver.

And we make new API block driver can use.
Of course, as usual It can use dynamic memoy. Only it can use mempool if
system don't have much dynamic memory.

In this case, we can control read/write path. read I/O can't help
memory reclaiming.
So I think read I/O don't use mempool, I am not sure. :)

--
Kind regards,
Minchan Kim

2009-11-18 05:55:50

[permalink] [raw]

Subject: Re: [PATCH 0/7] Kill PF_MEMALLOC abuse

> On Tue, 2009-11-17 at 17:33 +0900, KOSAKI Motohiro wrote:
> >
> > if there is so such reason. we might need to implement another MM trick.
> > but keeping this strage usage is not a option. All memory freeing activity
> > (e.g. page out, task killing) need some memory. we need to protect its
> > emergency memory. otherwise linux reliability decrease dramatically when
> > the system face to memory stress.
>
> In general PF_MEMALLOC is a particularly bad idea, even for the VM when
> not coupled with limiting the consumption. That is one should make an
> upper-bound estimation of the memory needed for a writeout-path per
> page, and reserve a small multiple thereof, and limit the number of
> pages written out so as to never exceed this estimate.
>
> If the current mempool interface isn't sufficient (not hard to imagine),
> look at the swap over NFS patch-set, that includes a much more able
> reservation scheme, and accounting framework.

Yes, I agree.

In this discussion, some people explained why their subsystem need
emergency memory, but nobody claim sharing memory pool against VM and
surely want to stop reclaim (PF_MEMALLOC's big side effect).

OK. I try to review your patch carefully and remake this patch series on top
your reservation framework in swap-over-nfs patch series.

2009-11-18 06:17:52

[permalink] [raw]

Subject: Re: [PATCH 1/7] dm: use __GFP_HIGH instead PF_MEMALLOC

Hi,

Thank you for give me comment.

> On Tue, Nov 17, 2009 at 04:17:07PM +0900, KOSAKI Motohiro wrote:
> > Non MM subsystem must not use PF_MEMALLOC. Memory reclaim need few
> > memory, anyone must not prevent it. Otherwise the system cause
> > mysterious hang-up and/or OOM Killer invokation.
>
> This code is also on the critical path, for example, if you are swapping
> onto a dm device. (There are ways we could reduce its use further as
> not every dm ioctl needs to be on the critical path and the buffer size
> could be limited for the ioctls that do.)

May I ask one additional question?
Original code is here.

-------------------------------------------------------
/*
* Trying to avoid low memory issues when a device is
* suspended.
*/
current->flags |= PF_MEMALLOC;

/*
* Copy the parameters into kernel space.
*/
r = copy_params(user, &param);

current->flags &= ~PF_MEMALLOC;
-------------------------------------------------------

but PF_MEMALLOC doesn't gurantee allocation successfull. In your case,
mempoll seems better to me. copy_params seems enough small function
and we can rewrite it. Why didn't you use mempool?

Am I missing something?

> But what situations have been causing you trouble? The OOM killer must
> generally avoid killing userspace processes that suspend & resume dm
> devices, and there are tight restrictions on what those processes
> can do safely between suspending and resuming.

No. This is theorical issue. but I really want to avoid stress weakness
kernel.

2009-11-18 06:31:20

[permalink] [raw]

Subject: Re: [PATCH 6/7] cifs: Don't use PF_MEMALLOC

> It is hard to follow exactly what this flag does in /mm (other than try
> harder on memory allocations) - I haven't found much about this flag (e.g.
> http://lwn.net/Articles/246928/) but it does look like most of the fs no
> longer set this (except xfs) e.g. ext3_ordered_writepage. When running out
> of memory in the cifs_demultiplex_thread it will retry 3 seconds later, but
> if memory allocations ever fail in this path we could potentially be holding
> up (an already issued write in) writepages for that period by not having
> memory to get the response to see if the write succeeded.
>
> We pass in few flags for these memory allocation requests: GFP_NOFS (on the
> mempool_alloc) and SLAB_HWCACHE_ALIGN (on the kmem_cache_create of the pool)
> should we be passing in other flags on the allocations?

I don't think you need change more.

2009-11-18 08:56:44

[permalink] [raw]

Subject: Re: [PATCH 7/7] xfs: Don't use PF_MEMALLOC

> On Tue, Nov 17, 2009 at 04:23:43PM +0900, KOSAKI Motohiro wrote:
> >
> > Non MM subsystem must not use PF_MEMALLOC. Memory reclaim need few
> > memory, anyone must not prevent it. Otherwise the system cause
> > mysterious hang-up and/or OOM Killer invokation.
>
> The xfsbufd is a woken run by a registered memory shaker. i.e. it
> runs when the system needs to reclaim memory. It forceѕ the
> delayed write metadata buffers (of which there can be a lot) to disk
> so that they can be reclaimed on IO completion. This IO submission
> may require ѕome memory to be allocated to be able to free that
> memory.
>
> Hence, AFAICT the use of PF_MEMALLOC is valid here.

Thanks a lot.
I have one additional question, may I ask you?

How can we calculate maximum memory usage in xfsbufd?
I'm afraid that VM and XFS works properly but adding two makes memory exhaust.

And, I conclude XFS doesn't need sharing reservation memory with VM,
it only need non failed allocation. right? IOW I'm prefer perter's
suggestion.

2009-11-18 09:56:35

[permalink] [raw]

Subject: Re: [PATCH 2/7] mmc: Don't use PF_MEMALLOC

On Wed, 2009-11-18 at 09:01 +0900, Minchan Kim wrote:
> Hi, Peter.
>
> First of all, Thanks for the commenting.
>
> On Wed, Nov 18, 2009 at 5:47 AM, Peter Zijlstra <[email protected]> wrote:
> > On Tue, 2009-11-17 at 21:51 +0900, Minchan Kim wrote:
> >> I think it's because mempool reserves memory.
> >> (# of I/O issue\0 is hard to be expected.
> >> How do we determine mempool size of each block driver?
> >> For example, maybe, server use few I/O for nand.
> >> but embedded system uses a lot of I/O.
> >
> > No, you scale the mempool to the minimum amount required to make
> > progress -- this includes limiting the 'concurrency' when handing out
> > mempool objects.
> >
> > If you run into such tight corners often enough to notice it, there's
> > something else wrong.
> >
> > I fully agree with ripping out PF_MEMALLOC from pretty much everything,
> > including the VM, getting rid of the various abuse outside of the VM
> > seems like a very good start.
> >
>
> I am not against removing PF_MEMALLOC.
> Totally, I agree to prevent abusing of PF_MEMALLOC.
>
> What I have a concern is per-block mempool.
> Although it's minimum amount of mempool, it can be increased
> by adding new block driver. I am not sure how many we will have block driver.
>
> And, person who develop new driver always have to use mempool and consider
> what is minimum of mempool.
> I think this is a problem of mempool, now.
>
> How about this?
> According to system memory, kernel have just one mempool for I/O which
> is one shared by several block driver.
>
> And we make new API block driver can use.
> Of course, as usual It can use dynamic memoy. Only it can use mempool if
> system don't have much dynamic memory.
>
> In this case, we can control read/write path. read I/O can't help
> memory reclaiming.
> So I think read I/O don't use mempool, I am not sure. :)

Sure some generic blocklevel infrastructure might work, _but_ you cannot
take away the responsibility of determining the amount of memory needed,
nor does any of this have any merit if you do not limit yourself to that
amount.

Current PF_MEMALLOC usage in the VM is utterly broken in that we can
have a basically unlimited amount of tasks hit direct reclaim and all of
them will then consume PF_MEMALLOC, which mean we can easily run out of
memory.

( unless I missed the direct reclaim throttle patches going in, which
isn't at all impossible )

2009-11-18 10:31:20

[permalink] [raw]

Subject: Re: [PATCH 2/7] mmc: Don't use PF_MEMALLOC

On Wed, Nov 18, 2009 at 6:56 PM, Peter Zijlstra <[email protected]> wrote:
> On Wed, 2009-11-18 at 09:01 +0900, Minchan Kim wrote:
>> Hi, Peter.
>>
>> First of all, Thanks for the commenting.
>>
>> On Wed, Nov 18, 2009 at 5:47 AM, Peter Zijlstra <[email protected]> wrote:
>> > On Tue, 2009-11-17 at 21:51 +0900, Minchan Kim wrote:
>> >> I think it's because mempool reserves memory.
>> >> (# of I/O issue\0 is hard to be expected.
>> >> How do we determine mempool size of each block driver?
>> >> For example, maybe, server use few I/O for nand.
>> >> but embedded system uses a lot of I/O.
>> >
>> > No, you scale the mempool to the minimum amount required to make
>> > progress -- this includes limiting the 'concurrency' when handing out
>> > mempool objects.
>> >
>> > If you run into such tight corners often enough to notice it, there's
>> > something else wrong.
>> >
>> > I fully agree with ripping out PF_MEMALLOC from pretty much everything,
>> > including the VM, getting rid of the various abuse outside of the VM
>> > seems like a very good start.
>> >
>>
>> I am not against removing PF_MEMALLOC.
>> Totally, I agree to prevent abusing of PF_MEMALLOC.
>>
>> What I have a concern is per-block mempool.
>> Although it's minimum amount of mempool, it can be increased
>> by adding new block driver. I am not sure how many we will have block driver.
>>
>> And, person who develop new driver always have to use mempool and consider
>> what is minimum of mempool.
>> I think this is a problem of mempool, now.
>>
>> How about this?
>> According to system memory, kernel have just one mempool for I/O which
>> is one shared by several block driver.
>>
>> And we make new API block driver can use.
>> Of course, as usual It can use dynamic memoy. Only it can use mempool if
>> system don't have much dynamic memory.
>>
>> In this case, we can control read/write path. read I/O can't help
>> memory reclaiming.
>> So I think read I/O don't use mempool, I am not sure. :)
>
> Sure some generic blocklevel infrastructure might work, _but_ you cannot
> take away the responsibility of determining the amount of memory needed,
> nor does any of this have any merit if you do not limit yourself to that
> amount.

Yes. Some one have to take a responsibility.

The intention was we could take away the responsibility from block driver.
Instead of driver, VM would take the responsibility.

You mean althgouth VM could take the responsiblity, it is hard to
expect amout of pages
needed by block drivers?

Yes, I agree.

>
> Current PF_MEMALLOC usage in the VM is utterly broken in that we can
> have a basically unlimited amount of tasks hit direct reclaim and all of
> them will then consume PF_MEMALLOC, which mean we can easily run out of
> memory.
>
> ( unless I missed the direct reclaim throttle patches going in, which
> isn't at all impossible )

I think we can prevent it at least. Kosaki already submitted the patches. :)
(too_many_isolated functions).

I am looking forward to kosaki's next version.

Thanks for careful comment, Peter.
Thanks for submitting good issue, Kosaki. :)

>
>
>

--
Kind regards,
Minchan Kim

2009-11-18 10:55:01

[permalink] [raw]

Subject: Re: [PATCH 2/7] mmc: Don't use PF_MEMALLOC

On Wed, 2009-11-18 at 19:31 +0900, Minchan Kim wrote:
> >
> > Sure some generic blocklevel infrastructure might work, _but_ you cannot
> > take away the responsibility of determining the amount of memory needed,
> > nor does any of this have any merit if you do not limit yourself to that
> > amount.
>
> Yes. Some one have to take a responsibility.
>
> The intention was we could take away the responsibility from block driver.
> Instead of driver, VM would take the responsibility.
>
> You mean althgouth VM could take the responsiblity, it is hard to
> expect amout of pages needed by block drivers?

Correct, its near impossible for the VM to accurately guess the amount
of memory needed for a driver, or limit the usage of the driver.

The driver could be very simple in that it'll just start a DMA on the
page and get an interrupt when done, not consuming much (if any) memory
beyond the generic BIO structure, but it could also be some iSCSI
monstrosity which involves the full network stack and userspace.

That is why I generally prefer the user of PF_MEMALLOC to take
responsibility, because it knows its own consumption and can limit its
own consumption.

Now, I don't think (but I could be wring here) that you need to bother
with PF_MEMALLOC unless you're swapping. File based pages should always
be able to free some memory due to the dirty-limit, whcih basically
guarantees that there are some clean file pages for every dirty file
page.

My swap-over-nfs series used to have a block-layer hook to expose the
swap-over-block behaviour:

http://programming.kicks-ass.net/kernel-patches/vm_deadlock/v12.99/blk_queue_swapdev.patch

That gives block devices the power to refuse being swapped over, which
could be an alternative to using PF_MEMALLOC.

2009-11-18 11:15:07

[permalink] [raw]

Subject: Re: [PATCH 2/7] mmc: Don't use PF_MEMALLOC

On Wed, Nov 18, 2009 at 7:54 PM, Peter Zijlstra <[email protected]> wrote:
> On Wed, 2009-11-18 at 19:31 +0900, Minchan Kim wrote:
>> >
>> > Sure some generic blocklevel infrastructure might work, _but_ you cannot
>> > take away the responsibility of determining the amount of memory needed,
>> > nor does any of this have any merit if you do not limit yourself to that
>> > amount.
>>
>> Yes. Some one have to take a responsibility.
>>
>> The intention was we could take away the responsibility from block driver.
>> Instead of driver, VM would take the responsibility.
>>
>> You mean althgouth VM could take the responsiblity, it is hard to
>> expect amout of pages needed by block drivers?
>
> Correct, its near impossible for the VM to accurately guess the amount
> of memory needed for a driver, or limit the usage of the driver.
>
> The driver could be very simple in that it'll just start a DMA on the
> page and get an interrupt when done, not consuming much (if any) memory
> beyond the generic BIO structure, but it could also be some iSCSI
> monstrosity which involves the full network stack and userspace.

Wow, Thanks for good example.
Until now, I don't know iSCSI is such memory hog driver.

> That is why I generally prefer the user of PF_MEMALLOC to take
> responsibility, because it knows its own consumption and can limit its
> own consumption.

Okay. I understand your point by good explanation.

> Now, I don't think (but I could be wring here) that you need to bother
> with PF_MEMALLOC unless you're swapping. File based pages should always
> be able to free some memory due to the dirty-limit, whcih basically
> guarantees that there are some clean file pages for every dirty file
> page.
>
> My swap-over-nfs series used to have a block-layer hook to expose the
> swap-over-block behaviour:
>
> http://programming.kicks-ass.net/kernel-patches/vm_deadlock/v12.99/blk_queue_swapdev.patch
>
> That gives block devices the power to refuse being swapped over, which
> could be an alternative to using PF_MEMALLOC.
>

Thanks for noticing me.
I will look at your patches.
Thanks again, Peter.

--
Kind regards,
Minchan Kim

2009-11-18 22:16:47

by Dave Chinner

[permalink] [raw]

Subject: Re: [PATCH 7/7] xfs: Don't use PF_MEMALLOC

On Wed, Nov 18, 2009 at 05:56:46PM +0900, KOSAKI Motohiro wrote:
> > On Tue, Nov 17, 2009 at 04:23:43PM +0900, KOSAKI Motohiro wrote:
> > >
> > > Non MM subsystem must not use PF_MEMALLOC. Memory reclaim need few
> > > memory, anyone must not prevent it. Otherwise the system cause
> > > mysterious hang-up and/or OOM Killer invokation.
> >
> > The xfsbufd is a woken run by a registered memory shaker. i.e. it
> > runs when the system needs to reclaim memory. It forceѕ the
> > delayed write metadata buffers (of which there can be a lot) to disk
> > so that they can be reclaimed on IO completion. This IO submission
> > may require ѕome memory to be allocated to be able to free that
> > memory.
> >
> > Hence, AFAICT the use of PF_MEMALLOC is valid here.
>
> Thanks a lot.
> I have one additional question, may I ask you?
>
> How can we calculate maximum memory usage in xfsbufd?

It doesn't get calculated at the moment. It is very difficult to
calculate a usable size metric for it because there are multiple
caches (up to 3 per filesystem), and dentry/inode reclaim causes the
size of the cache to grow. Hence the size of the cache is not
really something that can be considered a stable or predictable
input into a "reclaim now" calculation. As such we simply cause
xfsbufd run simultaneously with the shrinkers that cause it to
grow....

> I'm afraid that VM and XFS works properly but adding two makes memory exhaust.

I don't understand what you are trying to say here.

> And, I conclude XFS doesn't need sharing reservation memory with VM,
> it only need non failed allocation. right? IOW I'm prefer perter's
> suggestion.

Right. However, it is worth keeping in mind that this is a
performance critical path for inode reclaim. Hence any throttling
of allocation will slow down the rate at which memory is freed by
the system....

Cheers,

Dave.
--
Dave Chinner
[email protected]

2009-11-23 15:01:27

by Artem Bityutskiy

[permalink] [raw]

Subject: Re: [PATCH 4/7] nandsim: Don't use PF_MEMALLOC

On Tue, 2009-11-17 at 16:19 +0900, KOSAKI Motohiro wrote:
> Non MM subsystem must not use PF_MEMALLOC. Memory reclaim need few
> memory, anyone must not prevent it. Otherwise the system cause
> mysterious hang-up and/or OOM Killer invokation.
>
> Cc: David Woodhouse <[email protected]>
> Cc: Artem Bityutskiy <[email protected]>
> Cc: [email protected]
> Signed-off-by: KOSAKI Motohiro <[email protected]>
> ---
> drivers/mtd/nand/nandsim.c | 22 ++--------------------
> 1 files changed, 2 insertions(+), 20 deletions(-)
>
> diff --git a/drivers/mtd/nand/nandsim.c b/drivers/mtd/nand/nandsim.c
> index cd0711b..97a8bbb 100644
> --- a/drivers/mtd/nand/nandsim.c
> +++ b/drivers/mtd/nand/nandsim.c
> @@ -1322,34 +1322,18 @@ static int get_pages(struct nandsim *ns, struct file *file, size_t count, loff_t
> return 0;
> }
>
> -static int set_memalloc(void)
> -{
> - if (current->flags & PF_MEMALLOC)
> - return 0;
> - current->flags |= PF_MEMALLOC;
> - return 1;
> -}
> -
> -static void clear_memalloc(int memalloc)
> -{
> - if (memalloc)
> - current->flags &= ~PF_MEMALLOC;
> -}
> -
> static ssize_t read_file(struct nandsim *ns, struct file *file, void *buf, size_t count, loff_t *pos)
> {
> mm_segment_t old_fs;
> ssize_t tx;
> - int err, memalloc;
> + int err;
>
> err = get_pages(ns, file, count, *pos);
> if (err)
> return err;
> old_fs = get_fs();
> set_fs(get_ds());
> - memalloc = set_memalloc();
> tx = vfs_read(file, (char __user *)buf, count, pos);
> - clear_memalloc(memalloc);
> set_fs(old_fs);
> put_pages(ns);
> return tx;
> @@ -1359,16 +1343,14 @@ static ssize_t write_file(struct nandsim *ns, struct file *file, void *buf, size
> {
> mm_segment_t old_fs;
> ssize_t tx;
> - int err, memalloc;
> + int err;
>
> err = get_pages(ns, file, count, *pos);
> if (err)
> return err;
> old_fs = get_fs();
> set_fs(get_ds());
> - memalloc = set_memalloc();
> tx = vfs_write(file, (char __user *)buf, count, pos);
> - clear_memalloc(memalloc);
> set_fs(old_fs);
> put_pages(ns);
> return tx;

I vaguely remember Adrian (CCed) did this on purpose. This is for the
case when nandsim emulates NAND flash on top of a file. So there are 2
file-systems involved: one sits on top of nandsim (e.g. UBIFS) and the
other owns the file which nandsim uses (e.g., ext3).

And I really cannot remember off the top of my head why he needed
PF_MEMALLOC, but I think Adrian wanted to prevent the direct reclaim
path to re-enter, say UBIFS, and cause deadlock. But I'd thing that all
the allocations in vfs_read()/vfs_write() should be GFP_NOFS, so that
should not be a probelm?

--
Best Regards,
Artem Bityutskiy (Артём Битюцкий)

2009-11-23 20:02:11

by Adrian Hunter

[permalink] [raw]

Subject: Re: [PATCH 4/7] nandsim: Don't use PF_MEMALLOC

Bityutskiy Artem (Nokia-D/Helsinki) wrote:
> On Tue, 2009-11-17 at 16:19 +0900, KOSAKI Motohiro wrote:
>> Non MM subsystem must not use PF_MEMALLOC. Memory reclaim need few
>> memory, anyone must not prevent it. Otherwise the system cause
>> mysterious hang-up and/or OOM Killer invokation.
>>
>> Cc: David Woodhouse <[email protected]>
>> Cc: Artem Bityutskiy <[email protected]>
>> Cc: [email protected]
>> Signed-off-by: KOSAKI Motohiro <[email protected]>
>> ---
>> drivers/mtd/nand/nandsim.c | 22 ++--------------------
>> 1 files changed, 2 insertions(+), 20 deletions(-)
>>
>> diff --git a/drivers/mtd/nand/nandsim.c b/drivers/mtd/nand/nandsim.c
>> index cd0711b..97a8bbb 100644
>> --- a/drivers/mtd/nand/nandsim.c
>> +++ b/drivers/mtd/nand/nandsim.c
>> @@ -1322,34 +1322,18 @@ static int get_pages(struct nandsim *ns, struct file *file, size_t count, loff_t
>> return 0;
>> }
>>
>> -static int set_memalloc(void)
>> -{
>> - if (current->flags & PF_MEMALLOC)
>> - return 0;
>> - current->flags |= PF_MEMALLOC;
>> - return 1;
>> -}
>> -
>> -static void clear_memalloc(int memalloc)
>> -{
>> - if (memalloc)
>> - current->flags &= ~PF_MEMALLOC;
>> -}
>> -
>> static ssize_t read_file(struct nandsim *ns, struct file *file, void *buf, size_t count, loff_t *pos)
>> {
>> mm_segment_t old_fs;
>> ssize_t tx;
>> - int err, memalloc;
>> + int err;
>>
>> err = get_pages(ns, file, count, *pos);
>> if (err)
>> return err;
>> old_fs = get_fs();
>> set_fs(get_ds());
>> - memalloc = set_memalloc();
>> tx = vfs_read(file, (char __user *)buf, count, pos);
>> - clear_memalloc(memalloc);
>> set_fs(old_fs);
>> put_pages(ns);
>> return tx;
>> @@ -1359,16 +1343,14 @@ static ssize_t write_file(struct nandsim *ns, struct file *file, void *buf, size
>> {
>> mm_segment_t old_fs;
>> ssize_t tx;
>> - int err, memalloc;
>> + int err;
>>
>> err = get_pages(ns, file, count, *pos);
>> if (err)
>> return err;
>> old_fs = get_fs();
>> set_fs(get_ds());
>> - memalloc = set_memalloc();
>> tx = vfs_write(file, (char __user *)buf, count, pos);
>> - clear_memalloc(memalloc);
>> set_fs(old_fs);
>> put_pages(ns);
>> return tx;PF_MEMALLOC,
>
> I vaguely remember Adrian (CCed) did this on purpose. This is for the
> case when nandsim emulates NAND flash on top of a file. So there are 2
> file-systems involved: one sits on top of nandsim (e.g. UBIFS) and the
> other owns the file which nandsim uses (e.g., ext3).
>
> And I really cannot remember off the top of my head why he needed
> PF_MEMALLOC, but I think Adrian wanted to prevent the direct reclaim
> path to re-enter, say UBIFS, and cause deadlock. But I'd thing that all
> the allocations in vfs_read()/vfs_write() should be GFP_NOFS, so that
> should not be a probelm?
>

Yes it needs PF_MEMALLOC to prevent deadlock because there can be a
file system on top of nandsim which, in this case, is on top of another
file system.

I do not see how mempools will help here.

Please offer an alternative solution.

2009-11-24 10:47:20

[permalink] [raw]

Subject: Re: [PATCH 4/7] nandsim: Don't use PF_MEMALLOC

Hi

Thank you for this useful comments.

> > I vaguely remember Adrian (CCed) did this on purpose. This is for the
> > case when nandsim emulates NAND flash on top of a file. So there are 2
> > file-systems involved: one sits on top of nandsim (e.g. UBIFS) and the
> > other owns the file which nandsim uses (e.g., ext3).
> >
> > And I really cannot remember off the top of my head why he needed
> > PF_MEMALLOC, but I think Adrian wanted to prevent the direct reclaim
> > path to re-enter, say UBIFS, and cause deadlock. But I'd thing that all
> > the allocations in vfs_read()/vfs_write() should be GFP_NOFS, so that
> > should not be a probelm?
> >
>
> Yes it needs PF_MEMALLOC to prevent deadlock because there can be a
> file system on top of nandsim which, in this case, is on top of another
> file system.
>
> I do not see how mempools will help here.
>
> Please offer an alternative solution.

I have few questions.

Can you please explain more detail? Another stackable filesystam
(e.g. ecryptfs) don't have such problem. Why nandsim have its issue?
What lock cause deadlock?

2009-11-24 11:56:53

by Adrian Hunter

[permalink] [raw]

Subject: Re: [PATCH 4/7] nandsim: Don't use PF_MEMALLOC

ext KOSAKI Motohiro wrote:
> Hi
>
> Thank you for this useful comments.
>
>>> I vaguely remember Adrian (CCed) did this on purpose. This is for the
>>> case when nandsim emulates NAND flash on top of a file. So there are 2
>>> file-systems involved: one sits on top of nandsim (e.g. UBIFS) and the
>>> other owns the file which nandsim uses (e.g., ext3).
>>>
>>> And I really cannot remember off the top of my head why he needed
>>> PF_MEMALLOC, but I think Adrian wanted to prevent the direct reclaim
>>> path to re-enter, say UBIFS, and cause deadlock. But I'd thing that all
>>> the allocations in vfs_read()/vfs_write() should be GFP_NOFS, so that
>>> should not be a probelm?
>>>
>> Yes it needs PF_MEMALLOC to prevent deadlock because there can be a
>> file system on top of nandsim which, in this case, is on top of another
>> file system.
>>
>> I do not see how mempools will help here.
>>
>> Please offer an alternative solution.
>
> I have few questions.
>
> Can you please explain more detail? Another stackable filesystam
> (e.g. ecryptfs) don't have such problem. Why nandsim have its issue?
> What lock cause deadlock?
>
>
>

The file systems are not stacked. One is over nandsim, which nandsim
does not know about because it is just a lowly NAND device, and, with
the file cache option, one file system below to provide the file cache.

The deadlock is the kernel writing out dirty pages to the top file system
which writes to nandsim which writes to the bottom file system which
allocates memory which causes dirty pages to be written out to the top
file system, which tries to write to nandsim => deadlock.

2009-11-25 00:42:14

[permalink] [raw]

Subject: Re: [PATCH 4/7] nandsim: Don't use PF_MEMALLOC

> ext KOSAKI Motohiro wrote:
> > Hi
> >
> > Thank you for this useful comments.
> >
> >>> I vaguely remember Adrian (CCed) did this on purpose. This is for the
> >>> case when nandsim emulates NAND flash on top of a file. So there are 2
> >>> file-systems involved: one sits on top of nandsim (e.g. UBIFS) and the
> >>> other owns the file which nandsim uses (e.g., ext3).
> >>>
> >>> And I really cannot remember off the top of my head why he needed
> >>> PF_MEMALLOC, but I think Adrian wanted to prevent the direct reclaim
> >>> path to re-enter, say UBIFS, and cause deadlock. But I'd thing that all
> >>> the allocations in vfs_read()/vfs_write() should be GFP_NOFS, so that
> >>> should not be a probelm?
> >>>
> >> Yes it needs PF_MEMALLOC to prevent deadlock because there can be a
> >> file system on top of nandsim which, in this case, is on top of another
> >> file system.
> >>
> >> I do not see how mempools will help here.
> >>
> >> Please offer an alternative solution.
> >
> > I have few questions.
> >
> > Can you please explain more detail? Another stackable filesystam
> > (e.g. ecryptfs) don't have such problem. Why nandsim have its issue?
> > What lock cause deadlock?
>
> The file systems are not stacked. One is over nandsim, which nandsim
> does not know about because it is just a lowly NAND device, and, with
> the file cache option, one file system below to provide the file cache.
>
> The deadlock is the kernel writing out dirty pages to the top file system
> which writes to nandsim which writes to the bottom file system which
> allocates memory which causes dirty pages to be written out to the top
> file system, which tries to write to nandsim => deadlock.

You mean you want to prevent pageout() instead reclaim itself?
Dropping filecache seems don't make recursive call, right?

2009-11-25 07:13:53

by Adrian Hunter

[permalink] [raw]

Subject: Re: [PATCH 4/7] nandsim: Don't use PF_MEMALLOC

KOSAKI Motohiro wrote:
>> KOSAKI Motohiro wrote:
>>> Hi
>>>
>>> Thank you for this useful comments.
>>>
>>>>> I vaguely remember Adrian (CCed) did this on purpose. This is for the
>>>>> case when nandsim emulates NAND flash on top of a file. So there are 2
>>>>> file-systems involved: one sits on top of nandsim (e.g. UBIFS) and the
>>>>> other owns the file which nandsim uses (e.g., ext3).
>>>>>
>>>>> And I really cannot remember off the top of my head why he needed
>>>>> PF_MEMALLOC, but I think Adrian wanted to prevent the direct reclaim
>>>>> path to re-enter, say UBIFS, and cause deadlock. But I'd thing that all
>>>>> the allocations in vfs_read()/vfs_write() should be GFP_NOFS, so that
>>>>> should not be a probelm?
>>>>>
>>>> Yes it needs PF_MEMALLOC to prevent deadlock because there can be a
>>>> file system on top of nandsim which, in this case, is on top of another
>>>> file system.
>>>>
>>>> I do not see how mempools will help here.
>>>>
>>>> Please offer an alternative solution.
>>> I have few questions.
>>>
>>> Can you please explain more detail? Another stackable filesystam
>>> (e.g. ecryptfs) don't have such problem. Why nandsim have its issue?
>>> What lock cause deadlock?
>> The file systems are not stacked. One is over nandsim, which nandsim
>> does not know about because it is just a lowly NAND device, and, with
>> the file cache option, one file system below to provide the file cache.
>>
>> The deadlock is the kernel writing out dirty pages to the top file system
>> which writes to nandsim which writes to the bottom file system which
>> allocates memory which causes dirty pages to be written out to the top
>> file system, which tries to write to nandsim => deadlock.
>
> You mean you want to prevent pageout() instead reclaim itself?

Yes

> Dropping filecache seems don't make recursive call, right?

Yes

2009-11-25 07:18:47