2002-10-24 08:18:28

by Christopher Li

Subject: writepage return value check in vmscan.c

Hi Andrew,

It might be a silly question.

Is it a bug in vmscan.c that it does not check the return value of
writepage when shrinking the cache?

Some background of the problem:

Some time ago I was tracing a heavy swapping problem related to high
memory usage when running many virtual machines on VMware GSX Server.
If the total ram size of the different virtual machines adds up to a
certain level, bigger than 2G on a 4G machine, the kernel will keep
swapping and stop responding.

VMware opens a tmp ram file for each virtual machine and mmaps that
file to share memory between different processes.

I soon found out about the hidden size limit on shmfs, even though the
ram file was opened on an ext2/ext3 file system. If I instead open the
ram file on /dev/shm, the problem seems to go away. My guess was that
the /tmp directory was full, but a bigger /tmp did not help; that is
another issue though.

I tried to find out what happens if a user memory-maps a sparse file,
the kernel then tries to write it back to disk, and it hits a
no-disk-space error. To my surprise, it seems that both the 2.4 and 2.5
kernels do not check the return value of "writepage". If there is an
error like ENOSPC, is the data just dropped on the ground? Am I missing
something obvious?

BTW, I am amazed at how many ways a user can abuse the mmap system
call. E.g. open a file, ftruncate it to a bigger size, unlink the file
while keeping the file descriptor, mmap some memory using that
descriptor, then close the descriptor; you can still use that mmapped
memory.
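
For concreteness, that sequence looks roughly like this (a minimal
sketch; the file name and size are made up and error checking is
omitted):

        #include <fcntl.h>
        #include <string.h>
        #include <sys/mman.h>
        #include <unistd.h>

        int main(void)
        {
                size_t len = 64 << 20;  /* 64M, just an example */
                int fd = open("/tmp/ramfile", O_RDWR | O_CREAT, 0600);

                ftruncate(fd, len);      /* grow the (sparse) file */
                unlink("/tmp/ramfile");  /* name gone, inode lives on */
                char *mem = mmap(NULL, len, PROT_READ | PROT_WRITE,
                                 MAP_SHARED, fd, 0);
                close(fd);               /* the mapping keeps it alive */
                memset(mem, 0xaa, len);  /* memory is still usable */
                return 0;
        }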

Cheers,

Chris


2002-10-24 08:30:38

by Andrew Morton

Subject: Re: writepage return value check in vmscan.c

[email protected] wrote:
>
> Hi Andrew,
>
> It might be a silly question.

It's an excellent question.

> ...
> I tried to find out what happens if a user memory-maps a sparse file,
> the kernel then tries to write it back to disk, and it hits a
> no-disk-space error. To my surprise, it seems that both the 2.4 and 2.5
> kernels do not check the return value of "writepage". If there is an
> error like ENOSPC, is the data just dropped on the ground? Am I missing
> something obvious?

Yup. If the kernel cannot write back your MAP_SHARED page due to
ENOSPC it throws your data away.

The alternative would be to allow you to pin an arbitrary amount of
unpageable memory.

A few fixes have been discussed. One way would be to allocate
the space for the page when it is first faulted into reality and
deliver SIGBUS if backing store for it could not be allocated.
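
A program could in principle catch that; a minimal userspace sketch
(the helper below is invented for illustration, not something from
this thread):

        #include <setjmp.h>
        #include <signal.h>

        static sigjmp_buf fault_env;

        static void on_sigbus(int sig)
        {
                siglongjmp(fault_env, 1);  /* escape the faulting store */
        }

        /* Returns 0 if the store succeeded, -1 if it raised SIGBUS
         * (e.g. backing store could not be allocated for the page). */
        static int try_store(volatile char *p, char val)
        {
                signal(SIGBUS, on_sigbus);
                if (sigsetjmp(fault_env, 1))
                        return -1;
                *p = val;
                return 0;
        }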

> BTW, I am amazed at how many ways a user can abuse the mmap system
> call. E.g. open a file, ftruncate it to a bigger size, unlink the file
> while keeping the file descriptor, mmap some memory using that
> descriptor, then close the descriptor; you can still use that mmapped
> memory.

Ayup. MAP_SHARED is a crock. If you want to write to a file, use write().

View MAP_SHARED as a tool by which separate processes can attach
to some shared memory which is identified by the filesystem namespace.
It's not a very good way of performing I/O.

That's not to say that you deserve to have the kernel silently throw
your data away as punishment for having used it though. Thanks for the
prod. We should do something about that.

2002-10-24 08:52:10

by Alan

Subject: Re: writepage return value check in vmscan.c

On Thu, 2002-10-24 at 09:36, Andrew Morton wrote:
> A few fixes have been discussed. One way would be to allocate
> the space for the page when it is first faulted into reality and
> deliver SIGBUS if backing store for it could not be allocated.

You still have to handle the situation where the page goes walkies and
you get ENOSPC or any other ERANDOMSUPRISE from things like NFS. SIGBUS
appears the right thing to do.

2002-10-24 11:24:51

by Andrea Arcangeli

Subject: Re: writepage return value check in vmscan.c

On Thu, Oct 24, 2002 at 01:36:43AM -0700, Andrew Morton wrote:
> prod. We should do something about that.

You need to preallocate the file, then mmap it. If you do, the kernel
won't throw the data away. So the fix for vmware is to preallocate the
file and mmap it afterwards. This way you will be notified by -ENOSPC
if you run out of disk/shmfs space. Other than this I'm not as against
MAP_SHARED as Andrew is; the reason the API is not so clean is that we
cannot have an API at all inside a page fault to notify userspace that
the ram modifications cannot be written to disk. The page fault must be
transparent; there's no return value, so if you run out of disk space
during the page fault, the page fault cannot easily tell userspace. As
said, the fix is very easy and consists of preallocating the space on
disk (I understand that on shmfs this may not be extremely desirable,
since you may prefer to defer allocation lazily to when you will need
the memory, but assuming your allocations are worthwhile it won't make
a difference after a few minutes/hours of usage, and this way you will
trap the -ENOSPC).
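
The preallocation itself is simple; a minimal sketch (the function
name and block size are arbitrary; short writes are ignored and the
size is assumed to be a multiple of the block size):

        #include <fcntl.h>
        #include <sys/mman.h>
        #include <unistd.h>

        /* Write real blocks before mmapping, so that ENOSPC surfaces
         * here at preallocation time instead of at writeback time. */
        void *prealloc_and_map(const char *path, size_t size)
        {
                static char block[65536];  /* zero-filled */
                void *mem;
                size_t done;
                int fd = open(path, O_RDWR | O_CREAT | O_TRUNC, 0600);

                if (fd < 0)
                        return NULL;
                for (done = 0; done < size; done += sizeof(block))
                        if (write(fd, block, sizeof(block)) < 0) {
                                close(fd);  /* e.g. ENOSPC: caught now */
                                return NULL;
                        }
                mem = mmap(NULL, size, PROT_READ | PROT_WRITE,
                           MAP_SHARED, fd, 0);
                close(fd);
                return mem == MAP_FAILED ? NULL : mem;
        }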

As for a task being able to reference a deleted file in memory, that's
true for many other scenarios (the user could leak space by keeping the
fd open and unlinking the file, while at the same time allocating lots
of ram with malloc; the result would be similar), and that's why root
will have to kill such malicious tasks in order to reclaim ram and disk
space.

Andrea

2002-10-24 11:38:36

by Andrea Arcangeli

Subject: Re: writepage return value check in vmscan.c

On Thu, Oct 24, 2002 at 10:15:06AM +0100, Alan Cox wrote:
> On Thu, 2002-10-24 at 09:36, Andrew Morton wrote:
> > A few fixes have been discussed. One way would be to allocate
> > the space for the page when it is first faulted into reality and
> > deliver SIGBUS if backing store for it could not be allocated.
>
> You still have to handle the situation where the page goes walkies and
> you get ENOSPC or any other ERANDOMSUPRISE from things like NFS. SIGBUS
> appears the right thing to do.

I would tend to agree SIGBUS could be the right thing to do since the
other (current) option is silent data corruption.

Andrea

2002-10-24 16:06:11

by Andrew Morton

Subject: Re: writepage return value check in vmscan.c

Andrea Arcangeli wrote:
>
> On Thu, Oct 24, 2002 at 10:15:06AM +0100, Alan Cox wrote:
> > On Thu, 2002-10-24 at 09:36, Andrew Morton wrote:
> > > A few fixes have been discussed. One way would be to allocate
> > > the space for the page when it is first faulted into reality and
> > > deliver SIGBUS if backing store for it could not be allocated.
> >
> > You still have to handle the situation where the page goes walkies and
> > you get ENOSPC or any other ERANDOMSUPRISE from things like NFS. SIGBUS
> > appears the right thing to do.
>
> I would tend to agree SIGBUS could be the right thing to do since the
> other (current) option is silent data corruption.
>

Or at least remember the data loss within the mapping for a subsequent
msync/fsync operation.

We'd need a similar thing for detecting write I/O errors too.

write(fd, data, len);
sleep(60);
fsync(fd); -> doesn't report write errors.

But that's all filed under "bug fixes" and can be done after you-know-when.

2002-10-24 17:52:43

by Christopher Li

Subject: Re: writepage return value check in vmscan.c

On Thu, Oct 24, 2002 at 10:15:06AM +0100, Alan Cox wrote:
> On Thu, 2002-10-24 at 09:36, Andrew Morton wrote:
> > A few fixes have been discussed. One way would be to allocate
> > the space for the page when it is first faulted into reality and
> > deliver SIGBUS if backing store for it could not be allocated.
>
> You still have to handle the situation where the page goes walkies and
> you get ENOSPC or any other ERANDOMSUPRISE from things like NFS. SIGBUS
> appears the right thing to do.
>
Exactly. It needs to preserve the page dirty bit in the first place.

Chris


2002-10-24 17:50:44

by Christopher Li

Subject: Re: writepage return value check in vmscan.c

On Thu, Oct 24, 2002 at 01:36:43AM -0700, Andrew Morton wrote:
>
> Yup. If the kernel cannot write back your MAP_SHARED page due to
> ENOSPC it throws your data away.
>
> The alternative would be to allow you to pin an arbitrary amount of
> unpageable memory.

I know the error handling for mmapped memory is poor, but I am not
talking about that one. There are two places where mmapped memory can
be flushed back to disk. The one you are talking about is
filemap_fdatasync() in filemap.c, which is called when unmapping or
syncing back to disk. It at least checks the error when writepage
fails, but it still clears the page dirty bit anyway, which looks bad
to me.

The one I am complaining about is in vmscan.c, when kswapd tries to
shrink the cache. Correct me if I am wrong: kswapd will flush some
mmapped pages back to disk to release pages from the page cache. Even
when writepage fails, the kernel will still do:

ClearPageDirty(page);
SetPageLaunder(page);

for that page. So this page can be reused by another process, and when
the original mapping takes a page fault on it, it will read back the
WRONG data from the disk and cause memory corruption.

So I am expecting something like this:

        if ((gfp_mask & __GFP_FS) && writepage) {
+               unsigned long flags = page->flags;

                ClearPageDirty(page);
                SetPageLaunder(page);
                page_cache_get(page);
                spin_unlock(&pagemap_lru_lock);

-               writepage(page);
+               if (writepage(page))
+                       page->flags = flags;

                page_cache_release(page);

                spin_lock(&pagemap_lru_lock);
                continue;
        }

>
> A few fixes have been discussed. One way would be to allocate
> the space for the page when it is first faulted into reality and
> deliver SIGBUS if backing store for it could not be allocated.

I am not sure how the user program would handle that signal...

>
> Ayup. MAP_SHARED is a crock. If you want to write to a file, use write().
>
> View MAP_SHARED as a tool by which separate processes can attach
> to some shared memory which is identified by the filesystem namespace.
> It's not a very good way of performing I/O.

That is exactly the case for the vmware ram file. VMware only uses it
to share memory; those files hold the virtual machines' memory. We
don't want to write it back to disk, and we don't care what is left on
the file system, because when vmware exits we throw the guest ram data
away, just as a real machine loses its ram at power off. We are not
talking about a machine using flash ram :-).

It is kswapd that tries to flush the data, so it should take
responsibility for handling the error. If the writeback fails, one
thing it should do is keep the page dirty. At least it should not
corrupt memory like that.

If we can deliver the error to the user program, that would be a plus.
But this needs to be fixed first.

Chris


2002-10-24 18:23:39

by Christopher Li

Subject: Re: writepage return value check in vmscan.c

On Thu, Oct 24, 2002 at 01:31:06PM +0200, Andrea Arcangeli wrote:
> On Thu, Oct 24, 2002 at 01:36:43AM -0700, Andrew Morton wrote:
>
> You need to preallocate the file, then mmap it. If you do, the kernel
> won't throw the data away. So the fix for vmware is to preallocate the
> file and mmap it afterwards. This way you will be notified by -ENOSPC
> if you run out of disk/shmfs space. Other than this I'm not as against
> MAP_SHARED as Andrew is; the reason the API is not so clean is that we
> cannot have an API at all inside a page fault to notify userspace that
> the ram modifications cannot be written to disk. The page fault must be
> transparent; there's no return value, so if you run out of disk space
> during the page fault, the page fault cannot easily tell userspace. As
> said, the fix is very easy and consists of preallocating the space on
> disk (I understand that on shmfs this may not be extremely desirable,
> since you may prefer to defer allocation lazily to when you will need
> the memory, but assuming your allocations are worthwhile it won't make
> a difference after a few minutes/hours of usage, and this way you will
> trap the -ENOSPC).

But preallocating the vmware ram file on disk is too expensive. It
would slow down guest OS boot a lot. Many users measure how fast vmware
is by counting how many seconds it takes to boot a windows guest, for
example. For a virtual machine that has 2G of ram, how long does it
take to write a file with 2G of data?

>
> As for a task being able to reference a deleted file in memory, that's
> true for many other scenarios (the user could leak space by keeping the
> fd open and unlinking the file, while at the same time allocating lots
> of ram with malloc; the result would be similar), and that's why root
> will have to kill such malicious tasks in order to reclaim ram and disk
> space.

vmware is definitely one of those malicious tasks ;-)

Chris


2002-10-24 18:27:19

by Andrea Arcangeli

Subject: Re: writepage return value check in vmscan.c

On Thu, Oct 24, 2002 at 10:57:18AM -0700, [email protected] wrote:
>         if ((gfp_mask & __GFP_FS) && writepage) {
> +               unsigned long flags = page->flags;
>
>                 ClearPageDirty(page);
>                 SetPageLaunder(page);
>                 page_cache_get(page);
>                 spin_unlock(&pagemap_lru_lock);
>
> -               writepage(page);
> +               if (writepage(page))
> +                       page->flags = flags;
>
>                 page_cache_release(page);
>
>                 spin_lock(&pagemap_lru_lock);
>                 continue;
>         }

Side note: you should use atomic bitflag operations here, or you risk
losing a bit set by another cpu between the read and the write. You
basically meant SetPageDirty() if writepage fails. That is supposed to
happen in the lowlevel layer (like in fail_writepage), but the problem
here is that this isn't ramfs, and block_write_full_page could leave
lots of pages locked in ram if it disallowed these pages to be
discarded from the vm.
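
I.e. the non-atomic save/restore of page->flags would become something
like this (a sketch only, with the same caveat that it can pin pages
in ram):

        if ((gfp_mask & __GFP_FS) && writepage) {
                ClearPageDirty(page);
                SetPageLaunder(page);
                page_cache_get(page);
                spin_unlock(&pagemap_lru_lock);

                if (writepage(page))
                        SetPageDirty(page);     /* atomic bitop */

                page_cache_release(page);

                spin_lock(&pagemap_lru_lock);
                continue;
        }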

> > A few fixes have been discussed. One way would be to allocate
> > the space for the page when it is first faulted into reality and
> > deliver SIGBUS if backing store for it could not be allocated.
>
> I am not sure how the user program would handle that signal...
>
> >
> > Ayup. MAP_SHARED is a crock. If you want to write to a file, use write().
> >
> > View MAP_SHARED as a tool by which separate processes can attach
> > to some shared memory which is identified by the filesystem namespace.
> > It's not a very good way of performing I/O.
>
> That is exactly the case for the vmware ram file. VMware only uses it
> to share memory; those files hold the virtual machines' memory. We
> don't want to write it back to disk, and we don't care what is left on
> the file system, because when vmware exits we throw the guest ram data
> away, just as a real machine loses its ram at power off. We are not
> talking about a machine using flash ram :-).
>
> It is kswapd that tries to flush the data, so it should take
> responsibility for handling the error. If the writeback fails, one
> thing it should do is keep the page dirty. At least it should not
> corrupt memory like that.
>
> If we can deliver the error to the user program, that would be a plus.
> But this needs to be fixed first.

As said, this cannot be fixed easily in the kernel, or it would be trivial to
lock up a machine by filling the fs, changing the i_size of a file and
marking all the ram in the machine dirty in the hole; the vm must be allowed
to discard those pages, invalidating those posted writes. At least
until a true solution is available you should change vmware to
preallocate the file; then it will work fine because you will catch the
ENOSPC error during the preallocation. If you work on shmfs that will be
very quick indeed.

Andrea

2002-10-24 18:33:49

by Andrea Arcangeli

Subject: Re: writepage return value check in vmscan.c

On Thu, Oct 24, 2002 at 11:30:24AM -0700, [email protected] wrote:
> On Thu, Oct 24, 2002 at 01:31:06PM +0200, Andrea Arcangeli wrote:
> > On Thu, Oct 24, 2002 at 01:36:43AM -0700, Andrew Morton wrote:
> >
> > You need to preallocate the file, then mmap it. If you do, the kernel
> > won't throw the data away. So the fix for vmware is to preallocate the
> > file and mmap it afterwards. This way you will be notified by -ENOSPC
> > if you run out of disk/shmfs space. Other than this I'm not as against
> > MAP_SHARED as Andrew is; the reason the API is not so clean is that we
> > cannot have an API at all inside a page fault to notify userspace that
> > the ram modifications cannot be written to disk. The page fault must be
> > transparent; there's no return value, so if you run out of disk space
> > during the page fault, the page fault cannot easily tell userspace. As
> > said, the fix is very easy and consists of preallocating the space on
> > disk (I understand that on shmfs this may not be extremely desirable,
> > since you may prefer to defer allocation lazily to when you will need
> > the memory, but assuming your allocations are worthwhile it won't make
> > a difference after a few minutes/hours of usage, and this way you will
> > trap the -ENOSPC).
>
> But preallocating the vmware ram file on disk is too expensive. It
> would slow down guest OS boot a lot. Many users measure how fast vmware
> is by counting how many seconds it takes to boot a windows guest, for
> example. For a virtual machine that has 2G of ram, how long does it
> take to write a file with 2G of data?

Unfortunately I see no way around it, and patching the kernel to loop
forever on dirty pages that may never be possible to write doesn't look
safe. You could check the free space on the fs and bug the user if it
has less than 2G free (still, that's not 100% reliable, it's a racy
check; but you could also add a 100% reliable option that slows down the
startup of the vm but guarantees no corruption can happen).

Furthermore, if your machine has 2G of data you're likely to have >2G
of ram, and shmfs should be quick allocating 2G in such a case, maybe
2-3 seconds?

Andrea

2002-10-24 19:08:13

by Rik van Riel

Subject: Re: writepage return value check in vmscan.c

On Thu, 24 Oct 2002, Andrea Arcangeli wrote:

> Unfortunately I see no way around it, and patching the kernel to loop
> forever on dirty pages that may never be possible to write doesn't look
> safe. You could check the free space on the fs and bug the user if it
> has less than 2G free (still, that's not 100% reliable, it's a racy
> check; but you could also add a 100% reliable option that slows down the
> startup of the vm but guarantees no corruption can happen).

We need space allocation. Not just for this (probably rare) case,
but also for the more generic optimisation of delayed allocation.

cheers,

Rik
--
A: No.
Q: Should I include quotations after my reply?

http://www.surriel.com/ http://distro.conectiva.com/

2002-10-24 19:08:53

by Christopher Li

Subject: Re: writepage return value check in vmscan.c

On Thu, Oct 24, 2002 at 08:33:27PM +0200, Andrea Arcangeli wrote:
> On Thu, Oct 24, 2002 at 10:57:18AM -0700, [email protected] wrote:
> >         if ((gfp_mask & __GFP_FS) && writepage) {
> > +               unsigned long flags = page->flags;
> >
> >                 ClearPageDirty(page);
> >                 SetPageLaunder(page);
> >                 page_cache_get(page);
> >                 spin_unlock(&pagemap_lru_lock);
> >
> > -               writepage(page);
> > +               if (writepage(page))
> > +                       page->flags = flags;
> >
> >                 page_cache_release(page);
> >
> >                 spin_lock(&pagemap_lru_lock);
> >                 continue;
> >         }
>
> Side note: you should use atomic bitflag operations here, or you risk
> losing a bit set by another cpu between the read and the write. You

Thanks. I was just shooting in the dark.

> basically meant SetPageDirty() if writepage fails. That is supposed to
> happen in the lowlevel layer (like in fail_writepage), but the problem
> here is that this isn't ramfs, and block_write_full_page could leave
> lots of pages locked in ram if it disallowed these pages to be
> discarded from the vm.

Exactly.

>
> > > A few fixes have been discussed. One way would be to allocate
> > > the space for the page when it is first faulted into reality and
> > > deliver SIGBUS if backing store for it could not be allocated.
> >
> > I am not sure how the user program would handle that signal...
> >
> > >
> > > Ayup. MAP_SHARED is a crock. If you want to write to a file, use write().
> > >
> > > View MAP_SHARED as a tool by which separate processes can attach
> > > to some shared memory which is identified by the filesystem namespace.
> > > It's not a very good way of performing I/O.
> >
> > That is exactly the case for the vmware ram file. VMware only uses it
> > to share memory; those files hold the virtual machines' memory. We
> > don't want to write it back to disk, and we don't care what is left on
> > the file system, because when vmware exits we throw the guest ram data
> > away, just as a real machine loses its ram at power off. We are not
> > talking about a machine using flash ram :-).
> >
> > It is kswapd that tries to flush the data, so it should take
> > responsibility for handling the error. If the writeback fails, one
> > thing it should do is keep the page dirty. At least it should not
> > corrupt memory like that.
> >
> > If we can deliver the error to the user program, that would be a plus.
> > But this needs to be fixed first.
>
> As said, this cannot be fixed easily in the kernel, or it would be trivial to
> lock up a machine by filling the fs, changing the i_size of a file and
> marking all the ram in the machine dirty in the hole; the vm must be allowed

Yes, but even nowadays it is able to lock up the machine by doing that.

Try the bigmm test program I attach to this mail. It simulates vmware's
memory mapping. It can easily lock up the machine even though there is
enough disk space.

See the comment in the source for the parameters. Basically, if you
want 3 virtual machines, each with 2 processes, using 1G of ram each,
you can do:

bigmm -i 3 -t 2 -c 1024

I ran it on two smp machines, 4G and 8G. Both can deadlock if I mmap
enough memory.

I haven't tried it on the latest kernel yet, but last time I tried it,
it worked every time: I had to reset the machine. That is with the ram
file created on a normal file system.

But if I create it on /dev/shm, the kernel can correctly kill
some of the processes and free the memory.

Be prepared to reset the machine if you try this, you have been warned :-)
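
bigmm.c itself is attached rather than inlined. Purely as an invented
illustration (not the actual attachment), a stress tool along these
lines might look like:

        #include <fcntl.h>
        #include <stdio.h>
        #include <string.h>
        #include <sys/mman.h>
        #include <sys/wait.h>
        #include <unistd.h>

        /* Invented sketch of a bigmm-style test: -i instances, -t
         * processes per instance, -c megabytes of MAP_SHARED "ram
         * file" per instance, i.e. "bigmm -i 3 -t 2 -c 1024". */
        static void hog(int instance, size_t bytes)
        {
                char path[64];
                snprintf(path, sizeof(path), "/tmp/ram.%d", instance);
                int fd = open(path, O_RDWR | O_CREAT, 0600);
                ftruncate(fd, bytes);   /* sparse, like the ram file */
                char *p = mmap(NULL, bytes, PROT_READ | PROT_WRITE,
                               MAP_SHARED, fd, 0);
                for (;;)
                        memset(p, 0xaa, bytes); /* keep dirtying pages */
        }

        int main(void)
        {
                int i, t;

                for (i = 0; i < 3; i++)                 /* -i 3 */
                        for (t = 0; t < 2; t++)         /* -t 2 */
                                if (fork() == 0)
                                        hog(i, 1024UL << 20); /* -c 1024 */
                while (wait(NULL) > 0)
                        ;
                return 0;
        }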


> to discard those pages, invalidating those posted writes. At least
> until a true solution is available you should change vmware to
> preallocate the file; then it will work fine because you will catch the
> ENOSPC error during the preallocation. If you work on shmfs that will be
> very quick indeed.

Yes, shmfs seems to be the only choice so far.

Chris


Attachments:
bigmm.c (2.25 kB)

2002-10-24 19:19:15

by Andrew Morton

Subject: Re: writepage return value check in vmscan.c

Rik van Riel wrote:
>
> On Thu, 24 Oct 2002, Andrea Arcangeli wrote:
>
> > Unfortunately I see no way around it, and patching the kernel to loop
> > forever on dirty pages that may never be possible to write doesn't look
> > safe. You could check the free space on the fs and bug the user if it
> > has less than 2G free (still, that's not 100% reliable, it's a racy
> > check; but you could also add a 100% reliable option that slows down the
> > startup of the vm but guarantees no corruption can happen).
>
> We need space allocation. Not just for this (probably rare) case,
> but also for the more generic optimisation of delayed allocation.
>

Well that's certainly the other option. Dig out the old
a_ops->reservepage stuff.

It _was_ Halloween 2003, wasn't it??

It would only work for filesystems which implement reservation
though, and iirc there were nasty problems doing delayed
allocation in, for example, ext3. I guess ext3 would have to
reserve journal space as well as disk space. ext2 delayed
allocation was pretty straightforward though, and most of
the infrastructure which it needed is there now. The replacement
of buffer-based writeback with page-based, mainly.

2002-10-24 20:40:14

by Andrew Morton

Subject: Re: writepage return value check in vmscan.c

[email protected] wrote:
>
> ...
> See the comment in the source for the parameters. Basically, if you
> want 3 virtual machines, each with 2 processes, using 1G of ram each,
> you can do:
>
> bigmm -i 3 -t 2 -c 1024
>
> I ran it on two smp machines, 4G and 8G. Both can deadlock if I mmap
> enough memory.
>

Are you sure it's a deadlock? A large MAP_SHARED load like this
on a 2.4 highmem machine can go into a spin, but it will come back
to life after several minutes.

2002-10-24 20:52:06

by Andrea Arcangeli

Subject: Re: writepage return value check in vmscan.c

On Thu, Oct 24, 2002 at 12:15:32PM -0700, [email protected] wrote:
> Yes, but even nowadays it is able to lock up the machine by doing that.
>
> Try the bigmm test program I attach to this mail. It simulates vmware's
> memory mapping. It can easily lock up the machine even though there is
> enough disk space.
>
> See the comment in the source for the parameters. Basically, if you
> want 3 virtual machines, each with 2 processes, using 1G of ram each,
> you can do:
>
> bigmm -i 3 -t 2 -c 1024
>
> I ran it on two smp machines, 4G and 8G. Both can deadlock if I mmap
> enough memory.

I ran the above command on my laptop with 256M of ram and 1G of swap,
with kde running (though idle), and the task was correctly killed:

Oct 24 22:29:32 x30 kernel: VM: killing process bigmm

The machine never deadlocked. It's probably one of the oom deadlocks
that I fixed in my 2.4 -aa tree and that the oom killer heuristic in
mainline cannot figure out. Please try to reproduce it with
2.4.20pre11aa1. Thanks.

> Prepare to reset the machine if you try that, you have been warned :-)

If you're running an oom-deadlock-prone kernel.

> > to discard those pages, invalidating those posted writes. At least
> > until a true solution is available you should change vmware to
> > preallocate the file; then it will work fine because you will catch the
> > ENOSPC error during the preallocation. If you work on shmfs that will be
> > very quick indeed.
>
> Yes, shmfs seems to be the only choice so far.

Agreed.

Andrea

2002-10-24 21:10:57

by Christopher Li

Subject: Re: writepage return value check in vmscan.c

On Thu, Oct 24, 2002 at 10:41:08PM +0200, Andrea Arcangeli wrote:
> On Thu, Oct 24, 2002 at 12:15:32PM -0700, [email protected] wrote:
> > Yes, but even nowadays it is able to lock up the machine by doing that.
> >
> > Try the bigmm test program I attach to this mail. It simulates vmware's
> > memory mapping. It can easily lock up the machine even though there is
> > enough disk space.
> >
> > See the comment in the source for the parameters. Basically, if you
> > want 3 virtual machines, each with 2 processes, using 1G of ram each,
> > you can do:
> >
> > bigmm -i 3 -t 2 -c 1024
> >
> > I ran it on two smp machines, 4G and 8G. Both can deadlock if I mmap
> > enough memory.
>
> I ran the above command on my laptop with 256M of ram and 1G of swap,
> with kde running (though idle), and the task was correctly killed:

That is probably too little ram to start with. What can you expect
when asking for 3G on a 1/4G machine?

>
> Oct 24 22:29:32 x30 kernel: VM: killing process bigmm
>
> The machine never deadlocked. It's probably one of the oom deadlocks
> that I fixed in my 2.4 -aa tree and that the oom killer heuristic in
> mainline cannot figure out. Please try to reproduce it with
> 2.4.20pre11aa1. Thanks.

I will definitely try it when I can use that big-memory machine again.
Other people are running (windows) tests on it right now.

>
> > Be prepared to reset the machine if you try this, you have been warned :-)
>
> If you're running an oom-deadlock-prone kernel.

When it dies, it is not a deadlock though. The hard disk keeps
spinning. It looks like an endless loop in swapping, not responding to
anything else. But one thing is for sure: OOM did not kill it
correctly.

Chris


2002-10-24 21:16:25

by Christopher Li

Subject: Re: writepage return value check in vmscan.c

On Thu, Oct 24, 2002 at 01:46:19PM -0700, Andrew Morton wrote:
> [email protected] wrote:
> >
> > ...
> > See the comment in the source for the parameters. Basically, if you
> > want 3 virtual machines, each with 2 processes, using 1G of ram each,
> > you can do:
> >
> > bigmm -i 3 -t 2 -c 1024
> >
> > I ran it on two smp machines, 4G and 8G. Both can deadlock if I mmap
> > enough memory.
> >
>
> Are you sure it's a deadlock? A large MAP_SHARED load like this

Deadlock is the wrong word. Its hard disk keeps spinning and it does
not respond to anything.

> on a 2.4 highmem machine can go into a spin, but it will come back
> to life after several minutes.

No, it will not come back to life, at least not after several minutes.
And there is no sign that it is going to come back to life.

Chris


2002-10-24 21:23:46

by Andrew Morton

Subject: Re: writepage return value check in vmscan.c

[email protected] wrote:
>
> > on a 2.4 highmem machine can go into a spin, but it will come back
> > to life after several minutes.
>
> No, it will not come back to life, at least not after several minutes.
> And there is no sign that it is going to come back to life.

A 2.5G machine would, iirc, spin for 3-5 minutes.

Umm, probably the time would increase somewhat exponentially
with memory size so yes, you could be in for a very long wait.

-ac kernels have an lru per zone and so would not be bitten
by this failure. If indeed you are striking this problem,
which is described at
http://mail.nl.linux.org/linux-mm/2002-08/msg00049.html

2002-10-25 16:14:55

by Paul Larson

Subject: Re: writepage return value check in vmscan.c

On Thu, 2002-10-24 at 16:29, Andrew Morton wrote:
> -ac kernels have an lru per zone and so would not be bitten
> by this failure. If indeed you are striking this problem,
> which is described at
> http://mail.nl.linux.org/linux-mm/2002-08/msg00049.html
Is it the 2.4 or 2.5 (or both) ac kernels that have the per zone lru? I
have some stuff I'd like to try with that.

Thanks,
Paul Larson

2002-10-25 16:25:51

by Christoph Hellwig

Subject: Re: writepage return value check in vmscan.c

On Fri, Oct 25, 2002 at 11:11:41AM -0500, Paul Larson wrote:
> On Thu, 2002-10-24 at 16:29, Andrew Morton wrote:
> > -ac kernels have an lru per zone and so would not be bitten
> > by this failure. If indeed you are striking this problem,
> > which is described at
> > http://mail.nl.linux.org/linux-mm/2002-08/msg00049.html
> Is it the 2.4 or 2.5 (or both) ac kernels that have the per zone lru? I
> have some stuff I'd like to try with that.

2.4-rmap, 2.4-ac and any 2.5 (including -ac).

2002-10-25 17:01:03

by Rik van Riel

Subject: Re: writepage return value check in vmscan.c

On 25 Oct 2002, Paul Larson wrote:
> On Thu, 2002-10-24 at 16:29, Andrew Morton wrote:
> > http://mail.nl.linux.org/linux-mm/2002-08/msg00049.html
> Is it the 2.4 or 2.5 (or both) ac kernels that have the per zone lru?
> I have some stuff I'd like to try with that.

2.4-rmap
2.4-ac
2.5 all

cheers,

Rik
--
Bravely reimplemented by the knights who say "NIH".
http://www.surriel.com/ http://distro.conectiva.com/

2002-10-25 18:38:08

by Andrew Morton

Subject: Re: writepage return value check in vmscan.c

[email protected] wrote:
>
> bigmm -i 3 -t 2 -c 1024

That's a nice little box killer you have there.

With mem=4G, running bigmm -i 5 -t 2 -c 1024:

2.4.19: Ran for a few minutes, got slower and slower and
eventually stopped. kupdate had taken 30 seconds CPU and
all CPUs were spinning in shrink_cache(). Had to reset.

2.4.20-pre8-ac1: Ran for a minute, froze up for a couple of
minutes then recovered and remained comfortable.

2.5.44-mm5: had a few 0.5-second stalls in vmstat output, no
other problems.

It's probably the list search failure, but I can't say for sure
at this time.

2002-10-28 08:22:29

by Christoph Rohland

Subject: Re: writepage return value check in vmscan.c

On Thu, 24 Oct 2002, [email protected] wrote:
> Yes, shmfs seems to be the only choice so far.

So why don't you use Posix or SYSV shared mem?

Greetings
Christoph


2002-10-28 18:36:55

by Christopher Li

Subject: Re: writepage return value check in vmscan.c

They are the same as shmfs to the linux kernel. Why does vmware not
use them in the first place? It is possibly due to some historical
reason.

BTW, I have another question. For an 8G memory machine, do we need to
set up 16G of swap space? Thinking about the time it takes to write 16G
of data, does it still make sense for swap space to be twice as big as
memory?

And a swap partition has a 2G limit, so we need to set up 8 swap
partitions if we want 16G of swap.

Thanks

Chris

On Mon, Oct 28, 2002 at 09:28:22AM +0100, Christoph Rohland wrote:
> On Thu, 24 Oct 2002, [email protected] wrote:
> > Yes, shmfs seems to be the only choice so far.
>
> So why don't you use Posix or SYSV shared mem?
>
> Greetings
> Christoph
>
>


2002-10-28 19:22:26

by Christopher Li

Subject: Re: writepage return value check in vmscan.c

On Mon, Oct 28, 2002 at 08:22:14PM +0100, Andrea Arcangeli wrote:
>
> Swap space doesn't need to be twice as big as ram. That was fixed long
> ago.
>
> Swap+ram is the total amount of virtual memory that you can use in
> vmware.

Cool.

>
> >
> > And a swap partition has a 2G limit, so we need to set up 8 swap
> > partitions if we want 16G of swap.
>
> That's a silly restriction of mkswap; the kernel doesn't care, it can
> handle way more than 2G (however, there's a high bound at some
> impractical level; to be safe the mathematical limit should be
> re-encoded in mkswap, and of course it changes for every arch because
> the pte layout is different).

Thanks

Chris



2002-10-28 19:10:16

by Christopher Li

Subject: Re: writepage return value check in vmscan.c

On Fri, Oct 25, 2002 at 11:44:14AM -0700, Andrew Morton wrote:
> [email protected] wrote:
> >
> > bigmm -i 3 -t 2 -c 1024
>
> That's a nice little box killer you have there.

Thanks. It kills all our customers' kernels; they don't use the
bleeding edge kernel at all. It is interesting to see vmware
serve as a heavy-load stress test tool. It gives a real-world
load to the OS, e.g. the load needed to boot a windows guest. You
can stack many of them to abuse the OS.

>
> With mem=4G, running bigmm -i 5 -t 2 -c 1024:
>
> 2.4.19: Ran for a few minutes, got slower and slower and
> eventually stopped. kupdate had taken 30 seconds CPU and
> all CPUs were spinning in shrink_cache(). Had to reset.
>
> 2.4.20-pre8-ac1: Ran for a minute, froze up for a couple of
> minutes then recovered and remained comfortable.

How many instances of bigmm were left there? There should be 10 bigmm
processes before oom kicks in.

>
> 2.5.44-mm5: had a few 0.5-second stalls in vmstat output, no
> other problems.
>
> It's probably the list search failure, but I can't say for sure
> at this time.

Chris


2002-10-28 19:16:15

by Andrea Arcangeli

Subject: Re: writepage return value check in vmscan.c

On Mon, Oct 28, 2002 at 10:44:20AM -0800, [email protected] wrote:
> They are the same as shmfs to the linux kernel. Why does vmware not
> use them in the first place? It is possibly due to some historical
> reason.
>
> BTW, I have another question. For an 8G memory machine, do we need to
> set up 16G of swap space? Thinking about the time it takes to write 16G
> of data, does it still make sense for swap space to be twice as big as
> memory?

Swap space doesn't need to be twice as big as ram. That was fixed long
ago.

Swap+ram is the total amount of virtual memory that you can use in
vmware.

>
> And a swap partition has a 2G limit, so we need to set up 8 swap
> partitions if we want 16G of swap.

That's a silly restriction of mkswap; the kernel doesn't care, it can
handle way more than 2G (however, there's a high bound at some
impractical level; to be safe the mathematical limit should be
re-encoded in mkswap, and of course it changes for every arch because
the pte layout is different).

Andrea

2002-10-28 19:51:02

by Christopher Li

Subject: Re: writepage return value check in vmscan.c

Hi Andrea,

On Thu, Oct 24, 2002 at 08:33:27PM +0200, Andrea Arcangeli wrote:
> >
> > That is exactly the case for the vmware ram file. VMware only uses it
> > to share memory; those files hold the virtual machines' memory. We
> > don't want to write it back to disk, and we don't care what is left on
> > the file system, because when vmware exits we throw the guest ram data
> > away, just as a real machine loses its ram at power off. We are not
> > talking about a machine using flash ram :-).
> >
> > It is kswapd that tries to flush the data, so it should take
> > responsibility for handling the error. If the writeback fails, one
> > thing it should do is keep the page dirty. At least it should not
> > corrupt memory like that.
> >
> > If we can deliver the error to the user program, that would be a plus.
> > But this needs to be fixed first.
>
> As said, this cannot be fixed easily in the kernel, or it would be trivial to
> lock up a machine by filling the fs, changing the i_size of a file and
> marking all the ram in the machine dirty in the hole; the vm must be allowed
> to discard those pages, invalidating those posted writes. At least
> until a true solution is available you should change vmware to
> preallocate the file; then it will work fine because you will catch the
> ENOSPC error during the preallocation. If you work on shmfs that will be
> very quick indeed.

I still think throwing a process's page away if the write fails is bad.
If the kernel drops the data, that process is not able to run correctly
anyway. Why not keep the page and let the oom killer pick a
process to kill? That way we at least have some processes able to run
correctly, instead of every process that hits the out-of-disk-space
condition having some bad data.

Cheers

Chris



2002-10-28 19:47:20

by Andrew Morton

Subject: Re: writepage return value check in vmscan.c

[email protected] wrote:
>
> On Fri, Oct 25, 2002 at 11:44:14AM -0700, Andrew Morton wrote:
> > [email protected] wrote:
> > >
> > > bigmm -i 3 -t 2 -c 1024
> >
> > That's a nice little box killer you have there.
>
> Thanks. It kills all our customers' kernels; they don't use the
> bleeding edge kernel at all. It is interesting to see vmware
> serve as a heavy-load stress test tool. It gives a real-world
> load to the OS, e.g. the load needed to boot a windows guest. You
> can stack many of them to abuse the OS.

I tested Andrea's latest kernel. It survived.

Probably because it left 100 megabytes of lowmem unallocated
throughout the test.

> >
> > With mem=4G, running bigmm -i 5 -t 2 -c 1024:
> >
> > 2.4.19: Ran for a few minutes, got slower and slower and
> > eventually stopped. kupdate had taken 30 seconds CPU and
> > all CPUs were spinning in shrink_cache(). Had to reset.
> >
> > 2.4.20-pre8-ac1: Ran for a minute, froze up for a couple of
> > minutes then recovered and remained comfortable.
>
> How many instances of bigmm were left there? There should be 10 bigmm
> processes before oom kicks in.

Well, they should all be left running? All this memory has
file-backing, and is easily reclaimable.

umm, yes. There could be bogus oom-killings in the combined-LRU
VMs. But I saw none in testing.



All of which is great fun, but it leaves open the question "what
the heck can vmware do about it". I wish there was a clear answer.

If the customer is running a suse/UL kernel they're presumably OK.

If their kernel comes from kernel.org they should add Andrea's patch.
Which means they get an absolute boatload of stuff which they may
not want:
1223 files changed, 306053 insertions(+), 9655 deletions(-)
but that kernel performs well.

If they're running an RH-rmap kernel then they're probably okayish,
although I'd recommend more testing there.

If they're running an RHAS-style kernel then I do not know. It may
fail.

2002-10-28 20:31:23

by Christopher Li

Subject: Re: writepage return value check in vmscan.c

On Mon, Oct 28, 2002 at 11:53:28AM -0800, Andrew Morton wrote:
> [email protected] wrote:
> >
> > On Fri, Oct 25, 2002 at 11:44:14AM -0700, Andrew Morton wrote:
> > > [email protected] wrote:
> > > >
> > > > bigmm -i 3 -t 2 -c 1024
> > >
> > > That's a nice little box killer you have there.
> >
>
> I tested Andrea's latest kernel. It survived.

Great. I can try the hundred-win2k-vm test on linux some day and
see what happens.

>
> Probably because it left 100 megabytes of lowmem unallocated
> throughout the test.
>
> > >
> > > With mem=4G, running bigmm -i 5 -t 2 -c 1024:
> > >
> > > 2.4.19: Ran for a few minutes, got slower and slower and
> > > eventually stopped. kupdate had taken 30 seconds CPU and
> > > all CPUs were spinning in shrink_cache(). Had to reset.
> > >
> > > 2.4.20-pre8-ac1: Ran for a minute, froze up for a couple of
> > > minutes then recovered and remained comfortable.
> >
> > How many instances of bigmm were left there? There should be 10 bigmm
> > processes before oom kicks in.
>
> Well, they should all be left running? All this memory has
> file-backing, and is easily reclaimable.

In my experiments, some of the bigmm processes did get killed by the oom
killer.

>
> umm, yes. There could be bogus oom-killings in the combined-LRU
> VMs. But I saw none in testing.
>
>
>
> All of which is great fun, but it leaves open the question "what
> the heck can vmware do about it". I wish there was a clear answer.

We told them to put the ram file on /dev/shm and prepare a large
swap space, and to be gentle to linux: don't push the vms' total ram
beyond what is on the machine. It is running OK so far.

We did try to recommend some newer SuSE or Redhat kernels. They did
not work at that time, but I am not surprised at all; they were
something like 2.4.7 based. We have had this problem on linux for some
time now.

>
> If the customer is running a suse/UL kernel they're presumably OK.
>
> If their kernel comes from kernel.org they should add Andrea's patch.
> Which means they get an absolute boatload of stuff which they may
> not want:
> 1223 files changed, 306053 insertions(+), 9655 deletions(-)
> but that kernel performs well.

Patching the kernel and asking them to build it is not an option at
all. They just want something reliable. They trust most major linux
distributions' kernels, just not a home-made one.

>
> If they're running an RH-rmap kernel then they're probably okayish,
> although I'd recommend more testing there.

That I don't know.

>
> If they're running an RHAS-style kernel then I do not know. It may
> fail.

Chris


2002-10-28 21:08:17

by Andrea Arcangeli

Subject: Re: writepage return value check in vmscan.c

On Mon, Oct 28, 2002 at 11:53:28AM -0800, Andrew Morton wrote:
> [email protected] wrote:
> >
> > On Fri, Oct 25, 2002 at 11:44:14AM -0700, Andrew Morton wrote:
> > > [email protected] wrote:
> > > >
> > > > bigmm -i 3 -t 2 -c 1024
> > >
> > > That's a nice little box killer you have there.
> >
> > Thanks. It kills all our customers' kernels; they don't use the
> > bleeding edge kernel at all. It is interesting to see vmware
> > serve as a heavy-load stress test tool. It gives a real-world
> > load to the OS, e.g. the load needed to boot a windows guest. You
> > can stack many of them to abuse the OS.
>
> I tested Andrea's latest kernel. It survived.
>
> Probably because it left 100 megabytes of lowmem unallocated
> throughout the test.

That's unrelated to the vm code though (in terms of the per-zone lru
mentioned by Andrew for 2.5, which in turn breaks all the aging
information compared to 2.4); that is meant to definitively fix another
highmem imbalance bug where all the lowmem could be pinned and made
unfreeable by lowmem users despite plenty of highmem still being
available. That's the fix for the google mlock deadlock. Mainline has a
weak attempt to fix it in another manner, but I backed it out since
it's too weak to be effective in real workloads (it didn't fix the
problem in real life), and I instead kept my first approach, which is
THE definitive fix. Btw, 2.5 still has the weak approach, so it's still
subject to the google bug; I will fix it in my tree while moving to 2.5.

> If the customer is running a suse/UL kernel they're presumably OK.

Right.

Andrea

(PS, about your previous comment on the swap needed: kernels <= 2.4.9
(9 != 19) also have the problem that ram+swap isn't all the vm that
vmware can use; for them you need double the swap.)

2002-10-28 21:26:17

by Andrea Arcangeli

Subject: Re: writepage return value check in vmscan.c

On Mon, Oct 28, 2002 at 11:58:31AM -0800, [email protected] wrote:
> Hi Andrea,
>
> On Thu, Oct 24, 2002 at 08:33:27PM +0200, Andrea Arcangeli wrote:
> > >
> > > That is exactly the case for the vmware ram file. VMware only uses it
> > > to share memory; those files hold the virtual machines' memory. We
> > > don't want to write it back to disk, and we don't care what is left on
> > > the file system, because when vmware exits we throw the guest ram data
> > > away, just as a real machine loses its ram at power off. We are not
> > > talking about a machine using flash ram :-).
> > >
> > > It is kswapd that tries to flush the data, so it should take
> > > responsibility for handling the error. If the writeback fails, one
> > > thing it should do is keep the page dirty. At least it should not
> > > corrupt memory like that.
> > >
> > > If we can deliver the error to the user program, that would be a plus.
> > > But this needs to be fixed first.
> >
> > As said, this cannot be fixed easily in the kernel, or it would be trivial to
> > lock up a machine by filling the fs, changing the i_size of a file and
> > marking all the ram in the machine dirty in the hole; the vm must be allowed
> > to discard those pages, invalidating those posted writes. At least
> > until a true solution is available you should change vmware to
> > preallocate the file; then it will work fine because you will catch the
> > ENOSPC error during the preallocation. If you work on shmfs that will be
> > very quick indeed.
>
> I still think throwing a process's page away if the write fails is bad.
> If the kernel drops the data, that process is not able to run correctly
> anyway. Why not keep the page and let the oom killer pick a
> process to kill? That way we at least have some processes able to run
> correctly, instead of every process that hits the out-of-disk-space
> condition having some bad data.

The reason it isn't easily feasible is that you can learn that the
writepage failed long after the process stopped mapping the page, and
we can't keep unowned unwriteable dirty pages around for long; we have
to drop them. If the vm tried to writepage, it means not one single
task had the page mapped, so at the time of the failure we have no clue
who to notify for that page. All the tasks just think they have written
the data successfully once the pte dirty bit has been transmitted to
the page dirty bit; by the time the page dirty bit is lost because
writepage fails, it's too late to let the task know about it; the task
may have exited long ago. We lose track of the task when we transmit
the dirty information from the pagetable to the page, and only after
that happens do we attempt a writepage. So as far as I can tell, the
only way to fix it and to, for example, reliably send a signal to a
task to notify it that a write has been lost, is to run the fs
get_block(create = 1) while transmitting the dirty bit from the pte_t
to the page_t, which isn't a trivial change, certainly not something
that would be comfortable to do in 2.4, and it would affect
performance; it is much cleaner and more efficient to deal with the fs
only at the page_t physical pagecache layer rather than at the pte
layer. Leaving unowned, unfreeable pages around by marking the page
dirty when block_write_full_page fails doesn't look like a viable
option; the kernel could do nothing but loop forever trying to write in
such a case.
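
To make the loss of ownership concrete: the unmap path folds the
per-task pte dirty bit into the anonymous struct page, roughly like
this (a simplified sketch, not verbatim 2.4 code):

        /* The pte dirty bit, which still belongs to a mapping task,
         * is folded into the struct page dirty bit, which belongs to
         * no task at all.  From this point on, a writepage() failure
         * has no owner left to notify. */
        static void transfer_dirty(struct page *page, pte_t pte)
        {
                if (pte_dirty(pte))
                        set_page_dirty(page);   /* task identity lost */
                /* pte is cleared; writepage() may run long afterwards */
        }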

Andrea

2002-10-29 06:08:03

by Randy.Dunlap

Subject: Re: writepage return value check in vmscan.c

On Mon, 28 Oct 2002, Andrea Arcangeli wrote:

| On Mon, Oct 28, 2002 at 10:44:20AM -0800, [email protected] wrote:
| > They are the same as shmfs to the linux kernel. Why does vmware not
| > use them in the first place? It is possibly due to some historical
| > reason.
| >
| > BTW, I have another question. For an 8G memory machine, do we need to
| > set up 16G of swap space? Thinking about the time it takes to write 16G
| > of data, does it still make sense for swap space to be twice as big as
| > memory?
|
| Swap space doesn't need to be twice as big as ram. That was fixed long
| ago.
|
| Swap+ram is the total amount of virtual memory that you can use in
| vmware.
|
| > And a swap partition has a 2G limit, so we need to set up 8 swap
| > partitions if we want 16G of swap.
|
| That's a silly restriction of mkswap; the kernel doesn't care, it can
| handle way more than 2G (however, there's a high bound at some
| impractical level; to be safe the mathematical limit should be
| re-encoded in mkswap, and of course it changes for every arch because
| the pte layout is different).

Heh, you hit one of my personal todo list items (larger swap spaces :),
so I'll be looking into it, or trying to help anyone else on it
if they want it.

--
~Randy

2002-10-29 07:05:14

by Andreas Dilger

Subject: Re: writepage return value check in vmscan.c

On Oct 28, 2002 22:10 -0800, Randy.Dunlap wrote:
> On Mon, 28 Oct 2002, Andrea Arcangeli wrote:
> | That's a silly restriction of mkswap; the kernel doesn't care, it can
> | handle way more than 2G (however, there's a high bound at some
> | impractical level; to be safe the mathematical limit should be
> | re-encoded in mkswap, and of course it changes for every arch because
> | the pte layout is different).
>
> Heh, you hit one of my personal todo list items (larger swap spaces :),
> so I'll be looking into it, or trying to help anyone else on it
> if they want it.

If you start playing with the swap code, could you please change the
on-disk swap struct definition to look like:

union swap_header {
        :
        :
        struct {
                char    bootbits[1024]; /* Space for disklabel etc. */
                __u32   version;
                __u32   last_page;
                __u32   nr_badpages;
                char    volume_label[16];
                __u32   padding[121];
                __u32   badpages[1];
        } info;
};

1) change all of the "int" definitions in info to be __u32, because this
is written to disk and we want the sizes to be unambiguous
2) the volume label field has been previously discussed and doesn't
impose any compatibility problems, but allows one to "swapon by label"
(old patch URL at http://user.it.uu.se/~mikpe/linux/swap-label/)

Cheers, Andreas
--
Andreas Dilger
http://www-mddsp.enel.ucalgary.ca/People/adilger/
http://sourceforge.net/projects/ext2resize/

2002-10-30 04:06:44

by Christopher Li

Subject: Re: writepage return value check in vmscan.c

On Mon, Oct 28, 2002 at 10:32:25PM +0100, Andrea Arcangeli wrote:

> The reason it isn't easily feasible is that you can learn that the
> writepage failed long after the process stopped mapping the page, and

Hmm, nice to know that. When I ran the bigmm test, I saw kswapd call
writepage. Are you telling me that at that point bigmm was already
dead? I don't know enough about the vm system. If the program does not
map this memory any more, it is OK to drop it; VMware does not care
about what is left in the ram file at all.

Thanks for your long explanation.

Chris

> we can't keep unowned unwriteable dirty pages around for long; we have
> to drop them. If the vm tried to writepage, it means not one single
> task had the page mapped, so at the time of the failure we have no clue
> who to notify for that page. All the tasks just think they have written
> the data successfully once the pte dirty bit has been transmitted to
> the page dirty bit; by the time the page dirty bit is lost because
> writepage fails, it's too late to let the task know about it; the task
> may have exited long ago. We lose track of the task when we transmit
> the dirty information from the pagetable to the page, and only after
> that happens do we attempt a writepage. So as far as I can tell, the
> only way to fix it and to, for example, reliably send a signal to a
> task to notify it that a write has been lost, is to run the fs
> get_block(create = 1) while transmitting the dirty bit from the pte_t
> to the page_t, which isn't a trivial change, certainly not something
> that would be comfortable to do in 2.4, and it would affect
> performance; it is much cleaner and more efficient to deal with the fs
> only at the page_t physical pagecache layer rather than at the pte
> layer. Leaving unowned, unfreeable pages around by marking the page
> dirty when block_write_full_page fails doesn't look like a viable
> option; the kernel could do nothing but loop forever trying to write in
> such a case.
>
> Andrea