2005-11-10 23:23:30

by Badari Pulavarty

Subject: [RFC] sys_punchhole()

Hi Andrew,

We discussed this in madvise(REMOVE) thread - to add support
for sys_punchhole(fd, offset, len) to complete the functionality
(in the future).

http://marc.theaimsgroup.com/?l=linux-mm&m=113036713810002&w=2

What I am wondering is, should I invest time now to do it ?
Or wait till need arises ?

My line of thought is, I would add a generic_zeroblocks_range()
function which would zero out the given range of pages and
flush to disk. Use this as the default operation if the
filesystem doesn't provide a specific function to free up
the blocks. Would this work ?
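
A rough userspace analogue of that zero-fill fallback, offered only as a
sketch: generic_zeroblocks_range() does not exist yet, and zero_range() below
is a hypothetical name. It overwrites the range with zeros and flushes, so the
data reads back as zeros but no blocks are freed.

```c
#include <assert.h>
#include <errno.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical userspace stand-in for the proposed fallback: overwrite
 * [offset, offset + len) with zeros and flush the file to disk.
 * Unlike a real hole punch, the blocks stay allocated. */
static int zero_range(int fd, off_t offset, off_t len)
{
    static const char zeros[4096];  /* static storage is zero-initialized */

    assert(offset >= 0 && len >= 0);
    while (len > 0) {
        size_t chunk = len < (off_t)sizeof(zeros) ? (size_t)len : sizeof(zeros);
        ssize_t n = pwrite(fd, zeros, chunk, offset);

        if (n < 0) {
            if (errno == EINTR)
                continue;           /* interrupted before writing: retry */
            return -1;
        }
        if (n == 0) {               /* no progress: give up instead of spinning */
            errno = EIO;
            return -1;
        }
        offset += n;
        len -= n;
    }
    return fsync(fd);               /* "flush to disk" */
}
```

A filesystem-specific method would instead release the blocks, which is the
part this fallback cannot do.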

Suggestions ?

Thanks,
Badari


2005-11-10 23:32:41

by Andrew Morton

Subject: Re: [RFC] sys_punchhole()

Badari Pulavarty <[email protected]> wrote:
>
> We discussed this in madvise(REMOVE) thread - to add support
> for sys_punchhole(fd, offset, len) to complete the functionality
> (in the future).
>
> http://marc.theaimsgroup.com/?l=linux-mm&m=113036713810002&w=2
>
> What I am wondering is, should I invest time now to do it ?

I haven't even heard anyone mention a need for this in the past 1-2 years.

> Or wait till need arises ?

A long wait, I suspect..

2005-11-10 23:41:23

by Badari Pulavarty

Subject: Re: [RFC] sys_punchhole()

On Thu, 2005-11-10 at 15:32 -0800, Andrew Morton wrote:
> Badari Pulavarty <[email protected]> wrote:
> >
> > We discussed this in madvise(REMOVE) thread - to add support
> > for sys_punchhole(fd, offset, len) to complete the functionality
> > (in the future).
> >
> > http://marc.theaimsgroup.com/?l=linux-mm&m=113036713810002&w=2
> >
> > What I am wondering is, should I invest time now to do it ?
>
> I haven't even heard anyone mention a need for this in the past 1-2 years.
>
> > Or wait till need arises ?
>
> A long wait, I suspect..
>

Okay. I guess, I will wait till someone needs it.

I am just trying to increase my chances of "getting my madvise(REMOVE)
patch into mainline" :)

Thanks,
Badari

2005-11-10 23:55:11

by Anton Altaparmakov

Subject: Re: [RFC] sys_punchhole()

On Thu, 10 Nov 2005, Badari Pulavarty wrote:
> On Thu, 2005-11-10 at 15:32 -0800, Andrew Morton wrote:
> > Badari Pulavarty <[email protected]> wrote:
> > >
> > > We discussed this in madvise(REMOVE) thread - to add support
> > > for sys_punchhole(fd, offset, len) to complete the functionality
> > > (in the future).
> > >
> > > http://marc.theaimsgroup.com/?l=linux-mm&m=113036713810002&w=2
> > >
> > > What I am wondering is, should I invest time now to do it ?
> >
> > I haven't even heard anyone mention a need for this in the past 1-2 years.
> >
> > > Or wait till need arises ?
> >
> > A long wait, I suspect..
> >
>
> Okay. I guess, I will wait till someone needs it.
>
> I am just trying to increase my chances of "getting my madvise(REMOVE)
> patch into mainline" :)
>

It may be worth asking the Samba people if they want it given that Windows
has such a function (but it is not a syscall, it is a fsctl -
FSCTL_SET_ZERO_DATA), so Samba may want to have it, too...

And in case you care, NTFS already has such functionality (currently only
used in error handling) and implementing the sys_punchhole() fs-specific
function for ntfs will therefore be trivial...

Best regards,

Anton
--
Anton Altaparmakov <aia21 at cam.ac.uk> (replace at with @)
Unix Support, Computing Service, University of Cambridge, CB2 3QH, UK
Linux NTFS maintainer / IRC: #ntfs on irc.freenode.net
WWW: http://linux-ntfs.sf.net/ & http://www-stu.christs.cam.ac.uk/~aia21/

2005-11-11 05:18:49

by Arjan van de Ven

Subject: Re: [RFC] sys_punchhole()

On Thu, 2005-11-10 at 15:23 -0800, Badari Pulavarty wrote:
>
> We discussed this in madvise(REMOVE) thread - to add support
> for sys_punchhole(fd, offset, len) to complete the functionality
> (in the future).

in the past always this was said to be "really hard" in linux locking
wise, esp. the locking with respect to truncate...

did you find a solution to this problem ?
>

2005-11-11 08:25:52

by Ingo Oeser

Subject: Re: [RFC] sys_punchhole()

Hi,

On Friday 11 November 2005 00:32, Andrew Morton wrote:
> Badari Pulavarty <[email protected]> wrote:
> >
> > We discussed this in madvise(REMOVE) thread - to add support
> > for sys_punchhole(fd, offset, len) to complete the functionality
> > (in the future).
> >
> > http://marc.theaimsgroup.com/?l=linux-mm&m=113036713810002&w=2
> >
> > What I am wondering is, should I invest time now to do it ?
>
> I haven't even heard anyone mention a need for this in the past 1-2 years.

Because the people who need it are usually at the application level.
It would be useful with hard disk editing.

But this would need a move_blocks within the filesystem, which
could attach a given list of blocks to another file.

E.g. mremap() for files :-)

Both together would make harddisk video editing with linux quite
performant and less error prone.


Regards

Ingo Oeser



2005-11-11 19:07:49

by Christoph Lameter

Subject: Re: [RFC] sys_punchhole()

On Fri, 11 Nov 2005, Ingo Oeser wrote:

> > I haven't even heard anyone mention a need for this in the past 1-2 years.
>
> Because the people who need it are usually at the application level.
> It would be useful with hard disk editing.
>
> But this would need a move_blocks within the filesystem, which
> could attach a given list of blocks to another file.
>
> E.g. mremap() for files :-)

Something similar to that is included in my page migration patchsets.

It will also allow you to selectively push pages in a range out. So it
does something similar to hole punching.

For that you would scan over the range to be cleared and put the pages on
a list using isolate_lru_page(). Then do whatever you need to with the
pages. Push em out with migrate_pages(list, NULL) etc.


2005-11-13 06:11:45

by H. Peter Anvin

Subject: Re: [RFC] sys_punchhole()

Followup to: <[email protected]>
By author: Arjan van de Ven <[email protected]>
In newsgroup: linux.dev.kernel
> On Thu, 2005-11-10 at 15:23 -0800, Badari Pulavarty wrote:
> >
> > We discussed this in madvise(REMOVE) thread - to add support
> > for sys_punchhole(fd, offset, len) to complete the functionality
> > (in the future).
>
> in the past always this was said to be "really hard" in linux locking
> wise, esp. the locking with respect to truncate...
>
> did you find a solution to this problem ?

For the sake of completeness, it should probably be stated that the
utility of such a function is pretty clear.

-hpa

2005-11-16 12:08:53

by Rob Landley

Subject: Re: [RFC] sys_punchhole()

On Friday 11 November 2005 02:25, Ingo Oeser wrote:
> Hi,
>
> On Friday 11 November 2005 00:32, Andrew Morton wrote:
> > Badari Pulavarty <[email protected]> wrote:
> > > We discussed this in madvise(REMOVE) thread - to add support
> > > for sys_punchhole(fd, offset, len) to complete the functionality
> > > (in the future).

You know, if you wanted to get really really gross and disgusting about this,
you could always have write(fd, NULL, count) punch a hole in the file. (Then
have libc's write() check for NULL and error out, and have a separate punch()
call that does the write with the null...)

Just one way to avoid introducing a new syscall...
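
A minimal sketch of how that libc split might look. Both function names are
hypothetical, and no kernel treats a NULL buffer this way today, so on current
kernels punch() is expected to fail rather than punch anything.

```c
#include <errno.h>
#include <stddef.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical libc-side write(): refuse a NULL buffer so that ordinary
 * callers can never trigger the hole-punching path by accident. */
static ssize_t checked_write(int fd, const void *buf, size_t count)
{
    if (buf == NULL && count > 0) {
        errno = EFAULT;
        return -1;
    }
    return write(fd, buf, count);
}

/* Hypothetical punch(): the only entry point that deliberately hands a
 * NULL buffer down. Under the proposal this would punch a hole of
 * `count` bytes at the current file offset; that behavior is assumed,
 * not something existing kernels implement. */
static ssize_t punch(int fd, size_t count)
{
    return write(fd, NULL, count);
}
```

Note the extra NULL test on every ordinary write, which is the cost of
overloading an existing call.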

Rob

2005-11-16 12:20:45

by Andrea Arcangeli

Subject: Re: [RFC] sys_punchhole()

On Wed, Nov 16, 2005 at 06:08:18AM -0600, Rob Landley wrote:
> You know, if you wanted to get really really gross and disgusting about this,
> you could always have write(fd, NULL, count) punch a hole in the file. (Then
> have libc's write() check for NULL and error out, and have a separate punch()
> call that does the write with the null...)
>
> Just one way to avoid introducing a new syscall...

That would add an unnecessary branch in write(3). I don't think it's worth
it, we'd rather go full speed and use the syscall table for it. Plus it
sounds safer in general to keep it separate (just in case someone isn't
using glibc but some other dietlibc or similar ;)

2005-11-16 16:05:19

by Badari Pulavarty

Subject: Re: [RFC] sys_punchhole()

On Fri, 2005-11-11 at 06:18 +0100, Arjan van de Ven wrote:
> On Thu, 2005-11-10 at 15:23 -0800, Badari Pulavarty wrote:
> >
> > We discussed this in madvise(REMOVE) thread - to add support
> > for sys_punchhole(fd, offset, len) to complete the functionality
> > (in the future).
>
> in the past always this was said to be "really hard" in linux locking
> wise, esp. the locking with respect to truncate...
>
> did you find a solution to this problem ?

I have been thinking about some of the race conditions we might run into.
It's hard to think of all of them, when I really don't have any code to play
with :(

Anyway, I think race against truncate is fine. We hold i_alloc_sem -
which should serialize against truncates. This should also serialize
against DIO. Holding i_sem should take care of writers.

One concern I can think of is, racing with read(2). While we are
thrashing pagecache and calling filesystem to free up the blocks -
a read(2) could read old disk block and give old data (since it won't
find it in pagecache). This could become a security hole :(

Thanks,
Badari

2005-11-16 16:39:15

by Anton Altaparmakov

Subject: Re: [RFC] sys_punchhole()

On Wed, 16 Nov 2005, Badari Pulavarty wrote:
> On Fri, 2005-11-11 at 06:18 +0100, Arjan van de Ven wrote:
> > On Thu, 2005-11-10 at 15:23 -0800, Badari Pulavarty wrote:
> > >
> > > We discussed this in madvise(REMOVE) thread - to add support
> > > for sys_punchhole(fd, offset, len) to complete the functionality
> > > (in the future).
> >
> > in the past always this was said to be "really hard" in linux locking
> > wise, esp. the locking with respect to truncate...
> >
> > did you find a solution to this problem ?
>
> I have been thinking about some of the race conditions we might run into.
> It's hard to think of all of them, when I really don't have any code to play
> with :(
>
> Anyway, I think race against truncate is fine. We hold i_alloc_sem -
> which should serialize against truncates. This should also serialize
> against DIO. Holding i_sem should take care of writers.
>
> One concern I can think of is, racing with read(2). While we are
> thrashing pagecache and calling filesystem to free up the blocks -
> a read(2) could read old disk block and give old data (since it won't
> find it in pagecache). This could become a security hole :(

So why not tell the fs to perform the "punch" before dealing with the page
cache? If you do it in that order, a racing read(2) (or a racing mmapped
access for that matter) will see the hole, not the old data.

btw. I sometimes wonder whether it is correct for truncate to do the page
cache update before calling down into the fs for similar reasons but I
think that it is ok after all because truncate only ever converts between
(exists/hole -> does not exist) or (does not exist -> exists as
zeroes/hole) but it never deals with (exists A -> exists B/hole) which is
what sys_punchhole does. I just had to adapt the address space operations
readpage and writepage in ntfs to cope with a read/write request outside
the end of the file which does happen when a racing truncate has extended
the file's i_size but the fs has not done the necessary metadata updates
yet...

Best regards,

Anton
--
Anton Altaparmakov <aia21 at cam.ac.uk> (replace at with @)
Unix Support, Computing Service, University of Cambridge, CB2 3QH, UK
Linux NTFS maintainer / IRC: #ntfs on irc.freenode.net
WWW: http://linux-ntfs.sf.net/ & http://www-stu.christs.cam.ac.uk/~aia21/

2005-11-16 18:39:05

by Pavel Machek

Subject: Re: [RFC] sys_punchhole()

Hi!

> > We discussed this in madvise(REMOVE) thread - to add support
> > for sys_punchhole(fd, offset, len) to complete the functionality
> > (in the future).
> >
> > http://marc.theaimsgroup.com/?l=linux-mm&m=113036713810002&w=2
> >
> > What I am wondering is, should I invest time now to do it ?
>
> I haven't even heard anyone mention a need for this in the past 1-2 years.

Some database people wanted it maybe month ago. It was replaced by some
madvise hack...

--
64 bytes from 195.113.31.123: icmp_seq=28 ttl=51 time=448769.1 ms

2005-11-16 22:01:20

by Badari Pulavarty

Subject: Re: [RFC] sys_punchhole()

On Sun, 2005-11-13 at 15:09 +0000, Pavel Machek wrote:
> Hi!
>
> > > We discussed this in madvise(REMOVE) thread - to add support
> > > for sys_punchhole(fd, offset, len) to complete the functionality
> > > (in the future).
> > >
> > > http://marc.theaimsgroup.com/?l=linux-mm&m=113036713810002&w=2
> > >
> > > What I am wondering is, should I invest time now to do it ?
> >
> > I haven't even heard anyone mention a need for this in the past 1-2 years.
>
> Some database people wanted it maybe month ago. It was replaced by some
> madvise hack...

Hmm. Someone other than me asking for it ?

I did the madvise() hack and am asking to see if anyone really needs
sys_punchhole().

Thanks,
Badari

2005-11-16 23:38:36

by Ric Wheeler

Subject: Re: [RFC] sys_punchhole()

Badari Pulavarty wrote:

> On Sun, 2005-11-13 at 15:09 +0000, Pavel Machek wrote:
> > Hi!
> >
> > > > We discussed this in madvise(REMOVE) thread - to add support
> > > > for sys_punchhole(fd, offset, len) to complete the functionality
> > > > (in the future).
> > > >
> > > > http://marc.theaimsgroup.com/?l=linux-mm&m=113036713810002&w=2
> > > >
> > > > What I am wondering is, should I invest time now to do it ?
> > >
> > > I haven't even heard anyone mention a need for this in the past 1-2 years.
> >
> > Some database people wanted it maybe month ago. It was replaced by some
> > madvise hack...
>
> Hmm. Someone other than me asking for it ?
>
> I did the madvise() hack and am asking to see if anyone really needs
> sys_punchhole().
>
> Thanks,
> Badari

I think that sys_punchhole() would be useful for some object based
storage systems.

Specifically, when you have a box that is trying to store potentially a
billion objects on one file system, pushing several objects into a file
("container") can be useful to keep the object count down. The punch
hole would be useful in reclaiming space in this type of scheme.

On the other side of the argument, you can argue that file systems that
support large file counts and really big directories should perform well
enough to make this use case less important.

ric


2005-11-18 16:43:12

by Ragnar Kjørstad

Subject: Re: [RFC] sys_punchhole()

On Sun, Nov 13, 2005 at 03:09:06PM +0000, Pavel Machek wrote:
> > > We discussed this in madvise(REMOVE) thread - to add support
> > > for sys_punchhole(fd, offset, len) to complete the functionality
> > > (in the future).
> > >
> > > http://marc.theaimsgroup.com/?l=linux-mm&m=113036713810002&w=2
> > >
> > > What I am wondering is, should I invest time now to do it ?
> >
> > I haven't even heard anyone mention a need for this in the past 1-2 years.
>
> Some database people wanted it maybe month ago. It was replaced by some
> madvise hack...


sys_punchhole is also potentially very useful for Hierarchical Storage
Management. (Holes are typically used for data that have been migrated
to tape).




--
Ragnar Kjørstad

2005-11-18 16:55:20

by Badari Pulavarty

Subject: Re: [RFC] sys_punchhole()

On Fri, 2005-11-18 at 17:42 +0100, Ragnar Kjørstad wrote:
> On Sun, Nov 13, 2005 at 03:09:06PM +0000, Pavel Machek wrote:
> > > > We discussed this in madvise(REMOVE) thread - to add support
> > > > for sys_punchhole(fd, offset, len) to complete the functionality
> > > > (in the future).
> > > >
> > > > http://marc.theaimsgroup.com/?l=linux-mm&m=113036713810002&w=2
> > > >
> > > > What I am wondering is, should I invest time now to do it ?
> > >
> > > I haven't even heard anyone mention a need for this in the past 1-2 years.
> >
> > Some database people wanted it maybe month ago. It was replaced by some
> > madvise hack...
>
>
> sys_punchhole is also potentially very useful for Hierarchical Storage
> Management. (Holes are typically used for data that have been migrated
> to tape).

I agree. But I am not interested in adding a whole lot of complexity to
the kernel just because of some "potential" use for this. I want to know
which people/products really, really need this feature, and what their
requirements are, before I go down that path.

For that matter, HSM folks really care about DMAPI. But I never got
them to explicitly tell me what minimum subset of interfaces
they *absolutely* need (and why) out of the whole DMAPI spec :( I always
hear complaints about not having DMAPI.

Thanks,
Badari

2005-11-21 08:31:56

by Rob Landley

Subject: Re: [RFC] sys_punchhole()

On Wednesday 16 November 2005 16:01, Badari Pulavarty wrote:
> Hmm. Someone other than me asking for it ?
>
> I did the madvise() hack and am asking to see if anyone really needs
> sys_punchhole().

I run into a potential use case for it every once in a while. For example, there
was recent discussion on the User Mode Linux list about this, since the
"physical memory" that UML uses is an mmapped file, so the logical way to give
unused memory back to the host OS (initially via a hotplug memory interface
driven by some kind of daemon, since the pagecache expands to fill all
available space even when the data is also redundantly cached by the host OS)
would be via sys_punchhole().

Of course UML's physmem file is normally on a tmpfs mount, where
madvise(DONTNEED) has special behavior to work like punch anyway. So it
looks like special cases to work around this lack can be added ad infinitum
so there's never any immediate need for the actual generic functionality.

On the other hand, if you're going to support holes at all, having to recreate
the file to get your hole back is kind of silly. I personally think the
ability to create holes in a new file but not create holes in an existing
file is every bit as strange as being able to extend a file but not truncate
it. (See the java 1.1 api for an example of _that_ particular thinko...)

Rob