Hi, Miklos,
We spotted a performance bottleneck for FUSE writeback in which the
writeback kworker consumes nearly 100% CPU, of which about 40% is
spent in copy_page().
fuse_writepages_fill
alloc tmp_page
copy_highpage
This is because of the FUSE writeback design (see commit 3be5a52b30aa
("fuse: support writable mmap")), which allocates a new temp page for
each dirty page to be written back, copies the content of the dirty
page to the temp page, and then writes back the temp page instead. This
special design is intentional, to avoid potential deadlocks due to a
buggy or even malicious fuse user daemon.
There was a proposal to remove this constraint for virtiofs [1], which
is reasonable as the users of virtiofs and the virtiofs daemon don't
run on the same OS, and the virtiofs daemon is usually offered by cloud
vendors, which should not be malicious. For the normal /dev/fuse
interface, however, I don't think removing the constraint is
acceptable.
Coming back to the writeback performance bottleneck: another important
factor is that (IIUC) only one kworker at a time is allowed to do
writeback for each filesystem instance (if cgroup writeback is not
enabled). The kworker is scheduled upon sb->s_bdi->wb.dwork, and the
workqueue infrastructure guarantees that at most one *running* worker
is allowed for one specific work item (sb->s_bdi->wb.dwork) at any
time. Thus writeback is constrained to one CPU for each filesystem
instance.
I'm not sure whether offloading the page copying, and then the sending
of the FUSE requests, to another worker (once a batch of dirty pages
has been collected) is a good idea or not, e.g.
```
fuse_writepages_fill
    if fuse_writepage_need_send:
        # schedule a work

# the worker
for each dirty page in ap->pages[]:
    copy_page
fuse_writepages_send
```
Any suggestions?
This issue can be reproduced by:
1. ./libfuse/build/example/passthrough_ll -o cache=always -o writeback -o
source=/run/ /mnt
("/run/" is a tmpfs mount)
2. fio --name=write_test --ioengine=psync --iodepth=1 --rw=write --bs=1M
--direct=0 --size=1G --numjobs=2 --group_reporting --directory=/mnt
(at least two threads are needed; fio shows ~1800MiB/s buffered write
bandwidth)
[1]
https://lore.kernel.org/all/[email protected]/
--
Thanks,
Jingbo
On 6/3/24 08:17, Jingbo Xu wrote:
> Hi, Miklos,
>
> We spotted a performance bottleneck for FUSE writeback in which the
> writeback kworker has consumed nearly 100% CPU, among which 40% CPU is
> used for copy_page().
>
> fuse_writepages_fill
> alloc tmp_page
> copy_highpage
>
> This is because of FUSE writeback design (see commit 3be5a52b30aa
> ("fuse: support writable mmap")), which newly allocates a temp page for
> each dirty page to be written back, copy content of dirty page to temp
> page, and then write back the temp page instead. This special design is
> intentional to avoid potential deadlocks due to a buggy or even malicious
> fuse user daemon.
I also noticed that and I admit that I don't understand it yet. The commit says
<quote>
The basic problem is that there can be no guarantee about the time in which
the userspace filesystem will complete a write. It may be buggy or even
malicious, and fail to complete WRITE requests. We don't want unrelated parts
of the system to grind to a halt in such cases.
</quote>
Timing - NFS/cifs/etc. have the same issue, don't they? Even a local file system
has no guarantees about how fast storage is.
Buggy - hmm yeah, but then it is splice related only? Though I think the splice feature
had not been introduced yet when fuse got mmap and writeback in 2008.
Without splice the pages are just copied into a userspace buffer, so what can
userspace do wrong with its copy?
Failure - why can't it do what nfs_mapping_set_error() does?
I guess I'm missing something, but so far I don't understand what that is.
>
> There was a proposal of removing this constraint for virtiofs [1], which
> is reasonable as users of virtiofs and virtiofs daemon don't run on the
> same OS, and virtiofs daemon is usually offered by cloud vendors that
> shall not be malicious. While for the normal /dev/fuse interface, I
> don't think removing the constraint is acceptable.
>
>
> Come back to the writeback performance bottleneck. Another important
> factor is that, (IIUC) only one kworker at the same time is allowed for
> writeback for each filesystem instance (if cgroup writeback is not
> enabled). The kworker is scheduled upon sb->s_bdi->wb.dwork, and the
> workqueue infrastructure guarantees that at most one *running* worker is
> allowed for one specific work (sb->s_bdi->wb.dwork) at any time. Thus
> the writeback is constraint to one CPU for each filesystem instance.
>
> I'm not sure if offloading the page copying and then FUSE requests
> sending to another worker (if a bunch of dirty pages have been
> collected) is a good idea or not, e.g.
>
> ```
> fuse_writepages_fill
> if fuse_writepage_need_send:
> # schedule a work
>
> # the worker
> for each dirty page in ap->pages[]:
> copy_page
> fuse_writepages_send
> ```
>
> Any suggestion?
>
>
>
> This issue can be reproduced by:
>
> 1 ./libfuse/build/example/passthrough_ll -o cache=always -o writeback -o
> source=/run/ /mnt
> ("/run/" is a tmpfs mount)
>
> 2 fio --name=write_test --ioengine=psync --iodepth=1 --rw=write --bs=1M
> --direct=0 --size=1G --numjobs=2 --group_reporting --directory=/mnt
> (at least two threads are needed; fio shows ~1800MiB/s buffer write
> bandwidth)
That should quickly run out of tmpfs memory. I need to find time to improve
this a bit, but this should give you an easier test:
https://github.com/libfuse/libfuse/pull/807
>
>
> [1]
> https://lore.kernel.org/all/[email protected]/
>
>
Thanks,
Bernd
On Mon, 3 Jun 2024 at 16:43, Bernd Schubert <[email protected]> wrote:
>
>
>
> On 6/3/24 08:17, Jingbo Xu wrote:
> > Hi, Miklos,
> >
> > We spotted a performance bottleneck for FUSE writeback in which the
> > writeback kworker has consumed nearly 100% CPU, among which 40% CPU is
> > used for copy_page().
> >
> > fuse_writepages_fill
> > alloc tmp_page
> > copy_highpage
> >
> > This is because of FUSE writeback design (see commit 3be5a52b30aa
> > ("fuse: support writable mmap")), which newly allocates a temp page for
> > each dirty page to be written back, copy content of dirty page to temp
> > page, and then write back the temp page instead. This special design is
> > intentional to avoid potential deadlocks due to a buggy or even malicious
> > fuse user daemon.
>
> I also noticed that and I admit that I don't understand it yet. The commit says
>
> <quote>
> The basic problem is that there can be no guarantee about the time in which
> the userspace filesystem will complete a write. It may be buggy or even
> malicious, and fail to complete WRITE requests. We don't want unrelated parts
> of the system to grind to a halt in such cases.
> </quote>
>
>
> Timing - NFS/cifs/etc have the same issue? Even a local file system has no guarantees
> how fast storage is?
I don't have the details, but it boils down to the fact that the
allocation context provided by GFP_NOFS (PF_MEMALLOC_NOFS) cannot be
used by the unprivileged userspace server (and even if it could,
there's no guarantee that it would be).
When this mechanism was introduced, the deadlock was a real
possibility. I'm not sure that it can still happen, but proving that
it cannot might be difficult.
Thanks,
Miklos
On 6/3/24 17:19, Miklos Szeredi wrote:
> On Mon, 3 Jun 2024 at 16:43, Bernd Schubert <[email protected]> wrote:
>>
>>
>>
>> On 6/3/24 08:17, Jingbo Xu wrote:
>>> Hi, Miklos,
>>>
>>> We spotted a performance bottleneck for FUSE writeback in which the
>>> writeback kworker has consumed nearly 100% CPU, among which 40% CPU is
>>> used for copy_page().
>>>
>>> fuse_writepages_fill
>>> alloc tmp_page
>>> copy_highpage
>>>
>>> This is because of FUSE writeback design (see commit 3be5a52b30aa
>>> ("fuse: support writable mmap")), which newly allocates a temp page for
>>> each dirty page to be written back, copy content of dirty page to temp
>>> page, and then write back the temp page instead. This special design is
>>> intentional to avoid potential deadlocks due to a buggy or even malicious
>>> fuse user daemon.
>>
>> I also noticed that and I admit that I don't understand it yet. The commit says
>>
>> <quote>
>> The basic problem is that there can be no guarantee about the time in which
>> the userspace filesystem will complete a write. It may be buggy or even
>> malicious, and fail to complete WRITE requests. We don't want unrelated parts
>> of the system to grind to a halt in such cases.
>> </quote>
>>
>>
>> Timing - NFS/cifs/etc have the same issue? Even a local file system has no guarantees
>> how fast storage is?
>
> I don't have the details but it boils down to the fact that the
> allocation context provided by GFP_NOFS (PF_MEMALLOC_NOFS) cannot be
> used by the unprivileged userspace server (and even if it could,
> there's no guarantee, that it would).
>
> When this mechanism was introduced, the deadlock was a real
> possibility. I'm not sure that it can still happen, but proving that
> it cannot might be difficult.
Thanks Miklos!
I need to go through all of the GFP_NOFS allocation, but I wonder if we
could introduce cached allocations and fall back to the slow path if
that didn't work.
Thanks,
Bernd
On Mon, Jun 03, 2024 at 05:19:44PM +0200, Miklos Szeredi wrote:
> On Mon, 3 Jun 2024 at 16:43, Bernd Schubert <[email protected]> wrote:
> >
> >
> >
> > On 6/3/24 08:17, Jingbo Xu wrote:
> > > Hi, Miklos,
> > >
> > > We spotted a performance bottleneck for FUSE writeback in which the
> > > writeback kworker has consumed nearly 100% CPU, among which 40% CPU is
> > > used for copy_page().
> > >
> > > fuse_writepages_fill
> > > alloc tmp_page
> > > copy_highpage
> > >
> > > This is because of FUSE writeback design (see commit 3be5a52b30aa
> > > ("fuse: support writable mmap")), which newly allocates a temp page for
> > > each dirty page to be written back, copy content of dirty page to temp
> > > page, and then write back the temp page instead. This special design is
> > > intentional to avoid potential deadlocks due to a buggy or even malicious
> > > fuse user daemon.
> >
> > I also noticed that and I admit that I don't understand it yet. The commit says
> >
> > <quote>
> > The basic problem is that there can be no guarantee about the time in which
> > the userspace filesystem will complete a write. It may be buggy or even
> > malicious, and fail to complete WRITE requests. We don't want unrelated parts
> > of the system to grind to a halt in such cases.
> > </quote>
> >
> >
> > Timing - NFS/cifs/etc have the same issue? Even a local file system has no guarantees
> > how fast storage is?
>
> I don't have the details but it boils down to the fact that the
> allocation context provided by GFP_NOFS (PF_MEMALLOC_NOFS) cannot be
> used by the unprivileged userspace server (and even if it could,
> there's no guarantee, that it would).
I thought we had PR_SET_IO_FLUSHER for that. It requires
CAP_SYS_RESOURCE but no other privileges; the userspace
server will then always operate in the PF_MEMALLOC_NOIO |
PF_LOCAL_THROTTLE memory allocation context.
-Dave.
--
Dave Chinner
[email protected]
Hi Bernd and Miklos,
On 6/3/24 11:19 PM, Miklos Szeredi wrote:
> On Mon, 3 Jun 2024 at 16:43, Bernd Schubert <[email protected]> wrote:
>>
>>
>>
>> On 6/3/24 08:17, Jingbo Xu wrote:
>>> Hi, Miklos,
>>>
>>> We spotted a performance bottleneck for FUSE writeback in which the
>>> writeback kworker has consumed nearly 100% CPU, among which 40% CPU is
>>> used for copy_page().
>>>
>>> fuse_writepages_fill
>>> alloc tmp_page
>>> copy_highpage
>>>
>>> This is because of FUSE writeback design (see commit 3be5a52b30aa
>>> ("fuse: support writable mmap")), which newly allocates a temp page for
>>> each dirty page to be written back, copy content of dirty page to temp
>>> page, and then write back the temp page instead. This special design is
>>> intentional to avoid potential deadlocks due to a buggy or even malicious
>>> fuse user daemon.
>>
>> I also noticed that and I admit that I don't understand it yet. The commit says
>>
>> <quote>
>> The basic problem is that there can be no guarantee about the time in which
>> the userspace filesystem will complete a write. It may be buggy or even
>> malicious, and fail to complete WRITE requests. We don't want unrelated parts
>> of the system to grind to a halt in such cases.
>> </quote>
>>
>>
>> Timing - NFS/cifs/etc have the same issue? Even a local file system has no guarantees
>> how fast storage is?
>
> I don't have the details but it boils down to the fact that the
> allocation context provided by GFP_NOFS (PF_MEMALLOC_NOFS) cannot be
> used by the unprivileged userspace server (and even if it could,
> there's no guarantee, that it would).
>
> When this mechanism was introduced, the deadlock was a real
> possibility. I'm not sure that it can still happen, but proving that
> it cannot might be difficult.
IIUC, there are two sources that may cause deadlock:
1) the fuse server needs a memory allocation when processing FUSE_WRITE
requests, which in turn triggers direct memory reclaim, which then
waits on FUSE writeback - deadlock here
2) a process that triggers direct memory reclaim or calls sync(2) may
hang there forever, if the fuse server is buggy or malicious and thus
hangs while processing FUSE_WRITE requests
Thus the temp page design was introduced to avoid the above potential
issues.
I think case 1 might be fixable, but I don't know how case 2 can be
avoided, as anyone could run a fuse server in unprivileged mode. Or
does case 2 really matter? Please correct me if I'm missing something.
--
Thanks,
Jingbo
On Tue, 4 Jun 2024 at 00:10, Dave Chinner <[email protected]> wrote:
> I thought we had PR_SET_IO_FLUSHER for that. It requires
> CAP_SYS_RESOURCE but no other privileges; the userspace
> server will then always operate in the PF_MEMALLOC_NOIO |
> PF_LOCAL_THROTTLE memory allocation context.
There could be any number of services that are being used while
serving a fuse request. There's no well defined "fuse server
process", as many people seem to think. Any approach depending on
somehow marking the fuse server as a special entity will fail.
Thanks,
Miklos
On Tue, 4 Jun 2024 at 03:57, Jingbo Xu <[email protected]> wrote:
> IIUC, there are two sources that may cause deadlock:
> 1) the fuse server needs memory allocation when processing FUSE_WRITE
> requests, which in turn triggers direct memory reclaim, and FUSE
> writeback then - deadlock here
Yep, see the folio_wait_writeback() call deep in the guts of direct
reclaim, which sleeps until the PG_writeback flag is cleared. If that
happens to be triggered by the writeback in question, then that's a
deadlock.
> 2) a process that triggers direct memory reclaim or calls sync(2) may
> hang there forever, if the fuse server is buggy or malicious and thus
> hangs while processing FUSE_WRITE requests
Ah, yes, sync(2) is also an interesting case. We don't want unpriv
fuse servers to be able to block sync(2), which means that sync(2)
won't actually guarantee a synchronization of fuse's dirty pages. I
don't think there's even a theoretical solution to that, but
apparently nobody cares...
Thanks,
Miklos
>
> Thus the temp page design was introduced to avoid the above potential
> issues.
>
> I think case 1 may be fixed (if any), but I don't know how case 2 can be
> avoided as any one could run a fuse server in unprivileged mode. Or if
> case 2 really matters? Please correct me if I miss something.
>
> --
> Thanks,
> Jingbo
On 6/4/24 3:27 PM, Miklos Szeredi wrote:
> On Tue, 4 Jun 2024 at 03:57, Jingbo Xu <[email protected]> wrote:
>
>> IIUC, there are two sources that may cause deadlock:
>> 1) the fuse server needs memory allocation when processing FUSE_WRITE
>> requests, which in turn triggers direct memory reclaim, and FUSE
>> writeback then - deadlock here
>
> Yep, see the folio_wait_writeback() call deep in the guts of direct
> reclaim, which sleeps until the PG_writeback flag is cleared. If that
> happens to be triggered by the writeback in question, then that's a
> deadlock.
>
>> 2) a process that triggers direct memory reclaim or calls sync(2) may
>> hang there forever, if the fuse server is buggy or malicious and thus
>> hangs while processing FUSE_WRITE requests
>
> Ah, yes, sync(2) is also an interesting case. We don't want unpriv
> fuse servers to be able to block sync(2), which means that sync(2)
> won't actually guarantee a synchronization of fuse's dirty pages. I
> don't think there's even a theoretical solution to that, but
> apparently nobody cares...
Okay, if the temp page design is unavoidable, then I don't know whether
any approach (in the FUSE or VFS layer) would help offload the page
copying. At least we don't want writeback performance to be limited by
the single writeback kworker. That was also the initial intent of this
thread.
--
Thanks,
Jingbo
On 6/4/24 09:36, Jingbo Xu wrote:
>
>
> On 6/4/24 3:27 PM, Miklos Szeredi wrote:
>> On Tue, 4 Jun 2024 at 03:57, Jingbo Xu <[email protected]> wrote:
>>
>>> IIUC, there are two sources that may cause deadlock:
>>> 1) the fuse server needs memory allocation when processing FUSE_WRITE
>>> requests, which in turn triggers direct memory reclaim, and FUSE
>>> writeback then - deadlock here
>>
>> Yep, see the folio_wait_writeback() call deep in the guts of direct
>> reclaim, which sleeps until the PG_writeback flag is cleared. If that
>> happens to be triggered by the writeback in question, then that's a
>> deadlock.
>>
>>> 2) a process that triggers direct memory reclaim or calls sync(2) may
>>> hang there forever, if the fuse server is buggy or malicious and thus
>>> hangs while processing FUSE_WRITE requests
>>
>> Ah, yes, sync(2) is also an interesting case. We don't want unpriv
>> fuse servers to be able to block sync(2), which means that sync(2)
>> won't actually guarantee a synchronization of fuse's dirty pages. I
>> don't think there's even a theoretical solution to that, but
>> apparently nobody cares...
>
> Okay if the temp page design is unavoidable, then I don't know if there
> is any approach (in FUSE or VFS layer) helps page copy offloading. At
> least we don't want the writeback performance to be limited by the
> single writeback kworker. This is also the initial attempt of this thread.
>
Offloading it to another thread is just a workaround, though maybe a
temporary solution.
Back to the background for the copy: pages are copied to avoid blocking
on memory reclaim, yet that allocation in fact increases memory
pressure even more. Isn't the right solution to mark those pages as not
reclaimable and to avoid blocking on them? Which is what the tmp pages
do, just not in a beautiful way.
Thanks,
Bernd
On Tue, 4 Jun 2024 at 11:32, Bernd Schubert <[email protected]> wrote:
> Back to the background for the copy, so it copies pages to avoid
> blocking on memory reclaim. With that allocation it in fact increases
> memory pressure even more. Isn't the right solution to mark those pages
> as not reclaimable and to avoid blocking on it? Which is what the tmp
> pages do, just not in beautiful way.
Copying to the tmp page is the same as marking the pages as
non-reclaimable and non-syncable.
Conceptually it would be nice to only copy when there's something
actually waiting for writeback on the page.
Note: normally the WRITE request would be copied to userspace along
with the contents of the pages very soon after starting writeback.
After this the contents of the page no longer matter, and we can just
clear writeback without doing the copy.
But if the request gets stuck in the input queue before being copied
to userspace, then deadlock can still happen if the server blocks on
direct reclaim and won't continue with processing the queue. And
sync(2) will also block in that case.
So we'd somehow need to handle stuck WRITE requests. I don't see an
easy way to do this "on demand", when something actually starts
waiting on PG_writeback. Alternatively the page copy could be done
after a timeout, which is ugly, but much easier to implement.
Also splice from the fuse dev would need to copy those pages, but that
shouldn't be a problem, since it's just moving the copy from one place
to another.
Thanks,
Miklos
On 6/4/24 5:32 PM, Bernd Schubert wrote:
>
>
> On 6/4/24 09:36, Jingbo Xu wrote:
>>
>>
>> On 6/4/24 3:27 PM, Miklos Szeredi wrote:
>>> On Tue, 4 Jun 2024 at 03:57, Jingbo Xu <[email protected]> wrote:
>>>
>>>> IIUC, there are two sources that may cause deadlock:
>>>> 1) the fuse server needs memory allocation when processing FUSE_WRITE
>>>> requests, which in turn triggers direct memory reclaim, and FUSE
>>>> writeback then - deadlock here
>>>
>>> Yep, see the folio_wait_writeback() call deep in the guts of direct
>>> reclaim, which sleeps until the PG_writeback flag is cleared. If that
>>> happens to be triggered by the writeback in question, then that's a
>>> deadlock.
>>>
>>>> 2) a process that triggers direct memory reclaim or calls sync(2) may
>>>> hang there forever, if the fuse server is buggy or malicious and thus
>>>> hangs while processing FUSE_WRITE requests
>>>
>>> Ah, yes, sync(2) is also an interesting case. We don't want unpriv
>>> fuse servers to be able to block sync(2), which means that sync(2)
>>> won't actually guarantee a synchronization of fuse's dirty pages. I
>>> don't think there's even a theoretical solution to that, but
>>> apparently nobody cares...
>>
>> Okay if the temp page design is unavoidable, then I don't know if there
>> is any approach (in FUSE or VFS layer) helps page copy offloading. At
>> least we don't want the writeback performance to be limited by the
>> single writeback kworker. This is also the initial attempt of this thread.
>>
>
> Offloading it to another thread is just a workaround, though maybe a
> temporary solution.
If only we could break the limit of one single (writeback) kworker per
bdi... Apparently that's much more complicated. Just a brainstorming
idea...
I agree it's a tough thing.
--
Thanks,
Jingbo
On 6/4/24 12:02, Miklos Szeredi wrote:
> On Tue, 4 Jun 2024 at 11:32, Bernd Schubert <[email protected]> wrote:
>
>> Back to the background for the copy, so it copies pages to avoid
>> blocking on memory reclaim. With that allocation it in fact increases
>> memory pressure even more. Isn't the right solution to mark those pages
>> as not reclaimable and to avoid blocking on it? Which is what the tmp
>> pages do, just not in beautiful way.
>
> Copying to the tmp page is the same as marking the pages as
> non-reclaimable and non-syncable.
>
> Conceptually it would be nice to only copy when there's something
> actually waiting for writeback on the page.
>
> Note: normally the WRITE request would be copied to userspace along
> with the contents of the pages very soon after starting writeback.
> After this the contents of the page no longer matter, and we can just
> clear writeback without doing the copy.
>
> But if the request gets stuck in the input queue before being copied
> to userspace, then deadlock can still happen if the server blocks on
> direct reclaim and won't continue with processing the queue. And
> sync(2) will also block in that case.
>
> So we'd somehow need to handle stuck WRITE requests. I don't see an
> easy way to do this "on demand", when something actually starts
> waiting on PG_writeback. Alternatively the page copy could be done
> after a timeout, which is ugly, but much easier to implement.
I think the timeout method would only work if we have already allocated
the pages; under memory pressure, page allocation might not work well.
But then this still seems to be a workaround, because we don't use any
less memory with these copied pages.
I'm going to look into mm/ to see if there isn't a better solution.
>
> Also splice from the fuse dev would need to copy those pages, but that
> shouldn't be a problem, since it's just moving the copy from one place
> to another.
OK, at least I need to keep an eye on it so that it doesn't break when
I write a patch.
Thanks,
Bernd
On 6/4/24 18:53, Josef Bacik wrote:
> On Tue, Jun 04, 2024 at 04:13:25PM +0200, Bernd Schubert wrote:
>>
>>
>> On 6/4/24 12:02, Miklos Szeredi wrote:
>>> On Tue, 4 Jun 2024 at 11:32, Bernd Schubert <[email protected]> wrote:
>>>
>>>> Back to the background for the copy, so it copies pages to avoid
>>>> blocking on memory reclaim. With that allocation it in fact increases
>>>> memory pressure even more. Isn't the right solution to mark those pages
>>>> as not reclaimable and to avoid blocking on it? Which is what the tmp
>>>> pages do, just not in beautiful way.
>>>
>>> Copying to the tmp page is the same as marking the pages as
>>> non-reclaimable and non-syncable.
>>>
>>> Conceptually it would be nice to only copy when there's something
>>> actually waiting for writeback on the page.
>>>
>>> Note: normally the WRITE request would be copied to userspace along
>>> with the contents of the pages very soon after starting writeback.
>>> After this the contents of the page no longer matter, and we can just
>>> clear writeback without doing the copy.
>>>
>>> But if the request gets stuck in the input queue before being copied
>>> to userspace, then deadlock can still happen if the server blocks on
>>> direct reclaim and won't continue with processing the queue. And
>>> sync(2) will also block in that case.
>>>
>>> So we'd somehow need to handle stuck WRITE requests. I don't see an
>>> easy way to do this "on demand", when something actually starts
>>> waiting on PG_writeback. Alternatively the page copy could be done
>>> after a timeout, which is ugly, but much easier to implement.
>>
>> I think the timeout method would only work if we have already allocated
>> the pages, under memory pressure page allocation might not work well.
>> But then this still seems to be a workaround, because we don't take any
>> less memory with these copied pages.
>> I'm going to look into mm/ if there isn't a better solution.
>
> I've thought a bit about this, and I still don't have a good solution, so I'm
> going to throw out my random thoughts and see if it helps us get to a good spot.
>
> 1. Generally we are moving away from GFP_NOFS/GFP_NOIO to instead use
> memalloc_*_save/memalloc_*_restore, so instead the process is marked being in
> these contexts. We could do something similar for FUSE, tho this gets hairy
> with things that async off request handling to other threads (which is all of
> the FUSE file systems we have internally). We'd need to have some way to
> apply this to an entire process group, but this could be a workable solution.
>
I'm not sure how either of the two (GFP_ and memalloc_) would work for
userspace allocations.
Wouldn't we basically need a feature to disable memory
allocations for fuse userspace tasks? Hmm, maybe through mem_cgroup.
Although even then, the file system might depend on other kernel
resources (a backend file system, block device, or even the network) that
might do allocations of their own without the knowledge of the fuse server.
> 2. Per-request timeouts. This is something we're planning on tackling for other
> reasons, but it could fit nicely here to say "if this fuse fs has a
> per-request timeout, skip the copy". That way we at least have an upper
> bound on how long we would be "deadlocked". I don't love this approach
> because it's still a deadlock until the timeout elapses, but it's an idea.
Hmm, how do we know "this fuse fs has a per-request timeout"? I don't
think we could trust initialization flags set by userspace.
>
> 3. Since we're limiting writeout per the BDI, we could just say FUSE is special,
> only one memory reclaim related writeout at a time. We flag when we're doing
> a write via memory reclaim, and then if we try to trigger writeout via memory
> reclaim again we simply reject it to avoid the deadlock. This has the
> downside that non-fuse-related things triggering direct reclaim
> through FUSE will reclaim something else instead, and if the
> dirty pages from FUSE are the ones causing the problem we could spin a bunch
> evicting pages that we don't care about and thrash a bit.
Isn't that what we have right now? Reclaim basically ignores fuse tmp pages.
Thanks,
Bernd
On Tue, Jun 04, 2024 at 04:13:25PM +0200, Bernd Schubert wrote:
>
>
> On 6/4/24 12:02, Miklos Szeredi wrote:
> > On Tue, 4 Jun 2024 at 11:32, Bernd Schubert <[email protected]> wrote:
> >
> >> Back to the background for the copy, so it copies pages to avoid
> >> blocking on memory reclaim. With that allocation it in fact increases
> >> memory pressure even more. Isn't the right solution to mark those pages
> >> as not reclaimable and to avoid blocking on it? Which is what the tmp
> >> pages do, just not in beautiful way.
> >
> > Copying to the tmp page is the same as marking the pages as
> > non-reclaimable and non-syncable.
> >
> > Conceptually it would be nice to only copy when there's something
> > actually waiting for writeback on the page.
> >
> > Note: normally the WRITE request would be copied to userspace along
> > with the contents of the pages very soon after starting writeback.
> > After this the contents of the page no longer matter, and we can just
> > clear writeback without doing the copy.
> >
> > But if the request gets stuck in the input queue before being copied
> > to userspace, then deadlock can still happen if the server blocks on
> > direct reclaim and won't continue with processing the queue. And
> > sync(2) will also block in that case.
> >
> > So we'd somehow need to handle stuck WRITE requests. I don't see an
> > easy way to do this "on demand", when something actually starts
> > waiting on PG_writeback. Alternatively the page copy could be done
> > after a timeout, which is ugly, but much easier to implement.
>
> I think the timeout method would only work if we have already allocated
> the pages, under memory pressure page allocation might not work well.
> But then this still seems to be a workaround, because we don't take any
> less memory with these copied pages.
> I'm going to look into mm/ if there isn't a better solution.
I've thought a bit about this, and I still don't have a good solution, so I'm
going to throw out my random thoughts and see if it helps us get to a good spot.
1. Generally we are moving away from GFP_NOFS/GFP_NOIO to instead use
memalloc_*_save/memalloc_*_restore, so instead the process is marked being in
these contexts. We could do something similar for FUSE, tho this gets hairy
with things that async off request handling to other threads (which is all of
the FUSE file systems we have internally). We'd need to have some way to
apply this to an entire process group, but this could be a workable solution.
2. Per-request timeouts. This is something we're planning on tackling for other
reasons, but it could fit nicely here to say "if this fuse fs has a
per-request timeout, skip the copy". That way we at least have an upper
bound on how long we would be "deadlocked". I don't love this approach
because it's still a deadlock until the timeout elapses, but it's an idea.
3. Since we're limiting writeout per the BDI, we could just say FUSE is special,
only one memory reclaim related writeout at a time. We flag when we're doing
a write via memory reclaim, and then if we try to trigger writeout via memory
reclaim again we simply reject it to avoid the deadlock. This has the
downside that non-fuse-related things triggering direct reclaim through
FUSE will reclaim something else instead, and if the
dirty pages from FUSE are the ones causing the problem we could spin a bunch
evicting pages that we don't care about and thrash a bit.
As I said, all of these have downsides. I think #1 is probably the most
workable, but I haven't thought about it super thoroughly. Thanks,
Josef
On Tue, Jun 04, 2024 at 11:39:17PM +0200, Bernd Schubert wrote:
>
>
> On 6/4/24 18:53, Josef Bacik wrote:
> > On Tue, Jun 04, 2024 at 04:13:25PM +0200, Bernd Schubert wrote:
> >>
> >>
> >> On 6/4/24 12:02, Miklos Szeredi wrote:
> >>> On Tue, 4 Jun 2024 at 11:32, Bernd Schubert <[email protected]> wrote:
> >>>
> >>>> Back to the background for the copy, so it copies pages to avoid
> >>>> blocking on memory reclaim. With that allocation it in fact increases
> >>>> memory pressure even more. Isn't the right solution to mark those pages
> >>>> as not reclaimable and to avoid blocking on it? Which is what the tmp
> >>>> pages do, just not in beautiful way.
> >>>
> >>> Copying to the tmp page is the same as marking the pages as
> >>> non-reclaimable and non-syncable.
> >>>
> >>> Conceptually it would be nice to only copy when there's something
> >>> actually waiting for writeback on the page.
> >>>
> >>> Note: normally the WRITE request would be copied to userspace along
> >>> with the contents of the pages very soon after starting writeback.
> >>> After this the contents of the page no longer matter, and we can just
> >>> clear writeback without doing the copy.
> >>>
> >>> But if the request gets stuck in the input queue before being copied
> >>> to userspace, then deadlock can still happen if the server blocks on
> >>> direct reclaim and won't continue with processing the queue. And
> >>> sync(2) will also block in that case.
> >>> So we'd somehow need to handle stuck WRITE requests. I don't see an
> >>> easy way to do this "on demand", when something actually starts
> >>> waiting on PG_writeback. Alternatively the page copy could be done
> >>> after a timeout, which is ugly, but much easier to implement.
> >>
> >> I think the timeout method would only work if we have already allocated
> >> the pages, under memory pressure page allocation might not work well.
> >> But then this still seems to be a workaround, because we don't take any
> >> less memory with these copied pages.
> >> I'm going to look into mm/ if there isn't a better solution.
> >
> > I've thought a bit about this, and I still don't have a good solution, so I'm
> > going to throw out my random thoughts and see if it helps us get to a good spot.
> >
> > 1. Generally we are moving away from GFP_NOFS/GFP_NOIO to instead use
> > memalloc_*_save/memalloc_*_restore, so instead the process is marked being in
> > these contexts. We could do something similar for FUSE, tho this gets hairy
> > with things that async off request handling to other threads (which is all of
> > the FUSE file systems we have internally). We'd need to have some way to
> > apply this to an entire process group, but this could be a workable solution.
> >
>
> I'm not sure how either of the two (GFP_ and memalloc_) would work for
> userspace allocations.
> Wouldn't we basically need to have a feature to disable memory
> allocations for fuse userspace tasks? Hmm, maybe through mem_cgroup.
> Although even then, the file system might depend on other kernel
> resources (backend file system or block device or even network) that
> might do allocations on their own without the knowledge of the fuse server.
>
Basically we would only invoke this in the case that we're handling a request
from memory pressure, and then any allocation would automatically have
GFP_NOFS protection because it's flagged at the task level.
Again there are a lot of problems with this, like how we set it for the task,
how it works for threads, etc.
> > 2. Per-request timeouts. This is something we're planning on tackling for other
> > reasons, but it could fit nicely here to say "if this fuse fs has a
> > per-request timeout, skip the copy". That way we at least know we're upper
> > bound on how long we would be "deadlocked". I don't love this approach
> > because it's still a deadlock until the timeout elapsed, but it's an idea.
>
> Hmm, how do we know "this fuse fs has a per-request timeout"? I don't
> think we could trust initialization flags set by userspace.
>
It would be controlled by the kernel. So at init time the fuse file system says
"my command timeout is 30 minutes." Then the kernel enforces this by having a
per-request timeout, and once that 30 minutes elapses we cancel the request and
EIO it. User space doesn't do anything beyond telling the kernel what its
timeout is, so this would be safe.
> >
> > 3. Since we're limiting writeout per the BDI, we could just say FUSE is special,
> > only one memory reclaim related writeout at a time. We flag when we're doing
> > a write via memory reclaim, and then if we try to trigger writeout via memory
> > reclaim again we simply reject it to avoid the deadlock. This has the
> > downside of making it so non-fuse related things that may be triggering
> > direct reclaim through FUSE means they'll reclaim something else, and if the
> > dirty pages from FUSE are the ones causing the problem we could spin a bunch
> > evicting pages that we don't care about and thrashing a bit.
>
>
> Isn't that what we have right now? Reclaim basically ignores fuse tmp pages.
Yes, but I'd extend it to no longer have tmp pages and tie it to the BDI
instead; my goal is to get rid of all the excess copying. Thanks,
Josef
On Wed, Jun 5, 2024 at 1:17 AM Josef Bacik <[email protected]> wrote:
>
> On Tue, Jun 04, 2024 at 11:39:17PM +0200, Bernd Schubert wrote:
> >
> >
> > On 6/4/24 18:53, Josef Bacik wrote:
> > > On Tue, Jun 04, 2024 at 04:13:25PM +0200, Bernd Schubert wrote:
> > >>
> > >>
> > >> On 6/4/24 12:02, Miklos Szeredi wrote:
> > >>> On Tue, 4 Jun 2024 at 11:32, Bernd Schubert <[email protected]> wrote:
> > >>>
> > >>>> Back to the background for the copy, so it copies pages to avoid
> > >>>> blocking on memory reclaim. With that allocation it in fact increases
> > >>>> memory pressure even more. Isn't the right solution to mark those pages
> > >>>> as not reclaimable and to avoid blocking on it? Which is what the tmp
> > >>>> pages do, just not in beautiful way.
> > >>>
> > >>> Copying to the tmp page is the same as marking the pages as
> > >>> non-reclaimable and non-syncable.
> > >>>
> > >>> Conceptually it would be nice to only copy when there's something
> > >>> actually waiting for writeback on the page.
> > >>>
> > >>> Note: normally the WRITE request would be copied to userspace along
> > >>> with the contents of the pages very soon after starting writeback.
> > >>> After this the contents of the page no longer matter, and we can just
> > >>> clear writeback without doing the copy.
> > >>>
> > >>> But if the request gets stuck in the input queue before being copied
> > >>> to userspace, then deadlock can still happen if the server blocks on
> > >>> direct reclaim and won't continue with processing the queue. And
> > >>> sync(2) will also block in that case.
> > >>> So we'd somehow need to handle stuck WRITE requests. I don't see an
> > >>> easy way to do this "on demand", when something actually starts
> > >>> waiting on PG_writeback. Alternatively the page copy could be done
> > >>> after a timeout, which is ugly, but much easier to implement.
> > >>
> > >> I think the timeout method would only work if we have already allocated
> > >> the pages, under memory pressure page allocation might not work well.
> > >> But then this still seems to be a workaround, because we don't take any
> > >> less memory with these copied pages.
> > >> I'm going to look into mm/ if there isn't a better solution.
> > >
> > > I've thought a bit about this, and I still don't have a good solution, so I'm
> > > going to throw out my random thoughts and see if it helps us get to a good spot.
> > >
> > > 1. Generally we are moving away from GFP_NOFS/GFP_NOIO to instead use
> > > memalloc_*_save/memalloc_*_restore, so instead the process is marked being in
> > > these contexts. We could do something similar for FUSE, tho this gets hairy
> > > with things that async off request handling to other threads (which is all of
> > > the FUSE file systems we have internally). We'd need to have some way to
> > > apply this to an entire process group, but this could be a workable solution.
> > >
> >
> > I'm not sure how either of the two (GFP_ and memalloc_) would work for
> > userspace allocations.
> > Wouldn't we basically need to have a feature to disable memory
> > allocations for fuse userspace tasks? Hmm, maybe through mem_cgroup.
> > Although even then, the file system might depend on other kernel
> > resources (backend file system or block device or even network) that
> > might do allocations on their own without the knowledge of the fuse server.
> >
>
> Basically that only in the case that we're handling a request from memory
> pressure we would invoke this, and then any allocation would automatically have
> gfp_nofs protection because it's flagged at the task level.
>
> Again there's a lot of problems with this, like how do we set it for the task,
> how does it work for threads etc.
>
> > > 2. Per-request timeouts. This is something we're planning on tackling for other
> > > reasons, but it could fit nicely here to say "if this fuse fs has a
> > > per-request timeout, skip the copy". That way we at least know we're upper
> > > bound on how long we would be "deadlocked". I don't love this approach
> > > because it's still a deadlock until the timeout elapsed, but it's an idea.
> >
> > Hmm, how do we know "this fuse fs has a per-request timeout"? I don't
> > think we could trust initialization flags set by userspace.
> >
>
> It would be controlled by the kernel. So at init time the fuse file system says
> "my command timeout is 30 minutes." Then the kernel enforces this by having a
> per-request timeout, and once that 30 minutes elapses we cancel the request and
> > EIO it. User space doesn't do anything beyond telling the kernel what its
> > timeout is, so this would be safe.
>
Maybe that would be better configured by the mounter, similar to nfs -o timeo,
and maybe consider an opt-in to returning ETIMEDOUT in this case.
At least nfsd will pass that error to the nfs client and the nfs client will retry.
Different applications (or network protocols) handle timeouts differently,
so the timeout and error seem like a decision for the admin/mounter, not
for the fuse server, although there may be a fuse fs that would want to
set the default timeout, as if to request the kernel to be its watchdog
(i.e. do not expect me to take more than 30 min to handle any request).
Thanks,
Amir.
On Wed, Jun 05, 2024 at 08:49:48AM +0300, Amir Goldstein wrote:
> On Wed, Jun 5, 2024 at 1:17 AM Josef Bacik <[email protected]> wrote:
> >
> > On Tue, Jun 04, 2024 at 11:39:17PM +0200, Bernd Schubert wrote:
> > >
> > >
> > > On 6/4/24 18:53, Josef Bacik wrote:
> > > > On Tue, Jun 04, 2024 at 04:13:25PM +0200, Bernd Schubert wrote:
> > > >>
> > > >>
> > > >> On 6/4/24 12:02, Miklos Szeredi wrote:
> > > >>> On Tue, 4 Jun 2024 at 11:32, Bernd Schubert <[email protected]> wrote:
> > > >>>
> > > >>>> Back to the background for the copy, so it copies pages to avoid
> > > >>>> blocking on memory reclaim. With that allocation it in fact increases
> > > >>>> memory pressure even more. Isn't the right solution to mark those pages
> > > >>>> as not reclaimable and to avoid blocking on it? Which is what the tmp
> > > >>>> pages do, just not in beautiful way.
> > > >>>
> > > >>> Copying to the tmp page is the same as marking the pages as
> > > >>> non-reclaimable and non-syncable.
> > > >>>
> > > >>> Conceptually it would be nice to only copy when there's something
> > > >>> actually waiting for writeback on the page.
> > > >>>
> > > >>> Note: normally the WRITE request would be copied to userspace along
> > > >>> with the contents of the pages very soon after starting writeback.
> > > >>> After this the contents of the page no longer matter, and we can just
> > > >>> clear writeback without doing the copy.
> > > >>>
> > > >>> But if the request gets stuck in the input queue before being copied
> > > >>> to userspace, then deadlock can still happen if the server blocks on
> > > >>> direct reclaim and won't continue with processing the queue. And
> > > >>> sync(2) will also block in that case.
> > > >>> So we'd somehow need to handle stuck WRITE requests. I don't see an
> > > >>> easy way to do this "on demand", when something actually starts
> > > >>> waiting on PG_writeback. Alternatively the page copy could be done
> > > >>> after a timeout, which is ugly, but much easier to implement.
> > > >>
> > > >> I think the timeout method would only work if we have already allocated
> > > >> the pages, under memory pressure page allocation might not work well.
> > > >> But then this still seems to be a workaround, because we don't take any
> > > >> less memory with these copied pages.
> > > >> I'm going to look into mm/ if there isn't a better solution.
> > > >
> > > > I've thought a bit about this, and I still don't have a good solution, so I'm
> > > > going to throw out my random thoughts and see if it helps us get to a good spot.
> > > >
> > > > 1. Generally we are moving away from GFP_NOFS/GFP_NOIO to instead use
> > > > memalloc_*_save/memalloc_*_restore, so instead the process is marked being in
> > > > these contexts. We could do something similar for FUSE, tho this gets hairy
> > > > with things that async off request handling to other threads (which is all of
> > > > the FUSE file systems we have internally). We'd need to have some way to
> > > > apply this to an entire process group, but this could be a workable solution.
> > > >
> > >
> > > I'm not sure how either of the two (GFP_ and memalloc_) would work for
> > > userspace allocations.
> > > Wouldn't we basically need to have a feature to disable memory
> > > allocations for fuse userspace tasks? Hmm, maybe through mem_cgroup.
> > > Although even then, the file system might depend on other kernel
> > > resources (backend file system or block device or even network) that
> > > might do allocations on their own without the knowledge of the fuse server.
> > >
> >
> > Basically that only in the case that we're handling a request from memory
> > pressure we would invoke this, and then any allocation would automatically have
> > gfp_nofs protection because it's flagged at the task level.
> >
> > Again there's a lot of problems with this, like how do we set it for the task,
> > how does it work for threads etc.
> >
> > > > 2. Per-request timeouts. This is something we're planning on tackling for other
> > > > reasons, but it could fit nicely here to say "if this fuse fs has a
> > > > per-request timeout, skip the copy". That way we at least know we're upper
> > > > bound on how long we would be "deadlocked". I don't love this approach
> > > > because it's still a deadlock until the timeout elapsed, but it's an idea.
> > >
> > > Hmm, how do we know "this fuse fs has a per-request timeout"? I don't
> > > think we could trust initialization flags set by userspace.
> > >
> >
> > It would be controlled by the kernel. So at init time the fuse file system says
> > "my command timeout is 30 minutes." Then the kernel enforces this by having a
> > per-request timeout, and once that 30 minutes elapses we cancel the request and
> > > EIO it. User space doesn't do anything beyond telling the kernel what its
> > > timeout is, so this would be safe.
> >
>
> Maybe that would be better to configure by mounter, similar to nfs -otimeo
> and maybe consider opt-in to returning ETIMEDOUT in this case.
> At least nfsd will pass that error to nfs client and nfs client will retry.
>
> Different applications (or network protocols) handle timeouts differently,
> so the timeout and error seems like a decision for the admin/mounter not
> for the fuse server, although there may be a fuse fs that would want to
> set the default timeout, as if to request the kernel to be its watchdog
> (i.e. do not expect me to take more than 30 min to handle any request).
Oh yeah for sure, I'm just saying for the purposes of allowing the FUSE daemon
to be a little riskier with system resources we base it off of whether it opts
in to command timeouts.
My plan is to have it settable by the fuse daemon, or externally by a
sysadmin via sysfs. Thanks,
Josef