Hi Andrew,
Today's linux-next merge of the akpm tree got a conflict in
fs/proc/generic.c between several commits from the vfs tree and commit
"procfs: improve scaling in proc" from the akpm tree.
I just dropped the akpm tree patch (and the following
"procfs-improve-scaling-in-proc-v5") as the conflicts are a bit complex.
--
Cheers,
Stephen Rothwell [email protected]
On Thu, 4 Apr 2013 17:26:48 +1100 Stephen Rothwell <[email protected]> wrote:
> Hi Andrew,
>
> Today's linux-next merge of the akpm tree got a conflict in
> fs/proc/generic.c between several commits from the vfs tree and commit
> "procfs: improve scaling in proc" from the akpm tree.
>
> I just dropped the akpm tree patch (and the following
> "procfs-improve-scaling-in-proc-v5") as the conflicts are a bit complex.
Well perhaps the vfs tree should start paying some attention to the
rest of the world, particularly after -rc5.
On Wed, 3 Apr 2013 23:56:34 -0700 Andrew Morton <[email protected]> wrote:
> On Thu, 4 Apr 2013 17:26:48 +1100 Stephen Rothwell <[email protected]> wrote:
>
> > Hi Andrew,
> >
> > Today's linux-next merge of the akpm tree got a conflict in
> > fs/proc/generic.c between several commits from the vfs tree and commit
> > "procfs: improve scaling in proc" from the akpm tree.
> >
> > I just dropped the akpm tree patch (and the following
> > "procfs-improve-scaling-in-proc-v5") as the conflicts are a bit complex.
>
> Well perhaps the vfs tree should start paying some attention to the
> rest of the world, particularly after -rc5.
I can't even find this "lift sb_start_write() out of ->write()". Not on fsdevel,
not on lkml. What the heck is it and why was it so important?
On Wed, Apr 03, 2013 at 11:56:34PM -0700, Andrew Morton wrote:
> On Thu, 4 Apr 2013 17:26:48 +1100 Stephen Rothwell <[email protected]> wrote:
>
> > Hi Andrew,
> >
> > Today's linux-next merge of the akpm tree got a conflict in
> > fs/proc/generic.c between several commits from the vfs tree and commit
> > "procfs: improve scaling in proc" from the akpm tree.
> >
> > I just dropped the akpm tree patch (and the following
> > "procfs-improve-scaling-in-proc-v5") as the conflicts are a bit complex.
>
> Well perhaps the vfs tree should start paying some attention to the
> rest of the world, particularly after -rc5.
I'm sorry, but... not in this case. There are seriously nasty races around
remove_proc_entry()/proc_reg_release() and the whole area needs a rewrite.
Tentative fix is in vfs.git#experimental; I hadn't pushed it into #for-next
yet, but Nathan's patches are definitely going to buggered by any realistic
solution. For now I'm going for the dumbest variant possible; ->pde_users
will eventually become atomic_t, but it'll be modified by
atomic_inc_unless_negative(), not atomic_inc() and ->proc_fops won't be
zeroed at all. But before we do that, I want to get that sucker tested and
stabilized - the whole thing is sufficiently convoluted for those races to
stay unnoticed for a long time in the first place.
I am aware of those patches; it's just that they'll need to be redone - the
code being optimized is broken and needs to be fixed. Longer term, I want
to lift the whole thing into VFS proper; it *can* be done with minimal
overhead and it'll get us a large part of revoke(2) if done right. Basically,
what I want is a pair of new types - struct revoke and struct revokable.
* struct file gets a pointer to struct revoke; set in ->open() by
file_revokable(file, revokable) and never changed afterwards. No locking
is needed to check it.
* struct revoke contains a pointer back to struct file (never changed)
+ pointer to struct revokable (RCU-protected, zeroed on revoke) + mutex/count
(serializes __fput() vs. revoke() deciding who's going to call ->release() and
who'll be waiting; see pdeo->mutex and pdep->count in #experimental) + cyclic
list anchored in struct revokable (list_del_init() after ->release() had been
done, under revoke->mutex and revokable->lock).
* struct revokable contains an anchor for aforementioned cyclic
list + spinlock + atomic_t in_use (a-la pde->pde_users) + pointer to
completion + one method (->kick(); see below). Freed via RCU.
* start_using(file) is an inlined helper, returning true if
file->f_revoke is NULL; if it's not NULL, we do rcu_read_lock() and look
at file->f_revoke->revokable. If it's NULL - rcu_read_unlock() and return
false (file had been revoked). If it's not, atomic_inc_unless_negative()
of revokable->in_use. Then rcu_read_unlock() and return - true if
->in_use used to be non-negative (file not revoked, revoke will wait) and
false if it was negative (file in process of being revoked).
* stop_using(file) - inlined helper, does nothing if file->f_revoke
is NULL, otherwise decrements revokable->in_use and if it's reached
BIAS (large negative), complete(revokable->completion).
* all normal method calls (everything except ->release()) are
turned into
if (likely(start_using(file)) {
res = method call
stop_using(file);
} else {
res = <appropriate for method>;
}
Overhead for non-revokable files is trivial - we just check one field in
struct file and if it's in the same cacheline as ->f_op, we are not going
to see any real delays.
* new helper - release_revoke(); similar to close_pdeo() in
vfs.git#experimenatal. __fput() checks ->f_revoke and, if that sucker's
non-NULL, does rcu_read_lock(), checks ->revokable, grabs ->lock on it
and does release_revoke(). If ->f_revoke is NULL, call ->release() as
we do now. Note that unlike struct pde_opener, these guys would be
created with ->count equal to 1 - IOW, file->f_revoke would contribute to it.
* do_revoke(revokable) starts with setting ->completion and adding
BIAS to ->in_use; if it non-zero prior to that, call ->kick(revokable) and
wait for completion. ->kick() should essentially wake up those who are
sleeping in ->read() or ->write() and make them return, be it with EAGAIN
or short read/write. Empty for procfs, for something like TTY it should
imitate hangup. That's where driver-specific logics in revoke(2) would
live. Once do_revoke() has finished waiting, we know that nobody is in
method calls (except possibly ->release()) and nobody will manage to enter
them from now on. We grab ->lock, pick the first struct revoke from the
list, do release_revoke() to it and keep doing that to these guys until
none is left.
procfs would have struct revokable embedded into proc_dir_entry, with
freeing of those guys RCUd. It would have file_revokable() done in
proc_reg_open() (that would do allocation of struct revoke, adding it
to the cyclic list, etc.) and set ->f_op to ->proc_fops; no wrappers
needed anymore. All file_operations instances fed to proc_create()
et.al. would lose ->owner - it's already not needed for those, actually.
remove_proc_entry()/remove_proc_subtree() would call do_revoke() on
everything we are removing.
Getting from there to working revoke(2) is probably not too hard; the
interesting part is what should we associate struct revokable with and
what should revoke(2) in progress do to new attempts to open the damn
device. Hell knows - not sure what's the right semantics here.
On Thu, Apr 04, 2013 at 12:02:53AM -0700, Andrew Morton wrote:
> > Well perhaps the vfs tree should start paying some attention to the
> > rest of the world, particularly after -rc5.
>
> I can't even find this "lift sb_start_write() out of ->write()". Not on fsdevel,
> not on lkml. What the heck is it and why was it so important?
Deadlocks around splice; see the threads re overlayfs/unionmount/aufs and
deadlocks in their copyup implementations. See also XFS freeze-related
deadlocks, etc.
The thing is, sb_start_write() is pretty high in locking hierarchy (outside
->i_mutex, etc.), but ->splice_write() and friends had it buried pretty
deep. With distinctly unpleasant results, including ->..._write() instances
using generic ones (which took the lock) *and* doing some IO outside of those
(ext4, for example; ocfs2 also looked fishy in that respect, IIRC).
The obvious solution is to lift taking that lock out of the methods, which
had been done. It had been discussed on fsdevel and sat in #experimental for
several weeks; time for it to go into #for-next.
On 04/04/2013 03:02 AM, Al Viro wrote:
> On Wed, Apr 03, 2013 at 11:56:34PM -0700, Andrew Morton wrote:
>> On Thu, 4 Apr 2013 17:26:48 +1100 Stephen Rothwell <[email protected]> wrote:
>>
>>> Hi Andrew,
>>>
>>> Today's linux-next merge of the akpm tree got a conflict in
>>> fs/proc/generic.c between several commits from the vfs tree and commit
>>> "procfs: improve scaling in proc" from the akpm tree.
>>>
>>> I just dropped the akpm tree patch (and the following
>>> "procfs-improve-scaling-in-proc-v5") as the conflicts are a bit complex.
>> Well perhaps the vfs tree should start paying some attention to the
>> rest of the world, particularly after -rc5.
> I'm sorry, but... not in this case. There are seriously nasty races around
> remove_proc_entry()/proc_reg_release() and the whole area needs a rewrite.
> Tentative fix is in vfs.git#experimental; I hadn't pushed it into #for-next
> yet, but Nathan's patches are definitely going to buggered by any realistic
> solution.
In this case I will resubmit my first patch for moving the kfree in
proc_reg_release.
This moves a kfree outside a spinlock to help scaling on larger (512 core)
systems. This should be some relief until we can move the section to use
the rcu.
I ran a simple test which just reads from /proc/cpuinfo.
Lower is better, as you can see the worst case scenario is improved.
baseline moved kfree
tasks read-sec read-sec
1 0.0141 0.0141
2 0.0140 0.0140
4 0.0140 0.0141
8 0.0145 0.0145
16 0.0553 0.0548
32 0.1688 0.1622
64 0.5017 0.3856
128 1.7005 0.9710
256 5.2513 2.6519
512 8.0529 6.2976
Cc: "Eric W. Biederman" <[email protected]>
Cc: Andrew Morton <[email protected]>
Cc: Alexander Viro <[email protected]>
Cc: David Woodhouse <[email protected]>
Cc: <[email protected]>
Acked-by: Alexey Dobriyan <[email protected]>
Signed-off-by: Nathan Zimmer <[email protected]>
---
fs/proc/inode.c | 5 ++---
1 file changed, 2 insertions(+), 3 deletions(-)
diff --git a/fs/proc/inode.c b/fs/proc/inode.c
index 439ae688..863608b 100644
--- a/fs/proc/inode.c
+++ b/fs/proc/inode.c
@@ -403,11 +403,10 @@ static int proc_reg_release(struct inode *inode, struct file *file)
}
pde->pde_users++;
release = pde->proc_fops->release;
- if (pdeo) {
+ if (pdeo)
list_del(&pdeo->lh);
- kfree(pdeo);
- }
spin_unlock(&pde->pde_unload_lock);
+ kfree(pdeo);
if (release)
rv = release(inode, file);
--
1.8.1.2
On Thu, Apr 04, 2013 at 10:53:39AM -0500, Nathan Zimmer wrote:
> This moves a kfree outside a spinlock to help scaling on larger (512 core)
> systems. This should be some relief until we can move the section to use
> the rcu.
Umm... That'll get wrecked as soon as fixes from #experimental go in;
FWIW, I'd probably make close_pdeo() return pdeo or NULL, depending on
whether we want it freed. With kfree() itself taken to callers.
But there's much bigger fish to fry there - turn use_pde() into
return atomic_inc_unless_negative(&pde->pde_users), unuse_pde() into
if (atomic_dec_return(&pde->pde_users) == BIAS) complete(pde->....)
and make sure entry_rundown() sets completion *before* adding BIAS
to pde_users and waits for it only if the sum was equal to BIAS.
The spinlock is still needed, but only on the "now taking care of
any pdeo that might still be around" side of things - it protects
pdeo list.
Again, see the last two commits of vfs.git#experimental. I'd certainly
appreciate any extra eyes on that sucker...
On 04/04/2013 11:11 AM, Al Viro wrote:
> On Thu, Apr 04, 2013 at 10:53:39AM -0500, Nathan Zimmer wrote:
>> This moves a kfree outside a spinlock to help scaling on larger (512 core)
>> systems. This should be some relief until we can move the section to use
>> the rcu.
> Umm... That'll get wrecked as soon as fixes from #experimental go in;
> FWIW, I'd probably make close_pdeo() return pdeo or NULL, depending on
> whether we want it freed. With kfree() itself taken to callers.
> But there's much bigger fish to fry there - turn use_pde() into
> return atomic_inc_unless_negative(&pde->pde_users), unuse_pde() into
> if (atomic_dec_return(&pde->pde_users) == BIAS) complete(pde->....)
> and make sure entry_rundown() sets completion *before* adding BIAS
> to pde_users and waits for it only if the sum was equal to BIAS.
> The spinlock is still needed, but only on the "now taking care of
> any pdeo that might still be around" side of things - it protects
> pdeo list.
>
> Again, see the last two commits of vfs.git#experimental. I'd certainly
> appreciate any extra eyes on that sucker...
Ok I am cloning the tree now.
It does look like the patches would conflict.
I'll run some tests and take a deeper look.
On Thu, Apr 04, 2013 at 12:12:05PM -0500, Nathan Zimmer wrote:
> Ok I am cloning the tree now.
> It does look like the patches would conflict.
> I'll run some tests and take a deeper look.
FWIW, I've just pushed there a tentative patch that switches to hopefully
saner locking (head should be at cb673c115c1f99d3480471ca5d8cb3f89a1e3bee).
Is that more or less what you want wrt spinlock contention?
One note: for any given pde_opener, close_pdeo() can be called at most
by two threads - final fput() and remove_proc_entry() resp. I think
the use of completion + flag is safe there; pde->pde_unload_lock
should serialize the critical areas.
Hi Al,
On Thu, 4 Apr 2013 09:10:11 +0100 Al Viro <[email protected]> wrote:
>
> On Thu, Apr 04, 2013 at 12:02:53AM -0700, Andrew Morton wrote:
>
> > > Well perhaps the vfs tree should start paying some attention to the
> > > rest of the world, particularly after -rc5.
> >
> > I can't even find this "lift sb_start_write() out of ->write()". Not on fsdevel,
> > not on lkml. What the heck is it and why was it so important?
>
> Deadlocks around splice; see the threads re overlayfs/unionmount/aufs and
> deadlocks in their copyup implementations. See also XFS freeze-related
> deadlocks, etc.
>
> The thing is, sb_start_write() is pretty high in locking hierarchy (outside
> ->i_mutex, etc.), but ->splice_write() and friends had it buried pretty
> deep. With distinctly unpleasant results, including ->..._write() instances
> using generic ones (which took the lock) *and* doing some IO outside of those
> (ext4, for example; ocfs2 also looked fishy in that respect, IIRC).
>
> The obvious solution is to lift taking that lock out of the methods, which
> had been done. It had been discussed on fsdevel and sat in #experimental for
> several weeks; time for it to go into #for-next.
It would have been useful to put something like that in the commit message ...
--
Cheers,
Stephen Rothwell [email protected]
On 04/04/2013 03:44 PM, Al Viro wrote:
> On Thu, Apr 04, 2013 at 12:12:05PM -0500, Nathan Zimmer wrote:
>
>> Ok I am cloning the tree now.
>> It does look like the patches would conflict.
>> I'll run some tests and take a deeper look.
> FWIW, I've just pushed there a tentative patch that switches to hopefully
> saner locking (head should be at cb673c115c1f99d3480471ca5d8cb3f89a1e3bee).
> Is that more or less what you want wrt spinlock contention?
>
> One note: for any given pde_opener, close_pdeo() can be called at most
> by two threads - final fput() and remove_proc_entry() resp. I think
> the use of completion + flag is safe there; pde->pde_unload_lock
> should serialize the critical areas.
Something isn't quite right. I keep getting hung during boot.
dracut: Mounted root filesystem /dev/sda8
dracut: Switching root
I'll try to get some more info on a smaller box.
On Fri, Apr 05, 2013 at 12:05:26PM -0500, Nathan Zimmer wrote:
> On 04/04/2013 03:44 PM, Al Viro wrote:
> >On Thu, Apr 04, 2013 at 12:12:05PM -0500, Nathan Zimmer wrote:
> >
> >>Ok I am cloning the tree now.
> >>It does look like the patches would conflict.
> >>I'll run some tests and take a deeper look.
> >FWIW, I've just pushed there a tentative patch that switches to hopefully
> >saner locking (head should be at cb673c115c1f99d3480471ca5d8cb3f89a1e3bee).
> >Is that more or less what you want wrt spinlock contention?
> >
> >One note: for any given pde_opener, close_pdeo() can be called at most
> >by two threads - final fput() and remove_proc_entry() resp. I think
> >the use of completion + flag is safe there; pde->pde_unload_lock
> >should serialize the critical areas.
>
> Something isn't quite right. I keep getting hung during boot.
> dracut: Mounted root filesystem /dev/sda8
> dracut: Switching root
>
> I'll try to get some more info on a smaller box.
Umm... Try to add WARN_ON(1) in entry_rundown(), just to see what's
getting hit (don't bother with entry name, stack trace will be enough).
On 04/05/2013 12:36 PM, Al Viro wrote:
> On Fri, Apr 05, 2013 at 12:05:26PM -0500, Nathan Zimmer wrote:
>> On 04/04/2013 03:44 PM, Al Viro wrote:
>>> On Thu, Apr 04, 2013 at 12:12:05PM -0500, Nathan Zimmer wrote:
>>>
>>>> Ok I am cloning the tree now.
>>>> It does look like the patches would conflict.
>>>> I'll run some tests and take a deeper look.
>>> FWIW, I've just pushed there a tentative patch that switches to hopefully
>>> saner locking (head should be at cb673c115c1f99d3480471ca5d8cb3f89a1e3bee).
>>> Is that more or less what you want wrt spinlock contention?
>>>
>>> One note: for any given pde_opener, close_pdeo() can be called at most
>>> by two threads - final fput() and remove_proc_entry() resp. I think
>>> the use of completion + flag is safe there; pde->pde_unload_lock
>>> should serialize the critical areas.
>> Something isn't quite right. I keep getting hung during boot.
>> dracut: Mounted root filesystem /dev/sda8
>> dracut: Switching root
>>
>> I'll try to get some more info on a smaller box.
> Umm... Try to add WARN_ON(1) in entry_rundown(), just to see what's
> getting hit (don't bother with entry name, stack trace will be enough).
That didn't produce anything. I'll run some bisections over the weekend
and see what I can sort out.
On Fri, Apr 05, 2013 at 03:56:17PM -0500, Nathan Zimmer wrote:
> That didn't produce anything. I'll run some bisections over the
> weekend and see what I can sort out.
*Ugh*
I'd try to build with DEBUG_KMEMLEAK and slapped printks on the entry
and exit from close_pdeo(). If that doesn't show anything interesting,
it's probably unrelated to procfs...
On 04/05/2013 04:00 PM, Al Viro wrote:
> On Fri, Apr 05, 2013 at 03:56:17PM -0500, Nathan Zimmer wrote:
>
>> That didn't produce anything. I'll run some bisections over the
>> weekend and see what I can sort out.
> *Ugh*
>
> I'd try to build with DEBUG_KMEMLEAK and slapped printks on the entry
> and exit from close_pdeo(). If that doesn't show anything interesting,
> it's probably unrelated to procfs...
My bisection pointed me to this commit: e41efbf13c15
At this point I am assuming my issue is unrelated to procfs.
On Mon, Apr 08, 2013 at 10:34:07AM -0500, Nathan Zimmer wrote:
> On 04/05/2013 04:00 PM, Al Viro wrote:
> >On Fri, Apr 05, 2013 at 03:56:17PM -0500, Nathan Zimmer wrote:
> >
> >>That didn't produce anything. I'll run some bisections over the
> >>weekend and see what I can sort out.
> >*Ugh*
> >
> >I'd try to build with DEBUG_KMEMLEAK and slapped printks on the entry
> >and exit from close_pdeo(). If that doesn't show anything interesting,
> >it's probably unrelated to procfs...
> My bisection pointed me to this commit: e41efbf13c15
> At this point I am assuming my issue is unrelated to procfs.
Huh? That commit simply moves three functions and one struct from one file to
another...
On 04/08/2013 10:58 AM, Al Viro wrote:
> On Mon, Apr 08, 2013 at 10:34:07AM -0500, Nathan Zimmer wrote:
>> On 04/05/2013 04:00 PM, Al Viro wrote:
>>> On Fri, Apr 05, 2013 at 03:56:17PM -0500, Nathan Zimmer wrote:
>>>
>>>> That didn't produce anything. I'll run some bisections over the
>>>> weekend and see what I can sort out.
>>> *Ugh*
>>>
>>> I'd try to build with DEBUG_KMEMLEAK and slapped printks on the entry
>>> and exit from close_pdeo(). If that doesn't show anything interesting,
>>> it's probably unrelated to procfs...
>> My bisection pointed me to this commit: e41efbf13c15
>> At this point I am assuming my issue is unrelated to procfs.
> Huh? That commit simply moves three functions and one struct from one file to
> another...
Yea, I was hoping it made more sense to you then it did do me.
I have some lab time later. I'll verify that is the problem commit.
On 04/08/2013 10:58 AM, Al Viro wrote:
> On Mon, Apr 08, 2013 at 10:34:07AM -0500, Nathan Zimmer wrote:
>> On 04/05/2013 04:00 PM, Al Viro wrote:
>>> On Fri, Apr 05, 2013 at 03:56:17PM -0500, Nathan Zimmer wrote:
>>>
>>>> That didn't produce anything. I'll run some bisections over the
>>>> weekend and see what I can sort out.
>>> *Ugh*
>>>
>>> I'd try to build with DEBUG_KMEMLEAK and slapped printks on the entry
>>> and exit from close_pdeo(). If that doesn't show anything interesting,
>>> it's probably unrelated to procfs...
>> My bisection pointed me to this commit: e41efbf13c15
>> At this point I am assuming my issue is unrelated to procfs.
> Huh? That commit simply moves three functions and one struct from one file to
> another...
Further digging seems to indicate 9984d7394618df9, the one right after
the commit I previously identified.
Not sure what I did wrong with my bisect to put it off by one.
On Mon, Apr 08, 2013 at 03:52:08PM -0500, Nathan Zimmer wrote:
> Further digging seems to indicate 9984d7394618df9, the one right
> after the commit I previously identified.
> Not sure what I did wrong with my bisect to put it off by one.
Ugh... Can't reproduce here ;-/ Could you give more details on your
setup?
On Mon, Apr 08, 2013 at 10:23:27PM +0100, Al Viro wrote:
> On Mon, Apr 08, 2013 at 03:52:08PM -0500, Nathan Zimmer wrote:
>
> > Further digging seems to indicate 9984d7394618df9, the one right
> > after the commit I previously identified.
> > Not sure what I did wrong with my bisect to put it off by one.
>
> Ugh... Can't reproduce here ;-/ Could you give more details on your
> setup?
Anyway, I've just pushed a splitup of that commit (carved in 3 pieces)
into vfs.git#pipe-splitup; could you check which part triggers that
hang? Should propagate in a few...
On Mon, Apr 08, 2013 at 10:23:27PM +0100, Al Viro wrote:
> On Mon, Apr 08, 2013 at 03:52:08PM -0500, Nathan Zimmer wrote:
>
> > Further digging seems to indicate 9984d7394618df9, the one right
> > after the commit I previously identified.
> > Not sure what I did wrong with my bisect to put it off by one.
>
> Ugh... Can't reproduce here ;-/ Could you give more details on your
> setup?
It is a fairly small system, 12 processors.
It is based on rhel 6.4.
Here is my config if that helps.
Other then that I am not sure what you need.
On 04/08/2013 03:48 PM, Al Viro wrote:
> On Mon, Apr 08, 2013 at 10:23:27PM +0100, Al Viro wrote:
>> On Mon, Apr 08, 2013 at 03:52:08PM -0500, Nathan Zimmer wrote:
>>
>>> Further digging seems to indicate 9984d7394618df9, the one right
>>> after the commit I previously identified.
>>> Not sure what I did wrong with my bisect to put it off by one.
>>
>> Ugh... Can't reproduce here ;-/ Could you give more details on your
>> setup?
>
> Anyway, I've just pushed a splitup of that commit (carved in 3 pieces)
> into vfs.git#pipe-splitup; could you check which part triggers that
> hang? Should propagate in a few...
It looks like "pipe: unify ->release() and ->open()" introduces the
problem. Note that I had to add a prototype for fifo_open() before the
structs that reference it for that commit to compile.
Hi,
On Mon, Apr 8, 2013 at 3:17 PM, Stephen Warren <[email protected]> wrote:
>> Anyway, I've just pushed a splitup of that commit (carved in 3 pieces)
>> into vfs.git#pipe-splitup; could you check which part triggers that
>> hang? Should propagate in a few...
>
> It looks like "pipe: unify ->release() and ->open()" introduces the
> problem. Note that I had to add a prototype for fifo_open() before the
> structs that reference it for that commit to compile.
It sounds like Stephen has provided you the info you needed so not
doing any extra testing now, but I figured I'd chime in that I hit
problems this morning with linux-next and it appears to be the same
thing. I did a revert of 9984d7394618df9 and (plus a revert of a
handful of patches to the same file) and problems are resolved. The
failure case is really weird in that everything works well booting to
a simple /bin/bash but fails when you do more complex tasks.
Anyway: If you need some extra testing feel free to CC me.
-Doug
On Mon, Apr 08, 2013 at 04:17:17PM -0600, Stephen Warren wrote:
> On 04/08/2013 03:48 PM, Al Viro wrote:
> > On Mon, Apr 08, 2013 at 10:23:27PM +0100, Al Viro wrote:
> >> On Mon, Apr 08, 2013 at 03:52:08PM -0500, Nathan Zimmer wrote:
> >>
> >>> Further digging seems to indicate 9984d7394618df9, the one right
> >>> after the commit I previously identified.
> >>> Not sure what I did wrong with my bisect to put it off by one.
> >>
> >> Ugh... Can't reproduce here ;-/ Could you give more details on your
> >> setup?
> >
> > Anyway, I've just pushed a splitup of that commit (carved in 3 pieces)
> > into vfs.git#pipe-splitup; could you check which part triggers that
> > hang? Should propagate in a few...
>
> It looks like "pipe: unify ->release() and ->open()" introduces the
> problem. Note that I had to add a prototype for fifo_open() before the
> structs that reference it for that commit to compile.
Very interesting... Just in case, could you try this on top of that
branch and see if it triggers?
diff --git a/fs/pipe.c b/fs/pipe.c
index 8ce279b..b6cd51b 100644
--- a/fs/pipe.c
+++ b/fs/pipe.c
@@ -1022,6 +1022,11 @@ static int fifo_open(struct inode *inode, struct file *filp)
/* We can only do regular read/write on fifos */
filp->f_mode &= (FMODE_READ | FMODE_WRITE);
+ if (inode->i_sb->s_magic == PIPEFS_MAGIC) {
+ WARN_ON(filp->f_flags & O_NONBLOCK);
+ filp->f_flags &= ~O_NONBLOCK;
+ }
+
switch (filp->f_mode) {
case FMODE_READ:
/*
On Mon, Apr 08, 2013 at 11:46:37PM +0100, Al Viro wrote:
> Very interesting... Just in case, could you try this on top of that
> branch and see if it triggers?
>
> diff --git a/fs/pipe.c b/fs/pipe.c
> index 8ce279b..b6cd51b 100644
> --- a/fs/pipe.c
> +++ b/fs/pipe.c
> @@ -1022,6 +1022,11 @@ static int fifo_open(struct inode *inode, struct file *filp)
> /* We can only do regular read/write on fifos */
> filp->f_mode &= (FMODE_READ | FMODE_WRITE);
>
> + if (inode->i_sb->s_magic == PIPEFS_MAGIC) {
> + WARN_ON(filp->f_flags & O_NONBLOCK);
> + filp->f_flags &= ~O_NONBLOCK;
> + }
*d'oh*
OK, I'm an idiot. I see what's going on now... Give me a few and I'll push
a fix.
On Mon, Apr 08, 2013 at 03:45:31PM -0700, Doug Anderson wrote:
> Hi,
>
> On Mon, Apr 8, 2013 at 3:17 PM, Stephen Warren <[email protected]> wrote:
> >> Anyway, I've just pushed a splitup of that commit (carved in 3 pieces)
> >> into vfs.git#pipe-splitup; could you check which part triggers that
> >> hang? Should propagate in a few...
> >
> > It looks like "pipe: unify ->release() and ->open()" introduces the
> > problem. Note that I had to add a prototype for fifo_open() before the
> > structs that reference it for that commit to compile.
>
> It sounds like Stephen has provided you the info you needed so not
> doing any extra testing now, but I figured I'd chime in that I hit
> problems this morning with linux-next and it appears to be the same
> thing. I did a revert of 9984d7394618df9 and (plus a revert of a
> handful of patches to the same file) and problems are resolved. The
> failure case is really weird in that everything works well booting to
> a simple /bin/bash but fails when you do more complex tasks.
>
> Anyway: If you need some extra testing feel free to CC me.
Folks, see if vfs.git#experimental works for you; the PITA had apparently
been caused by change of open() semantics for /proc/<pid>/fd/<some_pipe> -
it started to behave like a FIFO, i.e. wait for peer to show up. Normally
that's not a problem, but if you have closed e.g. the write end of a pipe
and try to open /proc/<pid>/fd/<read_end_of_pipe>, you'll get open() waiting
for writers to appear. Which isn't what we used to do here (open succeeded
immediately) and apparently that was enough to trip drakut.
Branch head should be at 574179469f7370aadb9cbac1ceca7c3723c17bee.
On 04/08/2013 05:06 PM, Al Viro wrote:
> On Mon, Apr 08, 2013 at 03:45:31PM -0700, Doug Anderson wrote:
>> Hi,
>>
>> On Mon, Apr 8, 2013 at 3:17 PM, Stephen Warren <[email protected]> wrote:
>>>> Anyway, I've just pushed a splitup of that commit (carved in 3 pieces)
>>>> into vfs.git#pipe-splitup; could you check which part triggers that
>>>> hang? Should propagate in a few...
>>>
>>> It looks like "pipe: unify ->release() and ->open()" introduces the
>>> problem. Note that I had to add a prototype for fifo_open() before the
>>> structs that reference it for that commit to compile.
>>
>> It sounds like Stephen has provided you the info you needed so not
>> doing any extra testing now, but I figured I'd chime in that I hit
>> problems this morning with linux-next and it appears to be the same
>> thing. I did a revert of 9984d7394618df9 and (plus a revert of a
>> handful of patches to the same file) and problems are resolved. The
>> failure case is really weird in that everything works well booting to
>> a simple /bin/bash but fails when you do more complex tasks.
>>
>> Anyway: If you need some extra testing feel free to CC me.
>
> Folks, see if vfs.git#experimental works for you; the PITA had apparently
> been caused by change of open() semantics for /proc/<pid>/fd/<some_pipe> -
> it started to behave like a FIFO, i.e. wait for peer to show up. Normally
> that's not a problem, but if you have closed e.g. the write end of a pipe
> and try to open /proc/<pid>/fd/<read_end_of_pipe>, you'll get open() waiting
> for writers to appear. Which isn't what we used to do here (open succeeded
> immediately) and apparently that was enough to trip drakut.
>
> Branch head should be at 574179469f7370aadb9cbac1ceca7c3723c17bee.
That branch works.
Thanks!
Al,
On Mon, Apr 8, 2013 at 4:06 PM, Al Viro <[email protected]> wrote:
> Folks, see if vfs.git#experimental works for you; the PITA had apparently
> been caused by change of open() semantics for /proc/<pid>/fd/<some_pipe> -
> it started to behave like a FIFO, i.e. wait for peer to show up. Normally
> that's not a problem, but if you have closed e.g. the write end of a pipe
> and try to open /proc/<pid>/fd/<read_end_of_pipe>, you'll get open() waiting
> for writers to appear. Which isn't what we used to do here (open succeeded
> immediately) and apparently that was enough to trip drakut.
>
> Branch head should be at 574179469f7370aadb9cbac1ceca7c3723c17bee.
That branch booted fine for me and didn't show any problems.
I wasn't easily able to merge onto linux-next and test there though.
I tried applying these the 4 top commits of your branch to
"next-20130408" and it didn't solve my problems. A full merge of your
branch to linux-next showed conflicts and I didn't dig.
-Doug
On 04/08/2013 06:46 PM, Doug Anderson wrote:
> Al,
>
> On Mon, Apr 8, 2013 at 4:06 PM, Al Viro <[email protected]> wrote:
>> Folks, see if vfs.git#experimental works for you; the PITA had apparently
>> been caused by change of open() semantics for /proc/<pid>/fd/<some_pipe> -
>> it started to behave like a FIFO, i.e. wait for peer to show up. Normally
>> that's not a problem, but if you have closed e.g. the write end of a pipe
>> and try to open /proc/<pid>/fd/<read_end_of_pipe>, you'll get open() waiting
>> for writers to appear. Which isn't what we used to do here (open succeeded
>> immediately) and apparently that was enough to trip drakut.
>>
>> Branch head should be at 574179469f7370aadb9cbac1ceca7c3723c17bee.
> That branch booted fine for me and didn't show any problems.
>
> I wasn't easily able to merge onto linux-next and test there though.
> I tried applying these the 4 top commits of your branch to
> "next-20130408" and it didn't solve my problems. A full merge of your
> branch to linux-next showed conflicts and I didn't dig.
>
> -Doug
I booted great for me too.
Also the numbers from the scaling test are improved also, at least on my
128p box.
I'll verify on a 512 or larger box when I have the chance.
However I think I still should to provide some relief for the older kernels.
The issue is fixed with Linux-Next (next-20130410).
- Sedat -
[1] http://marc.info/?l=linux-next&m=136559043516192&w=2