Hello,
Since v4.11-rc7, I have seen the workqueue stop on my development/test system.
Git-bisecting tells me the suspicious commit is
c053b5a 2017-04-11 drm/i915: Don't call synchronize_rcu_expedited under struct_mutex
I am not sure whether this is the real cause of my problem, but I
have a question.
With that commit, the shrinker handlers ->scan_objects() and
->count_objects() both call synchronize_rcu_expedited()
unconditionally. Is that legal RCU behaviour?
I know dev->struct_mutex is unlocked now, but before the commit, these
two handlers were not calling synchronize_rcu_expedited().
J. R. Okajima
On Sun, Apr 30, 2017 at 03:07:58PM +0900, J. R. Okajima wrote:
> Hello,
>
> Since v4.11-rc7, I have seen the workqueue stop on my development/test system.
> Git-bisecting tells me the suspicious commit is
> c053b5a 2017-04-11 drm/i915: Don't call synchronize_rcu_expedited under struct_mutex
>
> I am not sure whether this is the real cause of my problem, but I
> have a question.
> With that commit, the shrinker handlers ->scan_objects() and
> ->count_objects() both call synchronize_rcu_expedited()
> unconditionally. Is that legal RCU behaviour?
It's actually not legal, because the workqueue RCU uses is not reclaim
safe. Lockdep simply cannot notice it: RCU does not use
flush_workqueue to wait for completion, it waits for a wakeup from the
workqueue instead, and lockdep has no way to track that.
To fix it and allow both RCU and flush_workqueue (both are needed to wait
for the memory to be freed), i915 and RCU must each start using their own
private workqueue instead of sharing the system-wide one, and set
WQ_MEM_RECLAIM on that private workqueue. (And of course they should
only wait when they're not recursing on struct_mutex.)
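As a rough illustration of the private-workqueue idea above, here is a
minimal kernel-module sketch. It is not the patch that was merged, and
it is not i915 or RCU code: the names example_wq, example_free_work,
example_free_worker and "example_reclaim_wq" are hypothetical. Only
alloc_workqueue(), WQ_MEM_RECLAIM, INIT_WORK(), queue_work(),
flush_workqueue() and destroy_workqueue() are real kernel interfaces.

```c
/* Illustrative sketch only; builds as a kernel module, not in userspace. */
#include <linux/errno.h>
#include <linux/init.h>
#include <linux/module.h>
#include <linux/workqueue.h>

/* Hypothetical private workqueue, used instead of the system-wide one. */
static struct workqueue_struct *example_wq;
static struct work_struct example_free_work;

/* Hypothetical work item standing in for deferred memory freeing. */
static void example_free_worker(struct work_struct *work)
{
	/* The actual freeing would go here. */
}

static int __init example_init(void)
{
	/*
	 * WQ_MEM_RECLAIM gives the workqueue a rescuer thread, so queued
	 * work can still run when new kworkers cannot be created under
	 * memory pressure.
	 */
	example_wq = alloc_workqueue("example_reclaim_wq", WQ_MEM_RECLAIM, 0);
	if (!example_wq)
		return -ENOMEM;

	INIT_WORK(&example_free_work, example_free_worker);
	queue_work(example_wq, &example_free_work);

	/* Wait for everything queued so far on the private workqueue. */
	flush_workqueue(example_wq);
	return 0;
}

static void __exit example_exit(void)
{
	/* Drains remaining work and frees the workqueue. */
	destroy_workqueue(example_wq);
}

module_init(example_init);
module_exit(example_exit);
MODULE_LICENSE("GPL");
```

The point of the sketch is only the WQ_MEM_RECLAIM flag on a workqueue
that is not shared with unrelated users; where the flush would be safe
to call in a real driver depends on the locking discussed above.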
The alternative is to drop it all and behave as in the mutex
recursion case. However, reclaim cannot possibly throttle on the memory
being freed externally unless RCU is changed to stop using the system
workqueue, so the idea that the throttling can be offloaded to the
reclaim code doesn't move the needle in terms of being able to
throttle. Perhaps no throttling is necessary at all, and we can just
accept the inaccuracy and it'll work fine.
> I know dev->struct_mutex is unlocked now, but before the commit, these
> two handlers were not calling synchronize_rcu_expedited().
Yes, I already reported this. My original fix was way more efficient
(and also safer, considering the above) than what landed upstream. My
feedback was ignored, though.
https://lists.freedesktop.org/archives/intel-gfx/2017-April/125414.html
Thanks,
Andrea
Thanx for the reply.
Andrea Arcangeli:
> Yes, I already reported this. My original fix was way more efficient
> (and also safer, considering the above) than what landed upstream. My
> feedback was ignored, though.
>
> https://lists.freedesktop.org/archives/intel-gfx/2017-April/125414.html
I see.
Actually on my test system for v4.11-rc8, kthreadd, kworker, kswapd and
others all stopped working due to the synchronize_rcu_expedited call
from i915_gem_shrinker_count. It is definitely a show stopper for me as
an i915 user.
You posted that a few weeks ago. It is a pity the fix was not
merged before v4.11 comes out. I know v4.11 will appear soon, so I'd like
to ask the i915 developers: would you test Andrea Arcangeli's fix and
release it in v4.11.1 as soon as possible?
J. R. Okajima
Joonas Lahtinen:
> Don't worry, it's not lost. It was merged to drm-intel-fixes and thus is in the pipeline.
>
> There were some unexpected delays getting fixes in, sorry for that.
Thanx, I got linux-v4.12-rc4 and it contains
4681ee2 2017-05-18 drm/i915: Do not sync RCU during shrinking
How about the v4.11.x series?
I got v4.11.5, but it doesn't contain the fix.
Do you have a plan?
J. R. Okajima
On Thu, 15 Jun 2017, "J. R. Okajima" wrote:
> Thanx, I got linux-v4.12-rc4 and it contains
> 4681ee2 2017-05-18 drm/i915: Do not sync RCU during shrinking
>
> How about the v4.11.x series?
> I got v4.11.5, but it doesn't contain the fix.
> Do you have a plan?
The upstream commit has the proper Cc: stable and Fixes: tags in place;
it just takes a while for the patches to trickle down to stable kernels.
BR,
Jani.
--
Jani Nikula, Intel Open Source Technology Center
Jani Nikula:
> On Thu, 15 Jun 2017, "J. R. Okajima" wrote:
> > How about the v4.11.x series?
> > I got v4.11.5, but it doesn't contain the fix.
> > Do you have a plan?
>
> The upstream commit has the proper Cc: stable and Fixes: tags in place;
> it just takes a while for the patches to trickle down to stable kernels.
Now I can see the fix exists in v4.11.7.
Thank you.
J. R. Okajima