Date: Thu, 20 Oct 2016 16:35:30 +0200
From: Takashi Iwai
To: Ville Syrjälä
Cc: dri-devel@lists.freedesktop.org, Daniel Vetter, linux-kernel@vger.kernel.org
Subject: Re: [PATCH] drm/fb-helper: Fix race between deferred_io worker and dirty updater
In-Reply-To: <20161020141725.GU4329@intel.com>
References: <20161020132055.9646-1-tiwai@suse.de> <20161020132814.GT4329@intel.com> <20161020141725.GU4329@intel.com>

On Thu, 20 Oct 2016 16:17:25 +0200,
Ville Syrjälä wrote:
> 
> On Thu, Oct 20, 2016 at 03:36:54PM +0200, Takashi Iwai wrote:
> > On Thu, 20 Oct 2016 15:28:14 +0200,
> > Ville Syrjälä wrote:
> > > 
> > > On Thu, Oct 20, 2016 at 03:20:55PM +0200, Takashi Iwai wrote:
> > > > Since the 4.7 kernel, we've seen error messages like
> > > > 
> > > >   kernel: [TTM] Buffer eviction failed
> > > >   kernel: qxl 0000:00:02.0: object_init failed for (4026540032, 0x00000001)
> > > >   kernel: [drm:qxl_alloc_bo_reserved [qxl]] *ERROR* failed to allocate VRAM BO
> > > > 
> > > > on QXL when switching to and accessing a VT.  The culprit is the
> > > > generic deferred_io code (the qxl driver switched to it in 4.7).
> > > > There is a race between the dirty clip update and the invocation
> > > > of the dirty callback.
> > > > 
> > > > In drm_fb_helper_dirty(), the dirty clip is updated inside the
> > > > spinlock, while the update worker is kicked off outside the
> > > > spinlock.  Meanwhile the update worker clears the dirty clip inside
> > > > the spinlock, too.  Thus, when drm_fb_helper_dirty() is called
> > > > concurrently, schedule_work() may be called after the clip has
> > > > already been cleared by the first worker invocation.
> > > 
> > > Why does that matter? The first worker should have done all the
> > > necessary work already, no?
> > 
> > Before the first callback invocation, the worker clears the clip and
> > passes the copied clip to the callback.  The second invocation then
> > runs with the cleared, untouched clip, i.e. with x1=~0.  This
> > confuses qxl_framebuffer_dirty().
> > 
> > Of course, we could filter this out on the callback side by checking
> > the clip.  That was actually my first version.  But it's fundamentally
> > a race and should rather be handled on the caller side.
> 
> The race is still there AFAICS. The worker may already be executing but
> not yet in the critical section, at which point drm_fb_helper_dirty()
> will expand the dirty rectangle and schedule another work. So the first
> worker will already see the expanded rectangle, and the second worker
> will get zilch.

Hrm, right, there's a slight race window there.

> I think the only good fix is to have the worker validate the dirty
> rectangle before calling the driver.

OK, let me cook it up quickly.  (It was actually the second version of
the patch I wrote, and I sent the third one :)


thanks,

Takashi
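
P.S. What I have in mind is roughly the sketch below: copy and clear the
clip under dirty_lock in the worker, then skip the driver callback when
the copy is still empty.  This is only an illustration of the idea
against the helper's worker (drm_fb_helper_dirty_work() in
drm_fb_helper.c), not the final patch:

static void drm_fb_helper_dirty_work(struct work_struct *work)
{
	struct drm_fb_helper *helper = container_of(work, struct drm_fb_helper,
						    dirty_work);
	struct drm_clip_rect *clip = &helper->dirty_clip;
	struct drm_clip_rect clip_copy;
	unsigned long flags;

	spin_lock_irqsave(&helper->dirty_lock, flags);
	clip_copy = *clip;
	clip->x1 = clip->y1 = ~0;
	clip->x2 = clip->y2 = 0;
	spin_unlock_irqrestore(&helper->dirty_lock, flags);

	/* Call the dirty callback only if the clip has really been touched;
	 * a concurrent drm_fb_helper_dirty() may have scheduled us again
	 * after the previous run already consumed the rectangle. */
	if (clip_copy.x1 < clip_copy.x2 && clip_copy.y1 < clip_copy.y2)
		helper->fb->funcs->dirty(helper->fb, NULL, 0, 0,
					 &clip_copy, 1);
}

The producer side in drm_fb_helper_dirty() keeps merging rectangles into
helper->dirty_clip as before; only the consumer side changes.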