From: Jani Nikula <jani.nikula@linux.intel.com>
To: Lyude <cpaul@redhat.com>, intel-gfx@lists.freedesktop.org,
        dri-devel@lists.freedesktop.org
Cc: arthur.j.runyan@intel.com,
        Ville =?utf-8?B?U3lyasOkbMOk?= <ville.syrjala@linux.intel.com>,
        Lyude <cpaul@redhat.com>, David Airlie <airlied@linux.ie>,
        open list <linux-kernel@vger.kernel.org>
Subject: Re: [PATCH v3 1/2 RESEND] drm/dp_helper: add workarounds from intel_dp_dpcd_read_wake()
In-Reply-To: <1458761593-13561-1-git-send-email-cpaul@redhat.com>
Organization: Intel Finland Oy - BIC 0357606-4 - Westendinkatu 7, 02160 Espoo
References: <1458761417-12165-1-git-send-email-cpaul@redhat.com> <1458761593-13561-1-git-send-email-cpaul@redhat.com>
User-Agent: Notmuch/0.21+80~g3ff6f8b (http://notmuchmail.org) Emacs/24.4.1 (x86_64-pc-linux-gnu)
Date: Thu, 24 Mar 2016 11:37:35 +0200
Message-ID: <87mvpodxnk.fsf@intel.com>
MIME-Version: 1.0
Content-Type: text/plain
Sender: linux-kernel-owner@vger.kernel.org
Content-Length: 8686
Lines: 236

On Wed, 23 Mar 2016, Lyude <cpaul@redhat.com> wrote:
> Some sinks need some time during the process of resuming the system from
> sleep before they're ready to handle transactions. While it would be
> nice if they responded with NACKs in these scenarios, this isn't always
> the case as a few sinks will just timeout on all of the transactions
> they receive.
>
> This patch was originally intended to be a workaround for a very
> mysterious bug on the T560, where any monitors connected to the dock
> would fail to turn back on after resume. When resuming the laptop, it
> appears that there's a short period of time where we're unable to
> complete any aux transactions, as they all immediately timeout. The only
> machine I'm able to reproduce this on is the T560 as other production
> Skylake models seem to be fine. The period during which AUX transactions
> fail appears to be around 22ms long. AFAIK, the dock for the T560 never
> actually turns off, the only difference is that it's in SST mode at the
> start of the resume process, so it's unclear as to why it would need so
> much time to come back up.
>
> There's been a discussion on this issue going on for a while on the
> intel-gfx mailing list about this that has, in addition to including
> developers from Intel, also had the correspondence of one of the
> hardware engineers for Intel:
>
> http://www.spinics.net/lists/intel-gfx/msg88831.html
> http://www.spinics.net/lists/intel-gfx/msg88410.html
>
> We've already looked into a couple of possible explanations for the
> problem:
>
> - Calling intel_dp_mst_resume() before right fix.
>   intel_runtime_pm_enable_interrupts(). This was the first fix I tried,
>   and while it worked it definitely wasn't the right fix. This worked
>   because DP aux transactions don't actually require interrupts to work:
>
> 	static uint32_t
> 	intel_dp_aux_wait_done(struct intel_dp *intel_dp, bool has_aux_irq)
> 	{
> 		struct intel_digital_port *intel_dig_port = dp_to_dig_port(intel_dp);
> 		struct drm_device *dev = intel_dig_port->base.base.dev;
> 		struct drm_i915_private *dev_priv = dev->dev_private;
> 		i915_reg_t ch_ctl = intel_dp->aux_ch_ctl_reg;
> 		uint32_t status;
> 		bool done;
>
> 	#define C (((status = I915_READ_NOTRACE(ch_ctl)) & DP_AUX_CH_CTL_SEND_BUSY) == 0)
> 		if (has_aux_irq)
> 			done = wait_event_timeout(dev_priv->gmbus_wait_queue, C,
> 						  msecs_to_jiffies_timeout(10));
> 		else
> 			done = wait_for_atomic(C, 10) == 0;
> 		if (!done)
> 			DRM_ERROR("dp aux hw did not signal timeout (has irq: %i)!\n",
> 				  has_aux_irq);
> 	#undef C
>
> 		return status;
> 	}
>
>   When there's no interrupts enabled, we end up timing out on the
>   wait_event_timeout() call, which causes us to check the DP status
>   register once to see if the transaction was successful or not. Since
>   this adds a 10ms delay to each aux transaction, it ends up adding a
>   long enough delay to the resume process for aux transactions to become
>   functional again. This gave us the illusion that enabling interrupts
>   had something to do with making things work again, and put me on the
>   wrong track for a while.
>
> - Interrupts occurring when we try to perform the aux transactions
>   required to put the dock back into MST mode. This isn't the problem,
>   as the only interrupts I've observed that come during this timeout
>   period are from the snd_hda_intel driver, and disabling that driver
>   doesn't appear to change the behavior at all.
>
> - Skylake's PSR block causing issues by performing aux transactions
>   while we try to bring the dock out of MST mode. Disabling PSR through
>   i915's command line options doesn't seem to change the behavior
>   either, nor does preventing the DMC firmware from being loaded.
>
> Since this investigation went on for about 2 weeks, we decided it would
> be better for the time being to just workaround this issue until we find
> a better fix. This patch adds some behavior we want in
> drm_dp_dpcd_access() anyway, since DP aux transactions aren't exactly
> robust and this will probably fix quite a few other issues with DP MST
> hardware not responding in time. Plus, this is something we already do
> in the i915 driver with intel_dp_dpcd_read_wake(), except that that
> function is somewhat of a hack and DRM helpers can't make use of it.
>
> 			    Changes since v2
> - Reworked the patch again to incorporate all of the behavior of
>   intel_dp_dpcd_read_wake() into drm_dp_dpcd_read() and
>   drm_dp_dpcd_access()
>
> 			    Changes since v1
>
> - Patch has been reworked to take the retry logic out of
>   intel_dp_mst_resume() and into drm_dp_dpcd_access(), based off a
>   suggestion from Daniel Vetter
>
> - Commit message is much longer and gives a better description of the
>   issue this was originally intended to workaround.
>
> Signed-off-by: Lyude <cpaul@redhat.com>

We've had a lot of issues with dp aux in the past, and we've slowly and
iteratively improved over time. In that light, I am strongly opposed to
doing as many changes in one go as this patch does.

If we got a bisect result on this, it would be hard to decide what it
really was. We might have to revert it all even if some parts of it were
good.

I'll identify some of the changes below that I think must be separated
to distinct changes. (The numbers don't necessarily reflect the order in
which the changes should be done.)

> ---
> Left some rebase ditritus in the patch after sending it by accident, whoops
>
>  drivers/gpu/drm/drm_dp_helper.c | 47 +++++++++++++++++++++--------------------
>  1 file changed, 24 insertions(+), 23 deletions(-)
>
> diff --git a/drivers/gpu/drm/drm_dp_helper.c b/drivers/gpu/drm/drm_dp_helper.c
> index 9535c5b..9b3426c 100644
> --- a/drivers/gpu/drm/drm_dp_helper.c
> +++ b/drivers/gpu/drm/drm_dp_helper.c
> @@ -159,7 +159,7 @@ int drm_dp_bw_code_to_link_rate(u8 link_bw)
>  }
>  EXPORT_SYMBOL(drm_dp_bw_code_to_link_rate);
>  
> -#define AUX_RETRY_INTERVAL 500 /* us */
> +#define AUX_RETRY_INTERVAL 1000 /* us */

Change #1. Increasing the retry interval.

>  
>  /**
>   * DOC: dp helpers
> @@ -177,8 +177,8 @@ static int drm_dp_dpcd_access(struct drm_dp_aux *aux, u8 request,
>  			      unsigned int offset, void *buffer, size_t size)
>  {
>  	struct drm_dp_aux_msg msg;
> -	unsigned int retry;
> -	int err;
> +	unsigned int retry, native_reply;
> +	int ret = 0;
>  
>  	memset(&msg, 0, sizeof(msg));
>  	msg.address = offset;
> @@ -193,35 +193,29 @@ static int drm_dp_dpcd_access(struct drm_dp_aux *aux, u8 request,
>  	 * sufficient, bump to 32 which makes Dell 4k monitors happier.
>  	 */
>  	for (retry = 0; retry < 32; retry++) {
> +		if (ret != 0 && ret != -ETIMEDOUT) {
> +			usleep_range(AUX_RETRY_INTERVAL,
> +				     AUX_RETRY_INTERVAL + 100);

Change #2. Changing the conditions on when to sleep between retries.

> +		}
>  
>  		mutex_lock(&aux->hw_mutex);
> -		err = aux->transfer(aux, &msg);
> +		ret = aux->transfer(aux, &msg);
>  		mutex_unlock(&aux->hw_mutex);
> -		if (err < 0) {
> -			if (err == -EBUSY)
> -				continue;

Change #3a. Changing which errors lead to retries.

> -
> -			return err;
> -		}
> -
> -
> -		switch (msg.reply & DP_AUX_NATIVE_REPLY_MASK) {
> -		case DP_AUX_NATIVE_REPLY_ACK:
> -			if (err < size)
> -				return -EPROTO;
> -			return err;
>  
> -		case DP_AUX_NATIVE_REPLY_NACK:
> -			return -EIO;

Change #3b. Changing which errors lead to retries.

> +		if (ret > 0) {
> +			native_reply = msg.reply & DP_AUX_NATIVE_REPLY_MASK;
> +			if (native_reply == DP_AUX_NATIVE_REPLY_ACK) {
> +				if (ret == size)
> +					return ret;
>  
> -		case DP_AUX_NATIVE_REPLY_DEFER:
> -			usleep_range(AUX_RETRY_INTERVAL, AUX_RETRY_INTERVAL + 100);
> -			break;
> +				ret = -EPROTO;
> +			} else
> +				ret = -EIO;
>  		}
>  	}
>  
>  	DRM_DEBUG_KMS("too many retries, giving up\n");
> -	return -EIO;
> +	return ret;

Change #4. Propagating errors from the last retry. (The error from the
last retry might not even be representative of what really went down.)

>  }
>  
>  /**
> @@ -241,6 +235,13 @@ static int drm_dp_dpcd_access(struct drm_dp_aux *aux, u8 request,
>  ssize_t drm_dp_dpcd_read(struct drm_dp_aux *aux, unsigned int offset,
>  			 void *buffer, size_t size)
>  {
> +	/*
> +	 * Sometimes we just get the same incorrect byte repeated over the
> +	 * entire buffer. Doing one throw away read initially seems to "solve"
> +	 * it.
> +	 */
> +	drm_dp_dpcd_access(aux, DP_AUX_NATIVE_READ, DP_DPCD_REV, buffer, 1);
> +

Change #5. Doing a dummy read to ensure the sink is awake.


BR,
Jani.

>  	return drm_dp_dpcd_access(aux, DP_AUX_NATIVE_READ, offset, buffer,
>  				  size);
>  }

-- 
Jani Nikula, Intel Open Source Technology Center