A133 冻屏问题

android 10 内核死锁

Posted by LXG on August 5, 2023

冻屏日志-1

已经关闭app硬件加速的情况

【FAQ1575】 RCU stall


[  476.654262] INFO: rcu_preempt detected stalls on CPUs/tasks:
[  476.660702]  3-...: (1 GPs behind) idle=e89/140000000000000/0 softirq=20754/20755 fqs=1033
[  476.670119]  (detected by 2, t=7807 jiffies, g=8105, c=8104, q=1263)
[  476.677305] Task dump for CPU 3:
[  476.680949] kworker/u8:0    R  running task        0     5      2 0x00000012
[  476.690048] Workqueue: pvr_misr MISRWrapper [pvrsrvkm]
[  476.695865] Call trace:
[  476.698647] [<ffffff8008086dc4>] __switch_to+0x9c/0xf0
[  476.705505] [<ffffff8000d3ae40>] MISRWrapper+0x20/0x30 [pvrsrvkm]
[  476.712389] [<ffffff80080d50f0>] process_one_work+0x218/0x7c8
[  476.718874] [<ffffff80080d57d8>] worker_thread+0x138/0x4f8
[  476.725068] [<ffffff80080dd06c>] kthread+0xf4/0x108
[  476.730575] [<ffffff8008083d10>] ret_from_fork+0x10/0x40
[  476.736581] rcu_preempt kthread starved for 4489 jiffies! g8105 c8104 f0x0 RCU_GP_WAIT_FQS(3) ->state=0x1


[89052.228407] NMI watchdog: BUG: soft lockup - CPU#0 stuck for 23s! [kworker/u8:2:49]
[89052.237049] Modules linked in: gt9xxnew_ts aic8800_fdrv pvrsrvkm(O) aic8800_btlpm aic8800_bsp uvcvideo videobuf2_vmallvin_v4l2 gc2053_mipi vin_io videobuf2_v4l2 videobuf2_dma_contig videobuf2_memops videobuf2_core
[89052.258891] irq event stamp: 10
[89052.262453] hardirqs last  enabled at (9): [<ffffff8008c24530>] _raw_spin_unlock_irq+0x38/0x78
[89052.272160] hardirqs last disabled at (10): [<ffffff8008c1d7a4>] __schedule+0xbc/0xba8
[89052.281088] softirqs last  enabled at (0): [<ffffff80080a8c00>] copy_process.isra.9.part.10+0x2d0/0x16a0
[89052.291762] softirqs last disabled at (0): [<          (null)>]           (null)
[89052.300086]
[89052.301779] CPU: 0 PID: 49 Comm: kworker/u8:2 Tainted: G        W  O L  4.9.170 #1
[89052.310299] Hardware name: sun50iw10 (DT)
[89052.314828] Workqueue: events_unbound call_usermodehelper_exec_work
[89052.321898] task: ffffffc03d2f0000 task.stack: ffffffc03d2ec000
[89052.328569] PC is at smp_call_function_many+0x290/0x2f8
[89052.334458] LR is at smp_call_function_many+0x248/0x2f8

日志所在源码-1

kernel/rcu/tree.c


static void print_other_cpu_stall(struct rcu_state *rsp, unsigned long gpnum)
{
        // ---------------------------------------------------------------------------

	/*
	 * OK, time to rat on our buddy...
	 * See Documentation/RCU/stallwarn.txt for info on how to debug
	 * RCU CPU stall warnings.
	 */
	pr_err("INFO: %s detected stalls on CPUs/tasks:",
	       rsp->name);
	print_cpu_stall_info_begin();

	rcu_for_each_leaf_node(rsp, rnp) {
		raw_spin_lock_irqsave_rcu_node(rnp, flags);
		ndetected += rcu_print_task_stall(rnp);
		if (rnp->qsmask != 0) {
			for_each_leaf_node_possible_cpu(rnp, cpu)
				if (rnp->qsmask & leaf_node_cpu_bit(rnp, cpu)) {
					print_cpu_stall_info(rsp, cpu);
					ndetected++;
				}
		}
		raw_spin_unlock_irqrestore_rcu_node(rnp, flags);
	}


	print_cpu_stall_info_end();

	// ----------------------------------------------------------------------------

	pr_cont("(detected by %d, t=%ld jiffies, g=%ld, c=%ld, q=%lu)\n",
	       smp_processor_id(), (long)(jiffies - rsp->gp_start),
	       (long)rsp->gpnum, (long)rsp->completed, totqlen);

        // ----------------------------------------------------------------------------
}

./kernel/rcu/tree_plugin.h


static void print_cpu_stall_info(struct rcu_state *rsp, int cpu)
{
	char fast_no_hz[72];
	struct rcu_data *rdp = per_cpu_ptr(rsp->rda, cpu);
	struct rcu_dynticks *rdtp = rdp->dynticks;
	char *ticks_title;
	unsigned long ticks_value;

	/*
	 * We could be printing a lot while holding a spinlock.  Avoid
	 * triggering hard lockup.
	 */
	touch_nmi_watchdog();

	if (rsp->gpnum == rdp->gpnum) {
		ticks_title = "ticks this GP";
		ticks_value = rdp->ticks_this_gp;
	} else {
		ticks_title = "GPs behind";
		ticks_value = rsp->gpnum - rdp->gpnum;
	}
	print_cpu_stall_fast_no_hz(fast_no_hz, cpu);
	pr_err("\t%d-%c%c%c: (%lu %s) idle=%03x/%llx/%d softirq=%u/%u fqs=%ld %s\n",
	       cpu,
	       "O."[!!cpu_online(cpu)],
	       "o."[!!(rdp->grpmask & rdp->mynode->qsmaskinit)],
	       "N."[!!(rdp->grpmask & rdp->mynode->qsmaskinitnext)],
	       ticks_value, ticks_title,
	       atomic_read(&rdtp->dynticks) & 0xfff,
	       rdtp->dynticks_nesting, rdtp->dynticks_nmi_nesting,
	       rdp->softirq_snap, kstat_softirqs_cpu(RCU_SOFTIRQ, cpu),
	       READ_ONCE(rsp->n_force_qs) - rsp->n_force_qs_gpstart,
	       fast_no_hz);
}

冻屏日志-2

未关闭app硬件加速情况


08-05 13:32:35.559544[14306.902019] PVR_K:(Error):  1927: DevicesWatchdogThread: Device status not OK!!!
08-05 13:32:35.559914[14306.910454] PVR_K:  1927: ------------[ PVR DBG: START (High) ]------------
08-05 13:32:35.559914[14306.910454] PVR_K:  1927: ------------[ PVR DBG: START (High) ]------------
08-05 13:32:35.560079[14306.918380] PVR_K:  1927: OS kernel info: Linux 4.9.170 #1 SMP PREEMPT Sun Jun 25 18:01:19 CST 2023 aarch64
08-05 13:32:35.580870[14306.929453] PVR_K:  1927: DDK info: Rogue_DDK_Android rogueddk 1.11@5516664 (release) sunxi_android
08-05 13:32:35.586504[14306.939946] PVR_K:  1927: Time now: 14306939932us
08-05 13:32:35.586698[14306.945344] PVR_K:  1927: Services State: OK
08-05 13:32:35.607212[14306.950226] PVR_K:  1927: Active connects: 1774 (composer@2.2-se), 1773 (android.hardwar), 15334 (surfaceflinger), 15412 (android.ui), 15635 (RenderThread),
08-05 13:32:35.607390[14306.966108] PVR_K:  1927:                  16319 (locker_pleasure), 15771 (launcher-loader)
08-05 13:32:35.622563[14306.975642] PVR_K:  1927: ------[ Driver Info ]------
08-05 13:32:35.622752[14306.981410] PVR_K:  1927: UM info: 1.11 @  5516664 (release) build options: 0x80002010
08-05 13:32:35.640662[14306.990444] PVR_K:  1927: KM info: 1.11 @  5516664 (release) build options: 0x00002010
08-05 13:32:35.640852[14306.999454] PVR_K:  1927: FW info: 1.11 @  5516664 (release) build options: 0x80002010
08-05 13:32:35.656730[14307.008478] PVR_K:  1927: Comparison of UM/KM components: MATCHING
08-05 13:32:35.656915[14307.015520] PVR_K:  1927: KM Arch: 64 Bit
08-05 13:32:35.667111[14307.020179] PVR_K:  1927: UM Connected Clients: 32 Bit
08-05 13:32:35.667301[14307.025983] PVR_K:  1927: ------[ RGX Summary ]------
08-05 13:32:35.678057[14307.031664] PVR_K:  1927: RGX BVNC: 22.102.54.38
08-05 13:32:35.678178[14307.036868] PVR_K:  1927: RGX Device State: Active
08-05 13:32:35.688539[14307.042427] PVR_K:  1927: RGX Power State: ON
08-05 13:32:35.688667[14307.047350] PVR_K:  1927: BIF0 - OK
08-05 13:32:35.702670[14307.051291] PVR_K:  1927: RGX FW State: NOT RESPONDING - KCCB stalled (HWRState 0x00000001: HWR OK;)
08-05 13:32:35.702761[14307.061542] PVR_K:  1927: RGX FW Power State: RGXFWIF_POW_ON (APM disabled: 0 ok, 0 denied, 0 non-idle, 0 retry, 0 other, 0 total. Latency: 100 ms)
08-05 13:32:35.729212[14307.076345] PVR_K:  1927: RGX DVFS: 0 frequency changes. Current frequency: 504.000 MHz (sampled at 14305382022176)
08-05 13:32:35.729327[14307.088021] PVR_K:  1927: RGX FW OS 0 - State: active; Freelists: Ok
08-05 13:32:35.743957[14307.095164] PVR_K:  1927: T0 polling set (reg:0x00000130 mask:0x00000C00)
08-05 13:32:35.744056[14307.102780] PVR_K:  1927: ------[ RGX registers ]------

// 下边一行是Android log报错
08-05 13:32:35.788503 15334 15334 E Fence   : Throttling EGL Production: fence 35 didn't signal in 3000 ms
-------------------------------------------------------------------------------------------------------------------------------------

08-05 13:32:37.773333[14309.132108] PVR_K:  1927: ------------[ PVR DBG: END ]------------
08-05 13:32:37.786987[14309.139167] ------------[ cut here ]------------
08-05 13:32:37.787134[14309.139674] WARNING: CPU: 1 PID: 1927 at /home/sy/code/sdc/a133_system_block/A133_android_10/longan/kernel/linux-4.9/modules/gpu/img-rgx/android/rogue_km/binary_sunxi_android_release/target_aarch64/kbuild/services/server/common/pvr_notifier.c:506 PVRSRVDebugRequest+0x1f4/0x2d0 [pvrsrvkm]
08-05 13:32:37.787197[14309.139681] Modules linked in: aic8800_fdrv gt9xxnew_ts pvrsrvkm(O) aic8800_btlpm aic8800_bsp uvcvideo videobuf2_vmalloc vin_v4l2 gc2053_mipi vin_io videobuf2_v4l2 videobuf2_dma_contig videobuf2_memops videobuf2_core [last unloaded: aic8800_fdrv]
08-05 13:32:37.787252[14309.139729] 
08-05 13:32:37.787311[14309.139740] CPU: 1 PID: 1927 Comm: pvr_device_wdg Tainted: G           O    4.9.170 #1
08-05 13:32:37.787363[14309.139745] Hardware name: sun50iw10 (DT)
08-05 13:32:37.787415[14309.139753] task: ffffffc027db2b80 task.stack: ffffffc03927c000
08-05 13:32:37.787466[14309.140205] PC is at PVRSRVDebugRequest+0x1f4/0x2d0 [pvrsrvkm]
08-05 13:32:37.787511[14309.140627] LR is at PVRSRVDebugRequest+0x1f4/0x2d0 [pvrsrvkm]
08-05 13:32:37.787556[14309.140635] pc : [<ffffff8000c583dc>] lr : [<ffffff8000c583dc>] pstate: 00400145
08-05 13:32:37.787607[14309.140640] sp : ffffffc03927fc40

-----------------------------------------------------------------------------------------------------------------------------------------

08-05 13:32:37.791182[14309.143512] ---[ end trace 3f76441f672a30b9 ]---
08-05 13:32:37.791219[14309.143519] Call trace:
08-05 13:32:37.791256[14309.143528] Exception stack(0xffffffc03927fa30 to 0xffffffc03927fb60)
08-05 13:32:37.791294[14309.143537] fa20:                                   ffffffc0257da6a0 0000007fffffffff
08-05 13:32:37.791338[14309.143546] fa40: 00000000415f8000 ffffff8000c583dc 0000000000400145 000000000000003d
08-05 13:32:37.791396[14309.143555] fa60: ffffffc02f03d000 ffffff800934df10 000000000036ee80 ffffff8000d22458
08-05 13:32:37.791441[14309.143564] fa80: ffffffc03927fb70 ffffffc03927fb70 ffffffc03927fb30 00000000ffffffc8
08-05 13:32:37.791485[14309.143572] faa0: ffffffc03927fb70 ffffff8000c3d76c ffffff8000d22460 0000000000000009
08-05 13:32:37.791529[14309.143581] fac0: ffffffc03927faf0 ffffff8008b1c6e0 ffffff8000d1d0e8 ffffffc03927fb38
08-05 13:32:37.791573[14309.143591] fae0: ffffff800934df10 ffffffc03927fb38 ffffffc03927fb50 ffffff8008b1c7a0
08-05 13:32:37.791617[14309.143599] fb00: ffffff8000d1d0e8 00000000000409aa 0000000000000000 0000000000000000
08-05 13:32:37.791661[14309.143608] fb20: 0000000000000000 ffffff8000d1d0ec 000000403673e000 0000000000000000
08-05 13:32:37.791705[14309.143617] fb40: ffffffc03f703090 ffffff800934df10 0000000000000000 ffffffc03927f8b0
08-05 13:32:37.791749[14309.144040] [<ffffff8000c583dc>] PVRSRVDebugRequest+0x1f4/0x2d0 [pvrsrvkm]
08-05 13:32:37.791787[14309.144463] [<ffffff8000c58860>] DevicesWatchdogThread_ForEachVaCb+0xa8/0xe8 [pvrsrvkm]
08-05 13:32:37.791826[14309.144882] [<ffffff8000c4a3dc>] List_PVRSRV_DEVICE_NODE_ForEach_va+0x7c/0xb8 [pvrsrvkm]
08-05 13:32:37.791864[14309.145304] [<ffffff8000c586e4>] DevicesWatchdogThread+0x134/0x208 [pvrsrvkm]
08-05 13:32:37.791902[14309.145719] [<ffffff8000c33e20>] OSThreadRun+0x28/0x60 [pvrsrvkm]
08-05 13:32:37.791939[14309.145734] [<ffffff80080d06cc>] kthread+0xec/0x100
08-05 13:32:37.791977[14309.145745] [<ffffff80080834f0>] ret_from_fork+0x10/0x20

日志所在源码-2

longan/kernel/linux-4.9/modules/gpu/img-rgx/linux/rogue_km/services/server/common/pvrsrv.c


static void DevicesWatchdogThread_ForEachVaCb(PVRSRV_DEVICE_NODE *psDeviceNode,
											  va_list va)
{
	PVRSRV_RGXDEV_INFO *psDevInfo = (PVRSRV_RGXDEV_INFO *) psDeviceNode->pvDevice;
	PVRSRV_DEVICE_HEALTH_STATUS *pePreviousHealthStatus, eHealthStatus;
	PVRSRV_ERROR eError;

	pePreviousHealthStatus = va_arg(va, PVRSRV_DEVICE_HEALTH_STATUS *);

	if (psDeviceNode->pfnUpdateHealthStatus != NULL)
	{
		eError = psDeviceNode->pfnUpdateHealthStatus(psDeviceNode, IMG_TRUE);
		if (eError != PVRSRV_OK)
		{
			PVR_DPF((PVR_DBG_WARNING, "DevicesWatchdogThread: "
					 "Could not check for fatal error (%d)!",
					 eError));
		}
	}
	eHealthStatus = OSAtomicRead(&psDeviceNode->eHealthStatus);

	if (eHealthStatus != PVRSRV_DEVICE_HEALTH_STATUS_OK)
	{
		if (eHealthStatus != *pePreviousHealthStatus)
		{
			if (!(psDevInfo->ui32DeviceFlags &
				  RGXKM_DEVICE_STATE_DISABLE_DW_LOGGING_EN))
			{
				PVR_DPF((PVR_DBG_ERROR, "DevicesWatchdogThread: "
						 "Device status not OK!!!"));
				PVRSRVDebugRequest(psDeviceNode, DEBUG_REQUEST_VERBOSITY_MAX,
								   NULL, NULL);
			}
		}
	}

	*pePreviousHealthStatus = eHealthStatus;
}

android log

08-05 13:32:35.788503 15334 15334 E Fence : Throttling EGL Production: fence 35 didn’t signal in 3000 ms

frameworks/native/libs/gui/BufferQueueProducer.cpp


status_t BufferQueueProducer::queueBuffer(int slot,
        const QueueBufferInput &input, QueueBufferOutput *output) {


    // Wait without lock held
    if (connectedApi == NATIVE_WINDOW_API_EGL) {
        // Waiting here allows for two full buffers to be queued but not a
        // third. In the event that frames take varying time, this makes a
        // small trade-off in favor of latency rather than throughput.
        lastQueuedFence->waitForever("Throttling EGL Production");
    }

    //-----------------------------------------------------------------------

}

frameworks/native/libs/ui/Fence.cpp


status_t Fence::waitForever(const char* logname) {
    ATRACE_CALL();
    if (mFenceFd == -1) {
        return NO_ERROR;
    }
    int warningTimeout = 3000;
    int err = sync_wait(mFenceFd, warningTimeout);
    if (err < 0 && errno == ETIME) {
        ALOGE("%s: fence %d didn't signal in %u ms", logname, mFenceFd.get(),
                warningTimeout);
        err = sync_wait(mFenceFd, TIMEOUT_NEVER);
    }
    return err < 0 ? -errno : status_t(NO_ERROR);
}

Linux 死锁

关键词:softlockup、hardlockup、hang、blocked

有如下现象之一就可以怀疑是系统死锁问题:

  • 使用串口,按键,屏幕等外设都没有反应
  • 某应用进程阻塞
  • watchdog 未喂狗

死锁问题分析

【FAQ25】 如何打开死锁调试选项

什么是死锁

死锁是指多个进程(线程)因为长久等待已被其他进程占有的的资源而陷入阻塞的一种状态。当等待的资源一直得不到释放,死锁会一直持续下去。死锁一旦发生,程序本身是解决不了的,只能依靠外部力量使得程序恢复运行,例如重启,开门狗复位等。

死锁类型

D 状态死锁 : 进程等待 I/O 资源无法得到满足,长时间(系统默认配置 120 秒)处于 TASK_UNINTERRUPTIBLE 睡眠状态,这种状态下进程不响应异步信号(包括 kill -9)。如:进程与外设硬件的交互(如 read),通常使用这种状态来保证进程与设备的交互过程不被打断,否则设备可能处于不可控的状态。对于这种死锁的检测 Linux 提供的是 hung task 机制。触发该问题成因比较复杂多样,可能因为 synchronized_irq、mutex lock、内存不足等。D 状态死锁只是局部多进程间互锁,一般来说只是 hang 机、冻屏,机器某些功能没法使用,但不会导致没喂狗,而被狗咬死。

R 状态死锁: 进程长时间(系统默认配置 60 秒)处于 TASK_RUNNING 状态垄断 CPU 而不发生切换,一般情况下是进程关抢占或关中断后长时候执行任务、死循环,此时往往会导致多 CPU 间互锁,整个系统无法正常调度,导致喂狗线程无法执行,无法喂狗而最终看门狗复位的重启。该问题多为原子操作,spinlock 等 CPU 间并发操作处理不当造成。

死锁调试

内核专门设计了hung task机制用于检查长期处于D状态的进程。hung task机制的内核配置选项如下:

CONFIG_DETECTOR_HUNG_TASK=y

CONFIG_DEFAULT_HUNG_TASK_TIMEOUT=120

需要注意的是:HUNG TASK只能检测D状态的死锁,其核心思想为创建一个内核监测进程循环监测处于D状态的每一个进程(任务),统计它们在两次检测之间的调度次数,如果发现有任务在两次监测之间没有发生任何的调度则可判断该进程一直处于D状态,很有可能已经死锁,因此触发报警日志打印,输出进程的基本信息,栈回溯以及寄存器保存信息以供内核开发人员定位。D 状态死锁只是局部多进程间互锁,一般来说只是 hang 机、冻屏,机器某些功能没法使用,但不会导致没喂狗而系统死机。

如果要检测R状态的死锁,需要配置softlockup和hardlockup,使用lockup detector机制。

CONFIG_LOCKUP_DETECTOR=y

内核的lockdep死锁检查模块可以帮助我们定位R状态具体的死锁问题。lockdep的内核配置选项如下:

a、lockdep模块使能,统计信息,以及调试

CONFIG_DEBUG_LOCK_ALLOC=y

CONFIG_PROVE_LOCKING=y

CONFIG_LOCKDEP=y

CONFIG_LOCK_STAT=y

CONFIG_DEBUG_LOCKDEP=y

b、调试mutex锁

CONFIG_DEBUG_RT_MUTEXES=y

CONFIG_DEBUG_MUTEXES=y

CONFIG_DEBUG_WW_MUTEX_SLOWPATH=y

c、调试spin lock

CONFIG_DEBUG_SPINLOCK=y

d、检查原子环境:

CONFIG_DEBUG_ATOMIC_SLEEP=y