Ryzen 9 5950X error

Incident description

dmesg reported:

<0>[790577.354149] mce: [Hardware Error]: CPU 6: Machine Check Exception: 4 Bank 1: bc800800060c0859
<0>[790577.354155] mce: [Hardware Error]: TSC 98830989a3dde ADDR 103ec1fa40 MISC d01a000000000000 IPID 100b000000000
<0>[790577.354160] mce: [Hardware Error]: PROCESSOR 2:a20f12 TIME 1755581242 SOCKET 0 APIC c microcode a201210
<0>[790577.354163] mce: [Hardware Error]: Machine check: Uncorrected unrecoverable error in kernel context
<0>[790577.354164] Kernel panic - not syncing: Fatal local machine check

Locating the error

Finding the documentation

Use lscpu to look up the CPU model. The relevant lines of its output are:

Model name:             AMD Ryzen 9 5950X 16-Core Processor
    CPU family:           25
    Model:                33

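Converted to hexadecimal, 25d = 19h and 33d = 21h, so the document to look for on the AMD website is the one for AMD Family 19h Model 21h: Processor Programming Reference (PPR) for AMD Family 19h Model 21h, Revision B0 Processors (PUB)

Reading the documentation

According to 3.2.3 [Mapping of Banks to Blocks], bank 1 in the error report is the IF block, i.e. the Instruction Fetch Unit. The meaning of each bit of the MCA::IF::MCA_STATUS_IF register is therefore listed in 3.2.5.2 [IF], which is what we need to decode the status value bc800800060c0859.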
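The decimal-to-hex step is easy to get wrong by hand, so here is a minimal sketch that reproduces it. The helper name `ppr_name` is made up for illustration, and the title format is only what this post observed for AMD document names.

```python
# Family and model exactly as lscpu prints them (decimal).
cpu_family = 25
cpu_model = 33

def ppr_name(family: int, model: int) -> str:
    """Render family/model in hex with an 'h' suffix, as in the PPR title."""
    return f"AMD Family {family:02X}h Model {model:02X}h"

print(ppr_name(cpu_family, cpu_model))  # AMD Family 19h Model 21h
```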

0x bc800800060c0859 = 
0b 1011 1100 1000 0000 0000 1000 0000 0000 0000 0110 0000 1100 0000 1000 0101 1001
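To double-check the hand conversion above, a short sketch like the following prints the 64-bit value in groups of four bits. The function name `nibble_groups` is hypothetical.

```python
status = 0xBC800800060C0859  # MCA_STATUS value from the dmesg output

def nibble_groups(value: int, width: int = 64) -> str:
    """Return `value` as a zero-padded binary string, split into 4-bit groups."""
    bits = f"{value:0{width}b}"
    return " ".join(bits[i:i + 4] for i in range(0, width, 4))

print(nibble_groups(status))
```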

All fields default to zero, so focus on the non-zero ones:

| Bit | Value | Field | Meaning |
|---|---|---|---|
| 63 | 1 | Val | A valid error has been detected. |
| 61 | 1 | UC | The error was not corrected by hardware. |
| 60 | 1 | En | MCA error reporting is enabled for this error. |
| 59 | 1 | MiscV | Valid thresholding in MCA::IF::MCA_MISC0_IF. |
| 58 | 1 | AddrV | MCA::IF::MCA_ADDR_IF contains address information associated with the error. |
| 55 | 1 | TCC | Hardware context of the process thread to which the error was reported may have been corrupted. |
| 43 | 1 | Poison | The error was the result of attempting to consume poisoned data. |
| 29:24 | 000110 | AddrLsb | Specifies the least significant valid bit of the address contained in MCA::IF::MCA_ADDR_IF[ErrorAddr]. |
| 21:16 | 001100 | ErrorCodeExt | Extended Error Code. This field is used to identify the error type for root cause analysis. This field indicates which bit position in MCA::IF::MCA_CTL_IF enables error reporting for the logged error. |
| 15:0 | 0x0859 | ErrorCode | Error code for this error. See 3.1.3.3 [Error Codes] for details on decoding this field. |
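The field extraction above can be sketched in a few lines of Python. The bit positions are copied from the table; the names `FIELDS`, `extract` and `decode` are my own and not part of any AMD or kernel tooling, and fields that the table does not list are ignored.

```python
STATUS = 0xBC800800060C0859

# (field name, high bit, low bit), limited to the fields listed in the table.
FIELDS = [
    ("Val", 63, 63),
    ("UC", 61, 61),
    ("En", 60, 60),
    ("MiscV", 59, 59),
    ("AddrV", 58, 58),
    ("TCC", 55, 55),
    ("Poison", 43, 43),
    ("AddrLsb", 29, 24),
    ("ErrorCodeExt", 21, 16),
    ("ErrorCode", 15, 0),
]

def extract(value: int, hi: int, lo: int) -> int:
    """Return bits hi..lo (inclusive) of value, shifted down to bit 0."""
    width = hi - lo + 1
    return (value >> lo) & ((1 << width) - 1)

def decode(value: int) -> dict:
    """Map each known field name to its extracted value."""
    return {name: extract(value, hi, lo) for name, hi, lo in FIELDS}

for name, field in decode(STATUS).items():
    print(f"{name:<13}{field:#x}")
```

Running it against the status value from dmesg should give AddrLsb = 6, ErrorCodeExt = 12 (0xc) and ErrorCode = 0x859, matching the manual decode.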

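From the above (UC = 1, TCC = 1), the CPU hit an error that hardware could not correct, and the thread context may have been corrupted, so the affected process must be terminated. Here the error was reported in kernel context, which is why the kernel panicked.

Error address

AddrLsb = 6 means the lowest valid address bit is bit 6, so the error can be narrowed down to a single 64-byte cache line (2^6 = 64). The address itself is held in the MCA::IF::MCA_ADDR_IF register. Presumably this is the ADDR 103ec1fa40 value on the second line of the dmesg output, i.e. the cache line at that address is the one that failed.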
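As a rough sketch of what AddrLsb implies, the snippet below masks off the bits below AddrLsb and prints the byte range the reported address covers. It assumes, as this post does, that the ADDR value in dmesg is the content of MCA::IF::MCA_ADDR_IF. The function name `error_range` is hypothetical.

```python
def error_range(addr: int, addr_lsb: int) -> tuple:
    """Return (first, last) byte address of the block that addr falls in,
    treating bits below addr_lsb as not valid."""
    granule = 1 << addr_lsb            # 2**6 = 64 bytes for AddrLsb = 6
    first = addr & ~(granule - 1)      # clear the low addr_lsb bits
    return first, first + granule - 1

first, last = error_range(0x103EC1FA40, 6)
print(hex(first), hex(last))
```

The reported address is already 64-byte aligned (its low six bits are zero), which is consistent with AddrLsb = 6.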

ErrorCodeExt

ErrorCodeExt = 12. Bit 12 of MCA::IF::MCA_CTL_IF is:

L2RespPoison. Read-write. Reset: 0. L2 Cache Response Poison Error. Error is the result of consuming poison data.

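However, 3.1.3.3 [Error Codes] says that MCA_STATUS[ErrorCodeExt] does not directly identify the error type itself; it only indicates which bit in MCA_CTL enables reporting of the logged error. In other words, the error may not have originated in L2, but L2 did receive poisoned data.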

ErrorCode

For ErrorCode = 0000 1000 0101 1001, the only matching entry in 3.1.3.3 [Error Codes] is the third row of Table 26:

  • Error Code: 0000 1XXT RRRR XXLL
  • Error Code Type: Bus
  • Description:
    • XX = Reserved
    • T = Timeout
    • RRRR = Memory Transaction Type
    • LL = Cache Level

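Therefore T = 0, RRRR = 0101 and LL = 01. That is, the reported cache level (LL) is L1 and the memory transaction type (RRRR) is Instruction Fetch.

Conclusion

In summary, the likely cause is: during an Instruction Fetch on the cache line at 0x103ec1fa40, an L1-cache-related bus transaction error occurred, and this error also caused L2 to receive poisoned data.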
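The pattern match against 0000 1XXT RRRR XXLL can be sketched as follows. The lookup tables contain only the two encodings this post relies on (RRRR = 0101 and LL = 01); the other encodings are in the PPR and are deliberately left out here. The names `decode_bus_error`, `RRRR_NAMES` and `LL_NAMES` are made up for illustration.

```python
ERROR_CODE = 0x0859

# Only the encodings used in this post; see the PPR for the full lists.
RRRR_NAMES = {0b0101: "Instruction Fetch"}
LL_NAMES = {0b01: "L1"}

def decode_bus_error(code: int) -> dict:
    """Split a 16-bit ErrorCode of the form 0000 1XXT RRRR XXLL into fields."""
    if (code >> 11) != 0b00001:
        raise ValueError("not a Bus error code (expected 0000 1XXT RRRR XXLL)")
    rrrr = (code >> 4) & 0xF   # bits 7:4, memory transaction type
    ll = code & 0x3            # bits 1:0, cache level
    return {
        "T": (code >> 8) & 1,  # bit 8, timeout
        "RRRR": rrrr,
        "LL": ll,
        "transaction": RRRR_NAMES.get(rrrr, "not listed here"),
        "level": LL_NAMES.get(ll, "not listed here"),
    }

print(decode_bus_error(ERROR_CODE))
```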