ShardingSphere-XA事务无法恢复问题定位

问题背景

在测试 ShardingSphere-Proxy 使用达梦数据库开启 XA 事务进行压测期间,kill 掉达梦数据库,等一段时间再重启达梦数据库,最终未完成的 XA 事务无法恢复,数据产生不一致。
查看达梦上的 xid 一直有残留,没有被处理。
使用 ShardingSphere-Proxy 结合其他主流数据库进行测试时无该问题。

通过日志可以看到在压测期间,挂掉达梦数据库之后,有大量连接报错。节选错误日志如下:
比如下面这段日志,在执行 xa end 操作时,因为达梦数据库连接不上,达梦驱动报错如下:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
[WARN ] 2024-01-29 11:11:19.867 [Connection-82-ThreadExecutor] com.arjuna.ats.jta - ARJUNA016041: prepare on < formatId=131077, gtrid_length=64, bqual_length=36, tx_uid=0:ffff7f000001:926b:65b7167e:5a4e, node_name=806e4bdb-b629-434b-bcee-0d86fa9cb8b0, branch_uid=0:ffff7f000001:926b:65b7167e:5a54, subordinatenodename=null, eis_name=0 > (org.apache.shardingsphere.transaction.xa.spi.SingleXAResource@169b2776) failed with exception ARJUNA016099: Unknown error code:6001
dm.jdbc.driver.DmdbXAException: Communication error
at dm.jdbc.driver.DBError.throwXAException(DBError.java:795)
at dm.jdbc.a.a.a(DBAccess.java:1315)
at dm.jdbc.driver.DmdbXAResource.end(DmdbXAResource.java:104)
at org.apache.shardingsphere.transaction.xa.spi.SingleXAResource.end(SingleXAResource.java:51)
at com.arjuna.ats.internal.jta.transaction.arjunacore.TransactionImple.doEnd(TransactionImple.java:1087)
at com.arjuna.ats.internal.jta.transaction.arjunacore.TransactionImple.endAssociation(TransactionImple.java:1058)
at com.arjuna.ats.internal.jta.resources.arjunacore.XAResourceRecord.endAssociation(XAResourceRecord.java:1295)
at com.arjuna.ats.internal.jta.resources.arjunacore.XAResourceRecord.topLevelPrepare(XAResourceRecord.java:210)
at com.arjuna.ats.arjuna.coordinator.BasicAction.doPrepare(BasicAction.java:2678)
at com.arjuna.ats.arjuna.coordinator.BasicAction.doPrepare(BasicAction.java:2628)
at com.arjuna.ats.arjuna.coordinator.BasicAction.prepare(BasicAction.java:2162)
at com.arjuna.ats.arjuna.coordinator.BasicAction.End(BasicAction.java:1508)
at com.arjuna.ats.arjuna.coordinator.TwoPhaseCoordinator.end(TwoPhaseCoordinator.java:96)
at com.arjuna.ats.arjuna.AtomicAction.commit(AtomicAction.java:162)
at com.arjuna.ats.internal.jta.transaction.arjunacore.TransactionImple.commitAndDisassociate(TransactionImple.java:1295)
at com.arjuna.ats.internal.jta.transaction.arjunacore.BaseTransaction.commit(BaseTransaction.java:128)
at org.apache.shardingsphere.transaction.xa.XAShardingSphereTransactionManager.commit(XAShardingSphereTransactionManager.java:129)
at org.apache.shardingsphere.proxy.backend.connector.jdbc.transaction.BackendTransactionManager.commit(BackendTransactionManager.java:94)
at org.apache.shardingsphere.proxy.backend.connector.DatabaseConnector.doExecuteWithImplicitCommitTransaction(DatabaseConnector.java:259)
at org.apache.shardingsphere.proxy.backend.connector.DatabaseConnector.execute(DatabaseConnector.java:228)
at org.apache.shardingsphere.proxy.frontend.mysql.command.query.text.query.MySQLComQueryPacketExecutor.execute(MySQLComQueryPacketExecutor.java:99)
at org.apache.shardingsphere.proxy.frontend.command.CommandExecutorTask.doExecuteCommand(CommandExecutorTask.java:129)
at org.apache.shardingsphere.proxy.frontend.command.CommandExecutorTask.executeCommand(CommandExecutorTask.java:124)
at org.apache.shardingsphere.proxy.frontend.command.CommandExecutorTask.run(CommandExecutorTask.java:78)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)

下面这段日志,在执行 xa prepare 操作时,因为数据库连接不上,达梦驱动报错如下:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
[WARN ] 2024-01-29 11:11:19.860 [Connection-85-ThreadExecutor] com.arjuna.ats.jta - ARJUNA016041: prepare on < formatId=131077, gtrid_length=64, bqual_length=36, tx_uid=0:ffff7f000001:926b:65b7167e:5af8, node_name=806e4bdb-b629-434b-bcee-0d86fa9cb8b0, branch_uid=0:ffff7f000001:926b:65b7167e:5b14, subordinatenodename=null, eis_name=0 > (org.apache.shardingsphere.transaction.xa.spi.SingleXAResource@4ce53a0a) failed with exception ARJUNA016099: Unknown error code:6001
dm.jdbc.driver.DmdbXAException: Communication error
at dm.jdbc.driver.DBError.throwXAException(DBError.java:795)
at dm.jdbc.a.a.a(DBAccess.java:1315)
at dm.jdbc.driver.DmdbXAResource.prepare(DmdbXAResource.java:174)
at org.apache.shardingsphere.transaction.xa.spi.SingleXAResource.prepare(SingleXAResource.java:73)
at com.arjuna.ats.internal.jta.resources.arjunacore.XAResourceRecord.topLevelPrepare(XAResourceRecord.java:214)
at com.arjuna.ats.arjuna.coordinator.BasicAction.doPrepare(BasicAction.java:2678)
at com.arjuna.ats.arjuna.coordinator.BasicAction.doPrepare(BasicAction.java:2628)
at com.arjuna.ats.arjuna.coordinator.BasicAction.prepare(BasicAction.java:2162)
at com.arjuna.ats.arjuna.coordinator.BasicAction.End(BasicAction.java:1508)
at com.arjuna.ats.arjuna.coordinator.TwoPhaseCoordinator.end(TwoPhaseCoordinator.java:96)
at com.arjuna.ats.arjuna.AtomicAction.commit(AtomicAction.java:162)
at com.arjuna.ats.internal.jta.transaction.arjunacore.TransactionImple.commitAndDisassociate(TransactionImple.java:1295)
at com.arjuna.ats.internal.jta.transaction.arjunacore.BaseTransaction.commit(BaseTransaction.java:128)
at org.apache.shardingsphere.transaction.xa.XAShardingSphereTransactionManager.commit(XAShardingSphereTransactionManager.java:129)
at org.apache.shardingsphere.proxy.backend.connector.jdbc.transaction.BackendTransactionManager.commit(BackendTransactionManager.java:94)
at org.apache.shardingsphere.proxy.backend.connector.DatabaseConnector.doExecuteWithImplicitCommitTransaction(DatabaseConnector.java:259)
at org.apache.shardingsphere.proxy.backend.connector.DatabaseConnector.execute(DatabaseConnector.java:228)
at org.apache.shardingsphere.proxy.frontend.mysql.command.query.text.query.MySQLComQueryPacketExecutor.execute(MySQLComQueryPacketExecutor.java:99)
at org.apache.shardingsphere.proxy.frontend.command.CommandExecutorTask.doExecuteCommand(CommandExecutorTask.java:129)
at org.apache.shardingsphere.proxy.frontend.command.CommandExecutorTask.executeCommand(CommandExecutorTask.java:124)
at org.apache.shardingsphere.proxy.frontend.command.CommandExecutorTask.run(CommandExecutorTask.java:78)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)

xa commit 阶段,因为数据库连接不上,达梦驱动报错如下:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
[WARN ] 2024-01-29 11:11:19.865 [Connection-56-ThreadExecutor] com.arjuna.ats.jta - ARJUNA016036: commit on < formatId=131077, gtrid_length=64, bqual_length=36, tx_uid=0:ffff7f000001:926b:65b7167e:58be, node_name=806e4bdb-b629-434b-bcee-0d86fa9cb8b0, branch_uid=0:ffff7f000001:926b:65b7167e:58c8, subordinatenodename=null, eis_name=0 > (org.apache.shardingsphere.transaction.xa.spi.SingleXAResource@576dcbef) failed with exception $ARJUNA016099: Unknown error code:6001
dm.jdbc.driver.DmdbXAException: Communication error
at dm.jdbc.driver.DBError.throwXAException(DBError.java:795)
at dm.jdbc.a.a.a(DBAccess.java:1315)
at dm.jdbc.driver.DmdbXAResource.commit(DmdbXAResource.java:88)
at org.apache.shardingsphere.transaction.xa.spi.SingleXAResource.commit(SingleXAResource.java:44)
at com.arjuna.ats.internal.jta.resources.arjunacore.XAResourceRecord.topLevelCommit(XAResourceRecord.java:473)
at com.arjuna.ats.arjuna.coordinator.BasicAction.doCommit(BasicAction.java:2906)
at com.arjuna.ats.arjuna.coordinator.BasicAction.doCommit(BasicAction.java:2822)
at com.arjuna.ats.arjuna.coordinator.BasicAction.phase2Commit(BasicAction.java:1878)
at com.arjuna.ats.arjuna.coordinator.BasicAction.End(BasicAction.java:1534)
at com.arjuna.ats.arjuna.coordinator.TwoPhaseCoordinator.end(TwoPhaseCoordinator.java:96)
at com.arjuna.ats.arjuna.AtomicAction.commit(AtomicAction.java:162)
at com.arjuna.ats.internal.jta.transaction.arjunacore.TransactionImple.commitAndDisassociate(TransactionImple.java:1295)
at com.arjuna.ats.internal.jta.transaction.arjunacore.BaseTransaction.commit(BaseTransaction.java:128)
at org.apache.shardingsphere.transaction.xa.XAShardingSphereTransactionManager.commit(XAShardingSphereTransactionManager.java:129)
at org.apache.shardingsphere.proxy.backend.connector.jdbc.transaction.BackendTransactionManager.commit(BackendTransactionManager.java:94)
at org.apache.shardingsphere.proxy.backend.connector.DatabaseConnector.doExecuteWithImplicitCommitTransaction(DatabaseConnector.java:259)
at org.apache.shardingsphere.proxy.backend.connector.DatabaseConnector.execute(DatabaseConnector.java:228)
at org.apache.shardingsphere.proxy.frontend.mysql.command.query.text.query.MySQLComQueryPacketExecutor.execute(MySQLComQueryPacketExecutor.java:99)
at org.apache.shardingsphere.proxy.frontend.command.CommandExecutorTask.doExecuteCommand(CommandExecutorTask.java:129)
at org.apache.shardingsphere.proxy.frontend.command.CommandExecutorTask.executeCommand(CommandExecutorTask.java:124)
at org.apache.shardingsphere.proxy.frontend.command.CommandExecutorTask.run(CommandExecutorTask.java:78)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)

持续观察日志,能够看到 xa recover 查询出来的 xid 在不断增多。

1
[INFO ] 2024-01-29 11:11:53.159 [Periodic Recovery] o.a.s.t.x.n.m.DataSourceXAResourceRecoveryHelper - 115174042, recover, result:[dm.jdbc.driver.DmdbXid@6466d28f, dm.jdbc.driver.DmdbXid@2179218b, dm.jdbc.driver.DmdbXid@7cd785b1, dm.jdbc.driver.DmdbXid@2f889463, dm.jdbc.driver.DmdbXid@7da8d9b7, dm.jdbc.driver.DmdbXid@1535daf0, dm.jdbc.driver.DmdbXid@550f3fea, dm.jdbc.driver.DmdbXid@3ecf92a4, dm.jdbc.driver.DmdbXid@59181ab, dm.jdbc.driver.DmdbXid@3e4829b3, dm.jdbc.driver.DmdbXid@1294a1ff, dm.jdbc.driver.DmdbXid@500c5bbb, dm.jdbc.driver.DmdbXid@599914c8, dm.jdbc.driver.DmdbXid@2663abf9, dm.jdbc.driver.DmdbXid@11cf7d93, dm.jdbc.driver.DmdbXid@232c3d92, dm.jdbc.driver.DmdbXid@3282426b, dm.jdbc.driver.DmdbXid@9051ba1, dm.jdbc.driver.DmdbXid@63ccccf7, dm.jdbc.driver.DmdbXid@21ceadb1, dm.jdbc.driver.DmdbXid@17f32f9, dm.jdbc.driver.DmdbXid@2006e964, dm.jdbc.driver.DmdbXid@25a832e7, dm.jdbc.driver.DmdbXid@71dfb0c6, dm.jdbc.driver.DmdbXid@94fbcfe, dm.jdbc.driver.DmdbXid@1e15297a, dm.jdbc.driver.DmdbXid@7a4e0f55, dm.jdbc.driver.DmdbXid@761bc95a, dm.jdbc.driver.DmdbXid@247332b, dm.jdbc.driver.DmdbXid@352e94e0, dm.jdbc.driver.DmdbXid@3f9001b7, dm.jdbc.driver.DmdbXid@17e95e06, dm.jdbc.driver.DmdbXid@2c0cf9bb, dm.jdbc.driver.DmdbXid@6dc4f14e, dm.jdbc.driver.DmdbXid@5266c8f8, dm.jdbc.driver.DmdbXid@fa2a5dc, dm.jdbc.driver.DmdbXid@44a844b2, dm.jdbc.driver.DmdbXid@e2c93e8, dm.jdbc.driver.DmdbXid@6ae9bbdd, dm.jdbc.driver.DmdbXid@707f6f19, dm.jdbc.driver.DmdbXid@76742870, dm.jdbc.driver.DmdbXid@1cad4685, dm.jdbc.driver.DmdbXid@3b3d219e, dm.jdbc.driver.DmdbXid@30319647, dm.jdbc.driver.DmdbXid@67d11f99, dm.jdbc.driver.DmdbXid@113ee488, dm.jdbc.driver.DmdbXid@435d06e0, dm.jdbc.driver.DmdbXid@7fe22362, dm.jdbc.driver.DmdbXid@1020d6d0, dm.jdbc.driver.DmdbXid@76dc7af8, dm.jdbc.driver.DmdbXid@5d1d18c5, dm.jdbc.driver.DmdbXid@338ac937, dm.jdbc.driver.DmdbXid@258a1fc0, dm.jdbc.driver.DmdbXid@3014aa33, dm.jdbc.driver.DmdbXid@5fa57c4, dm.jdbc.driver.DmdbXid@60dcf53d, dm.jdbc.driver.DmdbXid@15dd92f9, dm.jdbc.driver.DmdbXid@8de743, dm.jdbc.driver.DmdbXid@579ed4e1, dm.jdbc.driver.DmdbXid@506cb632, dm.jdbc.driver.DmdbXid@29dfbc9c, dm.jdbc.driver.DmdbXid@4607f11b, dm.jdbc.driver.DmdbXid@679f1ce6, dm.jdbc.driver.DmdbXid@52f7fb01], delegate:class dm.jdbc.driver.DmdbXAResource

日志中没有看到期望的二阶段恢复操作,可疑的异常日志如下:Transaction 0:ffff7f000001:926b:65b7167e:5a65 restored heuristic participant

1
2
3
4
5
2024-01-29 14:09:25.984 [Periodic Recovery] WARN  com.arjuna.ats.jta - ARJUNA016037: Could not find new XAResource to use for recovering non-serializable XAResource XAResourceRecord < resource:null, txid:< formatId=131077, gtrid_length=64, bqual_length=36, tx_uid=0:ffff7f000001:926b:65b7167e:5a65, node_name=806e4bdb-b629-434b-bcee-0d86fa9cb8b0, branch_uid=0:ffff7f000001:926b:65b7167e:5a76, subordinatenodename=null, eis_name=0 >, heuristic: TwoPhaseOutcome.HEURISTIC_HAZARD com.arjuna.ats.internal.jta.resources.arjunacore.XAResourceRecord@6c769e03 >
2024-01-29 14:09:25.984 [Periodic Recovery] WARN com.arjuna.ats.jta - XAResourceRecord restored heuristic instance: XAResourceRecord < resource:null, txid:< formatId=131077, gtrid_length=64, bqual_length=36, tx_uid=0:ffff7f000001:926b:65b7167e:5a65, node_name=806e4bdb-b629-434b-bcee-0d86fa9cb8b0, branch_uid=0:ffff7f000001:926b:65b7167e:5a76, subordinatenodename=null, eis_name=0 >, heuristic: TwoPhaseOutcome.HEURISTIC_HAZARD com.arjuna.ats.internal.jta.resources.arjunacore.XAResourceRecord@6c769e03 >
2024-01-29 14:09:25.985 [Periodic Recovery] TRACE com.arjuna.ats.arjuna - RecordList::insert(RecordList: empty) : appending /StateManager/AbstractRecord/XAResourceRecord for 0:ffff7f000001:926b:65b7167e:5a77
2024-01-29 14:09:25.985 [Periodic Recovery] WARN com.arjuna.ats.arjuna - Transaction 0:ffff7f000001:926b:65b7167e:5a65 restored heuristic participant XAResourceRecord < resource:null, txid:< formatId=131077, gtrid_length=64, bqual_length=36, tx_uid=0:ffff7f000001:926b:65b7167e:5a65, node_name=806e4bdb-b629-434b-bcee-0d86fa9cb8b0, branch_uid=0:ffff7f000001:926b:65b7167e:5a76, subordinatenodename=null, eis_name=0 >, heuristic: TwoPhaseOutcome.HEURISTIC_HAZARD com.arjuna.ats.internal.jta.resources.arjunacore.XAResourceRecord@6c769e03 >
2024-01-29 14:09:25.985 [Periodic Recovery] TRACE com.arjuna.ats.arjuna - HeuristicList - Unpacked a 463 record

所以现在的问题就是 ShardingSphere-Proxy 能够查询出来 dameng 上没完成的事务,但是没有执行二阶段 xa commit 或 xa rollback 操作,具体原因需要继续分析。

同样对比 Oracle 数据库,在 XA 事务压测期间 kill 掉 Oracle,后续再重启 Oracle 事务是可以正常恢复的,最终数据一致。日志如下:

1
2
3
[INFO ] 2024-01-29 11:04:29.279 [Periodic Recovery] o.a.s.t.x.n.m.DataSourceXAResourceRecoveryHelper - 497281350, recover, result:[oracle.jdbc.xa.OracleXid@20b27], delegate:class oracle.jdbc.driver.T4CXAResource

[INFO ] 2024-01-29 11:04:30.285 [Periodic Recovery] o.a.s.t.x.n.m.DataSourceXAResourceRecoveryHelper - 497281350, rollback, xid:< formatId=131077, gtrid_length=64, bqual_length=36, tx_uid=0:ffff7f000001:a7f4:65b7143a:351b, node_name=15774b67-7ca7-4b1b-90ee-e18fec83225b, branch_uid=0:ffff7f000001:a7f4:65b7143a:351c, subordinatenodename=null, eis_name=0 >

问题定位

本地搭建环境尝试复现问题

前置准备:
1.执行脚本清理 dameng 和 mysql 数据库上残留的 xid 和锁,清理数据。
2.proxy 配置。

本地在 ShardingSphere-Proxy 执行 xa commit 之前,调用 shell 自动 kill 掉数据库,模拟二阶段 commit 失败,后续查看事务恢复流程。

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
public final class SingleXAResource implements XAResource {

private final String resourceName;

private final XAResource delegate;

private static AtomicBoolean killed = new AtomicBoolean();

@Override
public void commit(final Xid xid, final boolean onePhase) throws XAException {
count.getAndIncrement();
// 在执行 xa commit 之前自动 kill 掉数据库实例
if (killed.compareAndSet(false, true)) {
ShellUtil.callShellByExec("docker stop mysqlv8-14310");
}
delegate.commit(xid, onePhase);
}
// ...
}

后续本地 debug 一直发现达梦数据库挂掉和 mysql 数据库挂掉后,两个后续执行的逻辑不一样。

mysql 的话,会正常走恢复流程。而达梦残留的 xa 事务一直被忽略处理。

最后 debug mysql 挂掉 和 达梦挂掉之后的区别。

发现二者在调用 xa 接口时因为数据库挂掉返回的错误码不一致,达梦返回的错误码没有按照 xa 规范进行处理,所以事务管理器 TM 无法根据错误码执行后续恢复操作。

MySQL 挂掉后事务恢复流程

MySQL JDBC 驱动处理逻辑如下:

1
2
3
4
5
6
// com.mysql.cj.jdbc.MysqlXAConnection
// MySQL 驱动会对数据库异常处理成符合 XA 规范的错误码返回。
protected static XAException mapXAExceptionFromSQLException(SQLException sqlEx) {
Integer xaCode = (Integer)MYSQL_ERROR_CODES_TO_XA_ERROR_CODES.get(sqlEx.getErrorCode());
return xaCode != null ? (XAException)(new MysqlXAException(xaCode, sqlEx.getMessage(), (String)null)).initCause(sqlEx) : (XAException)(new MysqlXAException(-7, Messages.getString("MysqlXAConnection.003"), (String)null)).initCause(sqlEx);
}

MySQL 在执行 xa commit 等阶段因为数据库挂掉报错如下,返回错误码 -7

查看 JTA 规范里 XAException 类中定义的错误码,-7 表示 RM unavailable.

1
2
3
4
5
// javax.transaction.xa.XAException
/**
* Resource manager is unavailable.
*/
public final static int XAER_RMFAIL = -7;


Narayana 正确处理该错误码,后续进行事务恢复操作。

达梦挂掉后事务恢复流程

返回错误码 6001,看起来像是一个内部错误码,没有按照 XA 规范返回正确的错误码。会走最后一个分支,后续该事务无法恢复。

HEURISTIC_HAZARD 的含义如下:

1
2
// 在准备阶段之后,一些子参与者已经提交,一些已经回滚,还有一些我们不清楚。
public static final int HEURISTIC_HAZARD = 6; // after prepare some sub-participants committed, some rolled back and some we don't know

XA规范

XA 规范中明确定义了 xa_commit 等操作应该返回的错误码。

1
2
3
4
-- 第47页
XAException.XAER_RMFAIL is used when some connection error happens and it is expected that reconnection of RM could occur

当发生某些连接错误时使用 XAException.XAER_RMFAIL,预计 RM 可能会重新连接

其他数据库处理流程

postgresql 驱动处理 xa 异常逻辑

Pg 同样处理,当数据库连接异常时返回 XAER_RMFAIL 错误码

问题解决

在TM侧兼容达梦原生异常,或者RM驱动修复返回正确的错误码即可。

参考

narayana Heuristic异常

xa规范文档

pg驱动对xa-commit异常处理的代码