Ximeng Guan
2018-09-05 21:50:03 UTC
Hello,
We have a heavily used OpenAFS client (a design server) that has repeatedly reported "Connection timed out" errors during peak hours over the past week.
Probing it with rxdebug on port 7001 shows a high number of calls that have waited for a thread (90 calls have waited for a thread, noBuffers 11, 2892 client connections).
According to the users, the loss of connection happens randomly and can last up to 10 minutes. While it lasts, the server seems to be unreachable from outside (ping or ssh), but existing sessions still seem to respond to keystrokes from remote users.
The client is running CentOS 6.6, kernel 2.6.32-696.3.2.el6.x86_64 and OpenAFS 1.6.20.
I suspect that the problem may not be related to the OpenAFS client but to the network service in general. However, the output returned by rxdebug seems odd too.
Can someone help us interpret this output?
Trying 172.16.101.82 (port 7001):
Free packets: 424/531, packet reclaims: 0, calls: 184990, used FDs: 64
not waiting for packets.
0 calls waiting for a thread
1 threads are idle
90 calls have waited for a thread
rx stats: free packets 424, allocs 2067108999, alloc-failures(rcv 0/0,send 0/0,ack 0)
greedy 0, bogusReads 0 (last from host 0), noPackets 0, noBuffers 11, selects 0, sendSelects 0
packets read: data 341104352 ack 2316391759 busy 216 abort 736892 ackall 0 challenge 49571 response 0 debug 3432 params 0 unused 0 unused 0 unused 0 version 0
other read counters: data 313908875, ack 2281732446, dup 7217124 spurious 61698033 dally 161427
packets sent: data 1985111690 ack 187255106 busy 0 abort 173 ackall 0 challenge 0 response 49567 debug 0 params 0 unused 0 unused 0 unused 0 version 0
other send counters: ack 187255106, data 1580369647 (not resends), resends 404742043, pushed 0, acked&ignored 2852065309
(these should be small) sendFailed 1470595, fatalErrors 153
Average rtt is 0.000, with 1163176554 samples
Minimum rtt is 0.000, maximum is 2.396
3 server connections, 2892 client connections, 77 peer structs, 1127 call structs, 39 free call structs
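To put the counters above in proportion, here is a minimal Python sketch that parses a few lines of the rxdebug output and computes what share of the data packets sent were resends, and what share of calls had to wait for a thread. The function names and regular expressions are my own illustrative assumptions and not part of OpenAFS; the sample text is copied from the output above.

```python
import re

# A few lines copied verbatim from the rxdebug output above.
RXDEBUG_SAMPLE = """\
Free packets: 424/531, packet reclaims: 0, calls: 184990, used FDs: 64
90 calls have waited for a thread
other send counters: ack 187255106, data 1580369647 (not resends), resends 404742043, pushed 0, acked&ignored 2852065309
(these should be small) sendFailed 1470595, fatalErrors 153
"""

# Hypothetical counter names mapped to patterns matching the text above.
PATTERNS = {
    "calls": r"calls: (\d+)",
    "waited": r"(\d+) calls have waited for a thread",
    "data_sent": r"data (\d+) \(not resends\)",
    "resends": r"resends (\d+)",
    "send_failed": r"sendFailed (\d+)",
}


def parse_rx_counters(text):
    """Return a dict of the counters found in rxdebug output text."""
    counters = {}
    for name, pattern in PATTERNS.items():
        match = re.search(pattern, text)
        if match:
            counters[name] = int(match.group(1))
    return counters


def resend_fraction(counters):
    """Resends as a fraction of all data packets sent (new data + resends)."""
    total = counters["data_sent"] + counters["resends"]
    return counters["resends"] / total if total else 0.0


def waited_fraction(counters):
    """Fraction of all calls that had to wait for a thread."""
    calls = counters["calls"]
    return counters["waited"] / calls if calls else 0.0


if __name__ == "__main__":
    c = parse_rx_counters(RXDEBUG_SAMPLE)
    print(f"resend fraction: {resend_fraction(c):.3f}")
    print(f"waited fraction: {waited_fraction(c):.6f}")
```

On the counters above, resends come to roughly 20% of all data packets sent, while only 90 of 184990 calls ever waited for a thread. The counters are cumulative since the client started, so they do not show when the resends occurred.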
We also looked at the file servers on port 7000 and didn't see the same congestion. However, the file server log does show a pattern that seems to be related to the high number of connections from that same client:
Wed Sep 5 13:02:45 2018 FindClient: stillborn client 00007FE9680116E0(1cedd31c); conn 00007FE77C011060 (host 172.16.101.82:7001) had client 00007FE968013AA0(1cedd31c)
Wed Sep 5 13:12:53 2018 FindClient: stillborn client 00007FE97001E400(1cedd370); conn 00007FE97401BB70 (host 172.16.101.82:7001) had client 00007FE97001E0C0(1cedd370)
Wed Sep 5 13:18:56 2018 FindClient: stillborn client 00007FE98C02A7C0(1cedd394); conn 00007FE974032180 (host 172.16.101.82:7001) had client 00007FE968013010(1cedd394)
Wed Sep 5 13:18:57 2018 FindClient: stillborn client 00007FE76401D840(1cedd39c); conn 0000000000C0A590 (host 172.16.101.82:7001) had client 00007FE98C02A7C0(1cedd39c)
Wed Sep 5 13:26:36 2018 FindClient: stillborn client 00007FE970020F10(1cedd3e4); conn 0000000000C07800 (host 172.16.101.82:7001) had client 00007FE968012240(1cedd3e4)
Wed Sep 5 13:37:09 2018 FindClient: stillborn client 00007FE98C029FA0(1cedd4c0); conn 00007FE77C0121B0 (host 172.16.101.82:7001) had client 00007FE968013690(1cedd4c0)
Thank you!
Best regards,
========================================
Ximeng (Simon) Guan, Ph.D.
Associate Principal Engineer
Royole Corporation
========================================