Discussion:
[OpenAFS] disk cache read error in CacheItems
Martin Flemming
2018-10-23 05:35:55 UTC
Hi!

In the last few days we've observed an increasing number of nodes
that can no longer be reached and have to be rebooted.

In /var/log/messages we see a lot of lines such as:

Oct 22 18:48:26 bird858 kernel: afs: disk cache read error in CacheItems slot 25254 off 2020340/13880020 code -5/80
Oct 22 18:48:26 bird858 kernel: afs: disk cache read error in CacheItems slot 25253 off 2020260/13880020 code -5/80
Oct 22 18:48:26 bird858 kernel: afs: disk cache read error in CacheItems slot 25252 off 2020180/13880020 code -5/80
Oct 22 18:48:26 bird858 kernel: afs: disk cache read error in CacheItems slot 25251 off 2020100/13880020 code -5/80

until nothing happens anymore ...

The clients are CentOS 7.5, kernel 3.10.0-862.14.4.el7.x86_64, OpenAFS 1.6.23 built 2018-09-12 (***@fnal.gov).

Any hints as to the possible cause?

Thanks & Cheers,

Martin
Andreas Ladanyi
2018-10-23 10:16:28 UTC
Hi Martin,
Post by Martin Flemming
Hi !
In the last few days we've observed an increasing number of Nodes,
which are no longer be reached and have to be rebooted
In the /var/log/messages we see a lot of lines with e.g.
Oct 22 18:48:26 bird858 kernel: afs: disk cache read error in
CacheItems slot 25254 off 2020340/13880020 code -5/80
Oct 22 18:48:26 bird858 kernel: afs: disk cache read error in
CacheItems slot 25253 off 2020260/13880020 code -5/80
Oct 22 18:48:26 bird858 kernel: afs: disk cache read error in
CacheItems slot 25252 off 2020180/13880020 code -5/80
Oct 22 18:48:26 bird858 kernel: afs: disk cache read error in
CacheItems slot 25251 off 2020100/13880020 code -5/80
till nothing happens anymore ...
The clients are Centos 7.5 , 3.10.0-862.14.4.el7.x86_64, OpenAFS
Any hints for the possible reason ?
I have the same setup, with the AFS 1.6.23 client from the jsbilling repo.

I don't see these messages in /var/log/messages so far.


regards,

Andy
Stephan Wiesand
2018-10-23 12:14:38 UTC
Post by Andreas Ladanyi
Post by Martin Flemming
In the last few days we've observed an increasing number of Nodes,
which are no longer be reached and have to be rebooted
In the /var/log/messages we see a lot of lines with e.g.
Oct 22 18:48:26 bird858 kernel: afs: disk cache read error in
CacheItems slot 25254 off 2020340/13880020 code -5/80
Oct 22 18:48:26 bird858 kernel: afs: disk cache read error in
CacheItems slot 25253 off 2020260/13880020 code -5/80
Oct 22 18:48:26 bird858 kernel: afs: disk cache read error in
CacheItems slot 25252 off 2020180/13880020 code -5/80
Oct 22 18:48:26 bird858 kernel: afs: disk cache read error in
CacheItems slot 25251 off 2020100/13880020 code -5/80
till nothing happens anymore ...
The clients are Centos 7.5 , 3.10.0-862.14.4.el7.x86_64, OpenAFS
Any hints for the possible reason ?
I have the same constellation with AFS 1.6.23 client from jsbilling repo.
I cant see this messages in /var/log/messages yet.
We're running the same kernel version and the same client build (it's the SL one) on a fair number of SL 7.4 systems, and don't see these issues either.

-5 is EIO, meaning an actual I/O error is reported.
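
As a rough illustration of that reading, the sketch below pulls the numbers out of one of the quoted log lines and maps the code to its errno name. The field names (slot, offset, file_size, code, record_size) are guesses based on the layout of the message, not something stated in this thread. The same applies to the relation offset = 20 + 80 * slot, which the quoted lines happen to satisfy.

```python
import errno
import os
import re

# Field names are illustrative guesses from the message layout, not confirmed here.
LOG_RE = re.compile(
    r"disk cache read error in CacheItems slot (?P<slot>\d+) "
    r"off (?P<offset>\d+)/(?P<file_size>\d+) "
    r"code (?P<code>-?\d+)/(?P<record_size>\d+)"
)


def parse_cache_error(line):
    """Return the numeric fields of a cache read error line, or None if no match."""
    match = LOG_RE.search(line)
    if match is None:
        return None
    return {key: int(value) for key, value in match.groupdict().items()}


def errno_name(code):
    """Map a negative kernel-style return code to its symbolic errno name."""
    return errno.errorcode.get(abs(code), "UNKNOWN")


line = (
    "Oct 22 18:48:26 bird858 kernel: afs: disk cache read error in "
    "CacheItems slot 25254 off 2020340/13880020 code -5/80"
)
rec = parse_cache_error(line)
print(rec["slot"], errno_name(rec["code"]), os.strerror(abs(rec["code"])))

# Observation from the logged numbers only (an inference, not a documented fact):
# the offset equals a 20-byte header plus 80 bytes per slot.
print(rec["offset"] == 20 + rec["record_size"] * rec["slot"])
```

If that inference holds, the failing reads are inside the bounds of the file, which fits an I/O error from the underlying filesystem or disk rather than a read past the end of CacheItems.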

What's the size and type of the cache filesystems? What does "fs getcacheparms" report? What are the afsd parameters? Could these nodes be out of space or inodes for the cache?
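
One way to answer the space and inode question on a node is sketched below, using os.statvfs. The helper name cache_fs_usage is made up for this example, and the path /var/cache/afs is only an assumed cache location. The percentages are computed from total and free counts, so they can differ slightly from what df prints.

```python
import os


def cache_fs_usage(path):
    """Return block and inode usage percentages for the filesystem holding path."""
    st = os.statvfs(path)
    # Some filesystems report 0 total inodes; avoid dividing by zero.
    block_pct = 100.0 * (st.f_blocks - st.f_bfree) / st.f_blocks if st.f_blocks else 0.0
    inode_pct = 100.0 * (st.f_files - st.f_ffree) / st.f_files if st.f_files else 0.0
    return {
        "block_pct": block_pct,
        "inode_pct": inode_pct,
        "free_bytes": st.f_bavail * st.f_frsize,  # space available to non-root users
        "free_inodes": st.f_favail,
    }


if __name__ == "__main__":
    cache_dir = "/var/cache/afs"  # assumed cache directory; adjust per node
    if os.path.isdir(cache_dir):
        print(cache_fs_usage(cache_dir))
```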
--
Stephan Wiesand
DESY -DV-
Platanenallee 6
15738 Zeuthen, Germany
Benjamin Kaduk
2018-10-24 02:27:42 UTC
Post by Stephan Wiesand
Post by Andreas Ladanyi
Post by Martin Flemming
In the last few days we've observed an increasing number of Nodes,
which are no longer be reached and have to be rebooted
In the /var/log/messages we see a lot of lines with e.g.
Oct 22 18:48:26 bird858 kernel: afs: disk cache read error in
CacheItems slot 25254 off 2020340/13880020 code -5/80
Oct 22 18:48:26 bird858 kernel: afs: disk cache read error in
CacheItems slot 25253 off 2020260/13880020 code -5/80
Oct 22 18:48:26 bird858 kernel: afs: disk cache read error in
CacheItems slot 25252 off 2020180/13880020 code -5/80
Oct 22 18:48:26 bird858 kernel: afs: disk cache read error in
CacheItems slot 25251 off 2020100/13880020 code -5/80
till nothing happens anymore ...
The clients are Centos 7.5 , 3.10.0-862.14.4.el7.x86_64, OpenAFS
Any hints for the possible reason ?
I have the same constellation with AFS 1.6.23 client from jsbilling repo.
I cant see this messages in /var/log/messages yet.
We're running the same kernel version and the same client build (it's the SL one) on a fair number of SL 7.4 systems, and don't see these issues either.
-5 is EIO, meaning an actual I/O error is reported.
What's the size and type of the cache filesystems? What does "fs getcache report"? What are the afsd parameters? Could these nodes be out of space or inodes for the cache?
It's also possible that the actual disk is having trouble, and/or got
remounted RO. dmesg and/or syslog might have some clues.
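
A minimal sketch of the read-only check, assuming a Linux client where /proc/mounts is available. The function names and the sample mount line are hypothetical; only the device and mount point mirror values that appear in this thread.

```python
from typing import List, Optional


def mount_options(mounts_text: str, mount_point: str) -> Optional[List[str]]:
    """Return the option list for mount_point from /proc/mounts-style text."""
    found = None
    for line in mounts_text.splitlines():
        fields = line.split()
        # Format: device mountpoint fstype options dump pass
        if len(fields) >= 4 and fields[1] == mount_point:
            found = fields[3].split(",")  # a later entry shadows an earlier one
    return found


def is_read_only(mounts_text: str, mount_point: str) -> bool:
    """True if the mount point is listed with the 'ro' option."""
    options = mount_options(mounts_text, mount_point)
    return options is not None and "ro" in options


# Hypothetical sample line, for illustration only.
sample = "/dev/sda3 /var/cache/afs ext4 ro,relatime,data=ordered 0 0\n"
print(is_read_only(sample, "/var/cache/afs"))
```

On a live node the text would come from reading /proc/mounts. A filesystem that ext4 has remounted read-only after an error normally leaves a matching message in dmesg as well.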

(Interestingly enough, we had some changes go by recently on master to make
the error handling for certain cases in this same class more graceful (i.e.,
fail requests but not panic), though those changes are not in 1.6.23.)

-Ben
Martin Flemming
2018-10-26 12:00:16 UTC
Hi, and thanks for the responses!

In the last few days we've had the identical situation with these error messages ...
sometimes all machines started logging them at the same time ...
network traffic is not extremely high ...


The filesystem of the AFS cache is ext4 and its size is 8 GB.

The afsd options are: /usr/vice/etc/afsd -afsdb -dynroot -fakestat

The cacheinfo file, /usr/vice/etc/cacheinfo, contains: /afs:/var/cache/afs:5552000
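
For reference, the three colon-separated cacheinfo fields are the AFS mount point, the cache directory, and the cache size in 1 KiB blocks. The sketch below splits the line and converts the size; the function name parse_cacheinfo is made up for this example, and the 8 GiB partition size is taken from the rounded figure quoted above.

```python
def parse_cacheinfo(line):
    """Split a cacheinfo line into (mount_point, cache_dir, size_in_1k_blocks)."""
    mount_point, cache_dir, blocks = line.strip().split(":")
    return mount_point, cache_dir, int(blocks)


mount_point, cache_dir, blocks = parse_cacheinfo("/afs:/var/cache/afs:5552000")

cache_gib = blocks * 1024 / 2**30        # configured cache limit in GiB
partition_gib = 8                         # assumed from the "8 GB" figure
share = 100 * cache_gib / partition_gib   # configured limit relative to the partition

print(f"{cache_dir}: {cache_gib:.2f} GiB configured, {share:.0f}% of the partition")
```

By this estimate the configured limit leaves a good part of the partition unused, so the cache size setting alone should not fill the filesystem.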

[***@bird070 ~]# fs getcacheparms -excessive
AFS using 88% of cache blocks (4908415 of 5552000 1k blocks)
29% of the cache files (49470 of 173500 files)
afs_cacheFiles: 173500
IFFree: 124030
IFEverUsed: 9551
IFDataMod: 3
IFDirtyPages: 0
IFAnyPages: 0
IFDiscarded: 0
DCentries: 9997
0k- 4K: 267
4k- 16k: 229
16k- 64k: 9061
64k- 256k: 212
256k- 1M: 10
>=1M: 218
[***@bird070 ~]# df -i|grep cache |grep afs
/dev/sda3 512064 173599 338465 34% /var/cache/afs
[***@bird070 ~]# df -h|grep cache |grep afs
/dev/sda3 7.6G 4.7G 2.5G 66% /var/cache/afs

[***@bird058 ~]# fs getcacheparms -excessive
AFS using 86% of cache blocks (4768364 of 5552000 1k blocks)
25% of the cache files (43806 of 173500 files)
afs_cacheFiles: 173500
IFFree: 129694
IFEverUsed: 9929
IFDataMod: 2
IFDirtyPages: 0
IFAnyPages: 0
IFDiscarded: 0
DCentries: 9998
0k- 4K: 5074
4k- 16k: 1639
16k- 64k: 1728
64k- 256k: 440
256k- 1M: 115
>=1M: 1002

[***@bird652 ~]# fs getcacheparms -excessive
AFS using 89% of cache blocks (4917473 of 5552000 1k blocks)
34% of the cache files (58678 of 173500 files)
afs_cacheFiles: 173500
IFFree: 114822
IFEverUsed: 9913
IFDataMod: 0
IFDirtyPages: 0
IFAnyPages: 0
IFDiscarded: 0
DCentries: 9999
0k- 4K: 2372
4k- 16k: 4863
16k- 64k: 2047
64k- 256k: 154
256k- 1M: 78
>=1M: 485
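
The figures above can be cross-checked with a little arithmetic. The sketch below only reuses numbers from the three reports; the tuple layout and the helper name percent are made up for the example. The last check is an inference rather than a documented fact: the CacheItems length in the kernel messages (13880020) matches a 20-byte header plus 80 bytes for each of the 173500 cache files.

```python
# (host, blocks_used, blocks_total, files_used, files_total, if_free)
REPORTS = [
    ("bird070", 4908415, 5552000, 49470, 173500, 124030),
    ("bird058", 4768364, 5552000, 43806, 173500, 129694),
    ("bird652", 4917473, 5552000, 58678, 173500, 114822),
]


def percent(used, total):
    """Percentage rounded to the nearest integer, matching the reported figures."""
    return round(100 * used / total)


for host, blocks_used, blocks_total, files_used, files_total, if_free in REPORTS:
    # Files in use plus IFFree should account for every cache file.
    consistent = files_used + if_free == files_total
    print(host, percent(blocks_used, blocks_total),
          percent(files_used, files_total), consistent)

# Inferred layout of CacheItems: 20-byte header + 80 bytes per cache file.
print(20 + 80 * 173500 == 13880020)
```

None of the three hosts is close to 100% of its cache blocks or cache files, and the df output shows free space and free inodes on /var/cache/afs, so from these numbers the cache does not look exhausted.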
thanks & cheers,

martin
Regards,

Martin Flemming


______________________________________________________
Martin Flemming
DESY / IT office : Building 2b / 008a
Notkestr. 85 phone : 040 - 8998 - 4667
22603 Hamburg mail : ***@desy.de
______________________________________________________
