[OpenAFS] disk cache read error in CacheItems

Discussion:

Martin Flemming

2018-10-23 05:35:55 UTC

Hi !

In the last few days we've observed an increasing number of nodes
that can no longer be reached and have to be rebooted.

In /var/log/messages we see many lines such as:

Oct 22 18:48:26 bird858 kernel: afs: disk cache read error in CacheItems slot 25254 off 2020340/13880020 code -5/80
Oct 22 18:48:26 bird858 kernel: afs: disk cache read error in CacheItems slot 25253 off 2020260/13880020 code -5/80
Oct 22 18:48:26 bird858 kernel: afs: disk cache read error in CacheItems slot 25252 off 2020180/13880020 code -5/80
Oct 22 18:48:26 bird858 kernel: afs: disk cache read error in CacheItems slot 25251 off 2020100/13880020 code -5/80

until nothing happens anymore ...
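
For readers decoding these messages, here is a minimal Python sketch that pulls the numbers out of one log line. The field labels (slot, off, size, code, want) are this sketch's own names and an interpretation, not confirmed OpenAFS terminology. One thing that can be checked from the quoted lines alone: consecutive slots are 80 bytes apart, and each offset equals 20 + slot * 80. The number after the slash in "off", 13880020, likewise equals 20 + 173500 * 80, which matches the afs_cacheFiles value reported later in the thread.

```python
import re

# Field names are labels chosen for this sketch, not official OpenAFS terms.
LINE_RE = re.compile(
    r"disk cache read error in CacheItems "
    r"slot (?P<slot>\d+) off (?P<off>\d+)/(?P<size>\d+) "
    r"code (?P<code>-?\d+)/(?P<want>\d+)"
)


def parse_cache_error(line):
    """Return the numeric fields of one log line as a dict, or None if no match."""
    m = LINE_RE.search(line)
    if m is None:
        return None
    return {key: int(value) for key, value in m.groupdict().items()}


sample = (
    "Oct 22 18:48:26 bird858 kernel: afs: disk cache read error in "
    "CacheItems slot 25254 off 2020340/13880020 code -5/80"
)
rec = parse_cache_error(sample)

# Bytes left over once slot * want is subtracted from the offset;
# read here (as an assumption) as a fixed header before the first slot.
header = rec["off"] - rec["slot"] * rec["want"]
# Number of slots the file would hold under the same assumption.
slots_in_file = (rec["size"] - header) // rec["want"]
print(rec, header, slots_in_file)
```

If that reading is right, the failing reads are ordinary slot-sized reads inside the CacheItems file rather than reads past its end, which would point at the underlying storage rather than at a bad offset.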

The clients run CentOS 7.5, kernel 3.10.0-862.14.4.el7.x86_64, OpenAFS 1.6.23 built 2018-09-12 (***@fnal.gov).

Any hints as to the possible cause?

Thanks & Cheers,

Martin

Andreas Ladanyi

2018-10-23 10:16:28 UTC

Hi Martin,

I have the same setup with the OpenAFS 1.6.23 client from the jsbilling repo.

I don't see these messages in /var/log/messages yet.

regards,

Andy

Stephan Wiesand

2018-10-23 12:14:38 UTC

We're running the same kernel version and the same client build (it's the SL one) on a fair number of SL 7.4 systems, and don't see these issues either.

-5 is EIO, meaning an actual I/O error is reported.
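
As a quick way to confirm that mapping, Python's standard errno and os modules can be used. This is only a sketch: the helper name describe_errno is made up for illustration, and the negative sign is assumed to follow the usual kernel convention of returning -errno.

```python
import errno
import os


def describe_errno(code):
    """Map a kernel-style negative return code to (symbolic name, message)."""
    num = abs(code)
    name = errno.errorcode.get(num, "UNKNOWN")
    return name, os.strerror(num)


name, message = describe_errno(-5)
print(name, message)
```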

What's the size and type of the cache filesystems? What does "fs getcache" report? What are the afsd parameters? Could these nodes be out of space or inodes for the cache?

--
Stephan Wiesand
DESY -DV-
Platanenallee 6
15738 Zeuthen, Germany

Benjamin Kaduk

2018-10-24 02:27:42 UTC

It's also possible that the actual disk is having trouble, and/or got
remounted RO. dmesg and/or syslog might have some clues.
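
One way to check for a read-only remount is to look at the mount options in /proc/mounts. Here is a minimal Python sketch, assuming a Linux client; the helper names and the sample line are illustrative and not taken from the affected nodes.

```python
def mount_options(mounts_text, mount_point):
    """Return the option list for mount_point from /proc/mounts-style text, or None."""
    found = None
    for line in mounts_text.splitlines():
        fields = line.split()
        # Fields: device, mount point, fs type, options, dump, pass
        if len(fields) >= 4 and fields[1] == mount_point:
            found = fields[3].split(",")  # a later entry overrides an earlier one
    return found


def is_read_only(mounts_text, mount_point):
    """True if mount_point is listed and carries the 'ro' option."""
    opts = mount_options(mounts_text, mount_point)
    return opts is not None and "ro" in opts


# Illustrative sample line, not real output from the thread.
sample = "/dev/sda3 /var/cache/afs ext4 ro,relatime,data=ordered 0 0\n"
print(is_read_only(sample, "/var/cache/afs"))

# On a live system the text would come from the kernel:
# with open("/proc/mounts") as fh:
#     print(is_read_only(fh.read(), "/var/cache/afs"))
```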

(Interestingly enough, we had some changes go by recently on master to make
the error handling for certain cases in this same class more graceful (i.e.,
fail requests but not panic), though those changes are not in 1.6.23.)

-Ben

Martin Flemming

2018-10-26 12:00:16 UTC

Hi, and thanks for the responses!

In the last few days we've had the identical situation with these error messages ...
sometimes all machines started logging them at the same time ...
network traffic is not extremely high ...

The filesystem of the AFS cache is ext4 and its size is 8 GB.

The afsd options are: /usr/vice/etc/afsd -afsdb -dynroot -fakestat

The cacheinfo file (/usr/vice/etc/cacheinfo) contains: /afs:/var/cache/afs:5552000
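
For reference, the cacheinfo line has three colon-separated fields: the AFS mount point, the cache directory, and the cache size in 1 KB blocks. A minimal Python sketch of reading it follows; the CacheInfo type and parse_cacheinfo helper are names invented for this example. The configured 5552000 blocks work out to roughly 5.3 GiB, which is about 70% of the 7.6G partition shown by df further down.

```python
from typing import NamedTuple


class CacheInfo(NamedTuple):
    afs_mount: str   # where AFS is mounted, e.g. /afs
    cache_dir: str   # directory holding the disk cache
    blocks_1k: int   # cache size in 1 KB blocks


def parse_cacheinfo(text):
    """Parse a single 'mount:cachedir:blocks' cacheinfo line."""
    mount, cache_dir, blocks = text.strip().split(":")
    return CacheInfo(mount, cache_dir, int(blocks))


info = parse_cacheinfo("/afs:/var/cache/afs:5552000")
gib = info.blocks_1k * 1024 / 2**30  # convert 1 KB blocks to GiB
print(info, round(gib, 2))
```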

[***@bird070 ~]# fs getcacheparms -excessive
AFS using 88% of cache blocks (4908415 of 5552000 1k blocks)
29% of the cache files (49470 of 173500 files)
afs_cacheFiles: 173500
IFFree: 124030
IFEverUsed: 9551
IFDataMod: 3
IFDirtyPages: 0
IFAnyPages: 0
IFDiscarded: 0
DCentries: 9997
0k- 4K: 267
4k- 16k: 229
16k- 64k: 9061
64k- 256k: 212
256k- 1M: 10
>=1M: 218

[***@bird070 ~]# df -i|grep cache |grep afs
/dev/sda3 512064 173599 338465 34% /var/cache/afs
[***@bird070 ~]# df -h|grep cache |grep afs
/dev/sda3 7.6G 4.7G 2.5G 66% /var/cache/afs
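
The same space and inode check can be scripted across many nodes. Here is a minimal Python sketch using os.statvfs; the fs_usage helper is a made-up name, and the fractions it returns are computed slightly differently from df's Use% column, so small deviations are expected.

```python
import os


def fs_usage(path):
    """Return the fraction of blocks and inodes not available to unprivileged users."""
    st = os.statvfs(path)
    # Guard against filesystems that report zero totals (e.g. no fixed inode count).
    blocks_used = 1 - st.f_bavail / st.f_blocks if st.f_blocks else 0.0
    inodes_used = 1 - st.f_favail / st.f_files if st.f_files else 0.0
    return {"blocks_used": blocks_used, "inodes_used": inodes_used}


# "/" is used so the sketch runs anywhere; on a client the path would be
# the cache directory, /var/cache/afs in this thread.
usage = fs_usage("/")
print(usage)
```

Values close to 1.0 for either entry would suggest the partition is running out of space or inodes.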

[***@bird058 ~]# fs getcacheparms -excessive
AFS using 86% of cache blocks (4768364 of 5552000 1k blocks)
25% of the cache files (43806 of 173500 files)
afs_cacheFiles: 173500
IFFree: 129694
IFEverUsed: 9929
IFDataMod: 2
IFDirtyPages: 0
IFAnyPages: 0
IFDiscarded: 0
DCentries: 9998
0k- 4K: 5074
4k- 16k: 1639
16k- 64k: 1728
64k- 256k: 440
256k- 1M: 115
>=1M: 1002

[***@bird652 ~]# fs getcacheparms -excessive
AFS using 89% of cache blocks (4917473 of 5552000 1k blocks)
34% of the cache files (58678 of 173500 files)
afs_cacheFiles: 173500
IFFree: 114822
IFEverUsed: 9913
IFDataMod: 0
IFDirtyPages: 0
IFAnyPages: 0
IFDiscarded: 0
DCentries: 9999
0k- 4K: 2372
4k- 16k: 4863
16k- 64k: 2047
64k- 256k: 154
256k- 1M: 78
>=1M: 485

thanks & cheers,

martin

Regards

Martin Flemming

______________________________________________________
Martin Flemming
DESY / IT office : Building 2b / 008a
Notkestr. 85 phone : 040 - 8998 - 4667
22603 Hamburg mail : ***@desy.de
______________________________________________________