Discussion:
[OpenAFS] accessing /afs processes go into device wait
John Sopko
2018-11-08 17:22:49 UTC
I have been running two legacy Red Hat 6.x web servers for several
years. Over the last few days the Apache httpd processes on one of the
servers started going into device wait state. The other server is fine,
and both are configured pretty much the same. I tracked this down to the
web server trying to stat /afs/.htaccess. If I run ls in /afs or
cat /afs/.htaccess (which does not exist), the commands first go into
device wait state and take a long time to complete: it can take several
minutes, or they may hang indefinitely. The AFS file system itself
seems to be working fine; only accesses directly under /afs are a
problem. On other Red Hat 6.x systems, accessing /afs is fast and
causes no problems.

I am running afsd with:

/usr/vice/etc/afsd -dynroot -fakestat-all -afsdb

Note: I tried -fakestat-all to see if that would help; I had been
running with just -fakestat. Our db servers have AFSDB records.

I removed all cells except our own from CellServDB, so I only have this:

% pwd
/afs

% ls -l
total 4
lrwxr-xr-x 1 root root 10 Dec 31 1969 cs -> cs.unc.edu/
drwxr-xr-x 8 root root 2048 Mar 6 2015 cs.unc.edu/
lrwxr-xr-x 1 root root 10 Dec 31 1969 unc -> cs.unc.edu/

I re-formatted the /usr/vice/cache partition and that did not help.

I cannot find any hardware problems, and there are no clues in the
syslog or on the console. The system disk, including the cache, is on a
RAID 1 mirror. This is a Dell server and I run Dell OpenManage, which is
really good at reporting system and especially disk errors.

I am running the same afsd version on our remaining RHEL 6.x servers:

% fs version
openafs 1.6.22.2

Distributor ID: RedHatEnterpriseWorkstation
Release: 6.10

The problem is intermittent, but the commands go into device wait most
of the time. For example, the first run below was fine and the second
took 14.96 seconds.

% time ls -l
total 4
lrwxr-xr-x 1 root root 10 Dec 31 1969 cs -> cs.unc.edu
drwxr-xr-x 8 root root 2048 Mar 6 2015 cs.unc.edu
lrwxr-xr-x 1 root root 10 Dec 31 1969 unc -> cs.unc.edu
0.000u 0.000s 0:00.00 0.0% 0+0k 0+0io 0pf+0w

% time ls -l
total 4
lrwxr-xr-x 1 root root 10 Dec 31 1969 cs -> cs.unc.edu
drwxr-xr-x 8 root root 2048 Mar 6 2015 cs.unc.edu
lrwxr-xr-x 1 root root 10 Dec 31 1969 unc -> cs.unc.edu
0.000u 0.000s 0:14.96 0.0% 0+0k 0+0io 0pf+0w
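
The intermittent stall can be sampled with a small script instead of repeated manual `time ls -l` runs. The following is a minimal Python sketch, not something taken from this setup: the function names, the 1-second threshold, and the sample count are illustrative assumptions. A call stuck in device wait cannot be interrupted, so the script may block for as long as the underlying stat does.

```python
import os
import time


def time_stat(path):
    """Return (elapsed_seconds, exists) for a single os.stat() call on path."""
    start = time.monotonic()
    try:
        os.stat(path)
        exists = True
    except OSError:
        # A missing file (such as /afs/.htaccess) still costs a full lookup.
        exists = False
    return time.monotonic() - start, exists


def slow_samples(path, attempts=10, threshold=1.0):
    """Stat path repeatedly and return the elapsed times above threshold."""
    slow = []
    for _ in range(attempts):
        elapsed, _exists = time_stat(path)
        if elapsed > threshold:
            slow.append(elapsed)
    return slow


if __name__ == "__main__":
    # Paths taken from the report above; adjust as needed.
    for path in ("/afs", "/afs/.htaccess"):
        print(path, slow_samples(path, attempts=5))
```

Running it on both the affected and an unaffected machine gives a rough side-by-side of how often a lookup directly under /afs exceeds the threshold.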

Thanks for any help or ideas to try.
--
John W. Sopko Jr.
University of North Carolina
Computer Science Dept CB 3175
Chapel Hill, NC 27599-3175

Fred Brooks Building; Room 140
Computer Services Systems Specialist
email: sopko AT cs.unc.edu
phone: 919-590-6144
Stephan Wiesand
2018-11-08 17:52:54 UTC
Are the nsswitch and DNS resolver configurations the same on all systems?
Any differences in network restrictions?
Does it help to run afsd without -afsdb?

Just a wild guess,
Stephan
--
Stephan Wiesand
DESY -DV-
Platanenallee 6
15738 Zeuthen, Germany
John Sopko
2018-11-08 18:48:08 UTC
The nsswitch and DNS configurations are the same, and the AFSDB records
resolve fine. The /afs/cs.unc.edu cell works fine; only /afs itself does not.
Stephan Wiesand
2018-11-08 18:59:18 UTC
Have you tried w/o -afsdb?
John Sopko
2018-11-08 19:41:07 UTC
Wow! Removing -afsdb and adding our db servers to CellServDB seems
to have fixed the problem. That does not make sense to me: this machine
and others have been running for many years with -afsdb, and
fs listcells works when -afsdb is used:

% fs listcells
Cell dynroot on hosts.
Cell cs.unc.edu on hosts toucan.cs.unc.edu quail.cs.unc.edu kiwi.cs.unc.edu.

% host -t AFSDB cs.unc.edu
cs.unc.edu has AFSDB record 1 kiwi.cs.unc.edu.
cs.unc.edu has AFSDB record 1 quail.cs.unc.edu.
cs.unc.edu has AFSDB record 1 toucan.cs.unc.edu.

Thanks for the help. Is this a known issue?
Stephan Wiesand
2018-11-08 19:53:54 UTC
My guess is that attempting to retrieve SRV and then AFSDB DNS
records for an "htaccess" top level domain is very slow to fail
on the problematic system for some reason.

I think it's kind of a known issue which has crept up in the past
for things like ".trash" as well.

You could probably find out where things get stuck by comparing
tcpdump outputs.

- Stephan
Jeffrey Altman
2018-11-08 20:42:01 UTC
On 11/8/2018 12:22 PM, John Sopko wrote:
/usr/vice/etc/afsd -dynroot -fakestat-all -afsdb
-dynroot

Do not mount a root.afs volume. Instead, populate the /afs directory
with the results of cell lookups.

-afsdb

If the requested name does not match a cell found in the CellServDB
file, query DNS first for SRV records and, if there is no match, then
for AFSDB records.

Note that the default RHEL 6 configuration of the DNS resolver does not
cache negative DNS results.

An attempt to open /afs/.htaccess therefore results in DNS queries for
"htaccess" plus whatever domains are in the search list. If the search
list is cs.unc.edu and unc.edu, then each access triggers the
following DNS queries:

SRV _afs3-vlserver._udp.htaccess.cs.unc.edu
SRV _afs3-vlserver._udp.htaccess.unc.edu
AFSDB htaccess.cs.unc.edu
AFSDB htaccess.unc.edu

You can add a dummy htaccess.cs.unc.edu entry to CellServDB. You can
add a blacklist for that name. You can stop using -afsdb or you can
stop using -dynroot and rely upon a locally managed root.afs volume.
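
The query pattern described above can be enumerated with a short Python sketch. It only builds the query names and sends no DNS traffic. The function name and the search list passed in are illustrative assumptions, and the exact set and order of queries a real client issues may differ.

```python
def afsdb_lookup_queries(name, search_domains):
    """Return (record_type, query_name) pairs for one failed cell lookup.

    SRV queries for every search domain come first, followed by the
    AFSDB queries, mirroring the order described in the post above.
    """
    fqdns = [f"{name}.{domain}" for domain in search_domains]
    srv = [("SRV", f"_afs3-vlserver._udp.{fqdn}") for fqdn in fqdns]
    afsdb = [("AFSDB", fqdn) for fqdn in fqdns]
    return srv + afsdb


if __name__ == "__main__":
    # Search list taken from the example in the post above.
    for rtype, qname in afsdb_lookup_queries("htaccess", ["cs.unc.edu", "unc.edu"]):
        print(rtype, qname)
```

With two search domains, every access to the missing name costs four queries, and without negative caching none of the failures are remembered between accesses.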

Jeffrey Altman
John Sopko
2018-11-09 15:45:39 UTC
Thanks for the explanation. I never had this issue in years of running
this setup; my guess is that more .htaccess files are now being created
and accessed in AFS. From what I found after some research, when a
.htaccess file is encountered the server traverses up the file system
looking for .htaccess files in all parent directories. By default
Apache configures / with "AllowOverride None", which tells the server
that .htaccess is not allowed there and not to traverse. I added /afs
and our cell as shown below, since there is no need to look for
.htaccess in these top-level directories.


# Each directory to which Apache has access can be configured with respect
# to which services and features are allowed and/or disabled in that
# directory (and its subdirectories).
#
# First, we configure the "default" to be a very restrictive set of
# features.
#
<Directory />
    Options FollowSymLinks
    AllowOverride None
</Directory>

<Directory /afs>
    AllowOverride None
</Directory>
<Directory /afs/cs.unc.edu>
    AllowOverride None
</Directory>
<Directory /afs/.cs.unc.edu>
    AllowOverride None
</Directory>
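
A simplified model of the .htaccess traversal described above, as a Python sketch. It assumes only plain <Directory> sections, where the most specific matching section decides AllowOverride. It is not Apache's actual directory-walk implementation, and the function name, the `default` parameter, and the /afs/cs.unc.edu/www path are hypothetical.

```python
import posixpath


def htaccess_candidates(target_dir, allow_override, default=False):
    """List the .htaccess paths that would be consulted for target_dir.

    allow_override maps a normalized directory path to True/False and
    stands in for <Directory> sections; the longest matching prefix wins.
    default applies when no section matches at all (an assumption here).
    """
    # Build the chain of directories from "/" down to target_dir.
    chain = ["/"]
    for part in posixpath.normpath(target_dir).split("/"):
        if part:
            chain.append(posixpath.join(chain[-1], part))

    def allowed(directory):
        matches = [
            d for d in allow_override
            if d == "/" or directory == d or directory.startswith(d + "/")
        ]
        if not matches:
            return default
        return allow_override[max(matches, key=len)]

    return [posixpath.join(d, ".htaccess") for d in chain if allowed(d)]


if __name__ == "__main__":
    rules = {
        "/": False,
        "/afs": False,
        "/afs/cs.unc.edu": False,
        "/afs/cs.unc.edu/www": True,  # hypothetical content directory
    }
    for path in htaccess_candidates("/afs/cs.unc.edu/www/site", rules):
        print(path)
```

In this model, no .htaccess lookup reaches /afs itself once the top-level directories are set to disallow overrides; only the directories where overrides are explicitly enabled are probed.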