[OpenAFS] Tracing VLDB queries

Discussion:

Robert Milkowski

2018-12-03 11:56:23 UTC

Hi,

Just a teaser:

# dtrace -q -n vlserver*:::GetEntry-done \

'{printf("%Y client: %s volname: %s rc: %s\n", walltimestamp, \

args[1]->ci_remote, args[3]->volname, afs_errorstr[args[0]]);}'

2018 Nov 19 15:55:15 client: 10.170.57.130 volname: zz.vldb.13 rc: OK

2018 Nov 19 15:55:28 client: 10.170.57.130 volname: ms.dist rc: OK

2018 Nov 19 15:55:57 client: 10.170.57.130 volname: ms.distXX rc: VL_NOENT

...

This is a probe provided by vlserver directly. Obviously one can achieve the
same by using PID provider, but it gets more complicated and does require
understanding of the code and is more involved.

This works on Solaris and should work on FreeBSD as well. It shouldn't be
hard to get it working with SystemTAP on Linux either (although looks like
Linux will be going with ebpf in the future).

Best regards,

Robert Milkowski

Jeffrey Altman

2018-12-03 14:51:02 UTC

Permalink

Hi,
Â
Â
# dtrace -q -n vlserver*:::*GetEntry-done*Â \
Â Â '{printf("%Y client: %s volname: %s rc: %s\n", walltimestamp, \
Â Â Â Â args[1]->ci_remote, args[3]->volname, afs_errorstr[args[0]]);}'
Â
2018 Nov 19 15:55:15 client: 10.170.57.130 volname: zz.vldb.13 rc: OK
2018 Nov 19 15:55:28 client: 10.170.57.130 volname: ms.dist rc: OK
2018 Nov 19 15:55:57 client: 10.170.57.130 volname: ms.distXX rc: VL_NOENT
...
Â
This is a probe provided by vlserver directly. Obviously one can achieve
the same by using PID provider, but it gets more complicated and does
require understanding of the code and is more involved.
This works on Solaris and should work on FreeBSD as well. It shouldnât
be hard to get it working with SystemTAP on Linux either (although looks
like Linux will be going with ebpf in the future).

Hi Robert,

Once trace points are added tracing can be used to answer all sorts of
questions. The specific output from your example is very similar to the
data that is collected by the baked in audit infrastructure. The
following is output from AuriStorFS vlserver:

Mon Dec 03 06:25:30 2018 [71] EVENT AFS_VL_GetEntByN CODE 363524 NAME
--UnAuth-- HOST [204.29.154.74]:7001 STR symbols

Mon Dec 03 06:35:08 2018 [71] EVENT AFS_VL_GetEntByN CODE 0 NAME
--UnAuth-- HOST [204.29.154.72]:7001 STR 536872388

Mon Dec 03 06:35:10 2018 [71] EVENT AFS_VL_GetEntByN CODE 0 NAME
--UnAuth-- HOST [204.29.154.72]:7001 STR 536872388

Mon Dec 03 06:36:05 2018 [71] EVENT AFS_VL_GetEntByN CODE 0 NAME
--UnAuth-- HOST [2604:2000:1741:a019:6d77:8346:dab0:49c0]:7001 STR root.cell

Mon Dec 03 06:36:05 2018 [71] EVENT AFS_VL_GetEntByN CODE 0 NAME
--UnAuth-- HOST [2604:2000:1741:a019:6d77:8346:dab0:49c0]:7001 STR
root.public

In OpenAFS the audit infrastructure can be enabled per-service and its
output can be set to files, named pipes, syslog, Linux message queues
and on AIX its integrated with the OS auditing system.

Jeffrey Altman

Robert Milkowski

2018-12-03 15:02:55 UTC

Permalink

Once trace points are added tracing can be used to answer all sorts of questions.

Yes. We've been tracing such information (and others) using DTrace/PID provider for years, however having those more abstract probes is going to make it easier for everyone,
especially in cases where you are not debugging a bug in AFS itself but rather debugging some peculiar client(s) behaviour, etc.

The specific output from your example is very similar to the
data that is collected by the baked in audit infrastructure. The

[...]

In OpenAFS the audit infrastructure can be enabled per-service and its
output can be set to files, named pipes, syslog, Linux message queues
and on AIX its integrated with the OS auditing system.

Yes, for this specific example you are right.
However the new probes allow you to easily answer many other questions, like what's the distribution of how long queries take (and then per client, per volume, etc.),
or correlate queries to I/O to disks or network traffic, or CPU, etc.