[OpenAFS] Does vos release/volume replication always initiates data transfer from RW site?

Discussion:

Ximeng Guan

2018-08-06 03:58:03 UTC

Hello,

We have one cell covering two sites. The WAN bandwidth between the two sites is relatively low, so we use volume replication to speed up access.

The replicated volumes are often large, so replicating them to the remote site has a cost that cannot be neglected.

Now, with RW volumes at site A and their RO replicas on servers at site B, we want to bring up a new file server at site B to balance the load. In other words, we would like to "offload" a majority of the RO volumes from one server to a different server at site B, without touching their RW masters at site A.

The standard way to do this is to use "vos addsite" and "vos release" to create new RO copies on the new server at site B. When we try this, "vos release" always seems to initiate a fresh, full transfer of the volume data from the RW server at site A to the new server at site B, even though RO copies already exist on other servers at site B.

I wonder if there is a way to transfer those RO volumes directly between servers at site B, without breaking data integrity among the RO sites or affecting the atomicity of "vos release".

Or, to ask a greedier question: Is it possible to make the data transfer part of "vos release" smarter, by letting it select the closest path for copying data between file servers, according to a network distance either detected by itself or provided by the administrator? For example, once it finds that the RW volume has not changed since the last release, a "vos release" to a new site could search all the RO copies in the VLDB, find the RO site closest to the new server, and initiate the data transfer from there instead of from the RW site. Even if the RW volume has changed, maybe it would still be much cheaper to copy the existing RO volume from the closest site first, and then send the incremental changes from the RW site?

Thanks.

Ximeng (Simon) Guan

Royole Corporation

Jeffrey Altman

2018-08-06 12:28:17 UTC

Post by Ximeng Guan
Hello,
We have one cell covering two sites. The WAN bandwidth between the two
sites is relatively low, so we use volume replication to speed up the
access.
Those replicated volumes are often large in size. So replication to the
remote site is an operation whose cost cannot be neglected.
Now with RW volumes at site A and their RO replication on servers at
site B, we want to bring up a new file server at site B to balance the
load. In other words we would like to "offload" a majority of the RO
volumes from one server to a different server at Site B, without
touching their RW masters at Site A.
[...]
I wonder if there is a way to directly transfer those RO volumes btw
servers at site B, without breaking the data integrity among the RO
sites or affecting the atomicity of "vos release".

AuriStorFS supports the desired functionality including the ability to
copy and move readonly sites between file servers or vice partitions
attached to the same file server.

https://www.auristor.com/openafs/migrate-to-auristor/

OpenAFS does not contain explicit functionality but it is possible using

vos dump
vos restore -id -readonly
vos addsite -valid

to achieve similar results. From the source server use "vos dump" to
generate a dump stream of the readonly volume you wish to replicate.
Pipe the output to "vos restore" specifying the destination server,
partition, the readonly volume id and the -readonly flag to specify the
volume type. Finally, use "vos addsite" with the -valid flag to update
the location service entry for the volume. The -valid flag indicates
that the readonly volume data is known to be present and consistent with
other sites. Note that the -valid switch will not mark a site as "new"
if a "vos release" failed to update one or more sites.

Be careful to use publicly visible addresses when executing these commands.
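
As a rough illustration of the three steps above, here is a dry-run sketch
that only prints the commands instead of running them. The function name
plan_ro_copy, the volume name, the volume ID and the host names are all
hypothetical placeholders. The value passed to "vos restore -name" (the
base volume name) is an assumption; check the vos restore documentation
for your release before running anything like this for real.

```shell
#!/bin/sh
# Dry-run sketch: print (do not execute) the commands that copy an existing
# RO replica to a new server and register it in the VLDB.
# Every argument value used below is a hypothetical placeholder.
plan_ro_copy() {
    vol=$1        # base volume name (assumed to be what -name expects)
    ro_id=$2      # numeric ID of the existing readonly volume
    src=$3        # server that already holds the RO copy
    src_part=$4   # partition on the source server
    dst=$5        # new server
    dst_part=$6   # partition on the new server

    # Steps 1 and 2: full dump (-time 0) of the RO volume, piped straight
    # into a restore that keeps the RO volume ID and the readonly type.
    echo "vos dump -id $ro_id -server $src -partition $src_part -time 0 | vos restore -server $dst -partition $dst_part -name $vol -id $ro_id -readonly"

    # Step 3: add the new site to the VLDB entry, marked as already valid.
    echo "vos addsite -server $dst -partition $dst_part -id $vol -valid"
}

plan_ro_copy proj.data 536870913 fs1.siteb.example.com a fs2.siteb.example.com a
```

Printing the commands first makes it easy to review the server names
(publicly visible addresses, per the note above) before anything touches
the VLDB.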

Jeffrey Altman

Todd Lewis

2018-08-06 12:44:57 UTC

Given the two-site scenario, and successful manual replication as
outlined by Jeffrey, you then have two RO volumes at site B as desired.

If "vos release" is enough of a problem to warrant this manual
intervention, then won't subsequent releases from the RW site A now
cost twice as much, and therefore justify (?) removing the 2nd replica
prior to a release, then doing the release, followed by a repeat of the
dump/restore/addsite process to recreate the 2nd replica? This can be
scripted, but it's a balance between the extra work for the admin vs.
for the machines/network.
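
One possible shape for such a script, as a dry-run sketch that only prints
the sequence: drop the second replica, release over the WAN once, then
rebuild the second replica locally. The function name plan_release_cycle
and all argument values are hypothetical, the "vos restore -name" value is
an assumption, and a real script would need error checking after each
step.

```shell
#!/bin/sh
# Dry-run sketch of the remove / release / recreate cycle.
# Prints the commands in order; nothing is executed.
plan_release_cycle() {
    vol=$1          # base (RW) volume name
    ro_id=$2        # numeric ID of the readonly volume
    keep=$3         # site-B server that keeps its replica through the release
    keep_part=$4    # its partition
    extra=$5        # site-B server whose replica is dropped and rebuilt
    extra_part=$6   # its partition

    # 1. Remove the second replica so the release crosses the WAN only once.
    echo "vos remove -server $extra -partition $extra_part -id $vol.readonly"
    # 2. Release from the RW site to the remaining RO site(s).
    echo "vos release -id $vol"
    # 3. Rebuild the second replica from the freshly released local copy.
    echo "vos dump -id $ro_id -server $keep -partition $keep_part -time 0 | vos restore -server $extra -partition $extra_part -name $vol -id $ro_id -readonly"
    # 4. Register the rebuilt replica as a valid site.
    echo "vos addsite -server $extra -partition $extra_part -id $vol -valid"
}

plan_release_cycle proj.data 536870913 fs1.siteb.example.com a fs2.siteb.example.com a
```

Between steps 1 and 4 the second server holds no replica, so clients at
site B depend on the first server alone for that window.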
--
Todd Lewis

Ximeng Guan

2018-08-06 16:12:09 UTC

Indeed, if we have two RO replicas at site B and the RW at site A, subsequent releases will cost twice as much, unless "vos release" can be made even smarter, such that data is "relayed" among RO sites instead of being broadcast from the RW site.

I guess that touches some fundamental part of "vos release" which is not trivial: How to optimize the data propagation path for different network scenarios? How does that affect the integrity among different replicas?

Thank you both for the practical advice.

Best regards,
Ximeng (Simon) Guan

_______________________________________________
OpenAFS-info mailing list
OpenAFS-***@openafs.org
https://lists.openafs.org/mailman/listinfo/openafs-info

Jeffrey Altman

2018-08-06 17:28:35 UTC

Post by Ximeng Guan
Indeed, if we have two RO replicas at site B and the RW at site A, subsequent releases will cost twice as much, unless "vos release" can be made even smarter, such that data is "relayed" among RO sites instead of being broadcast from the RW site.
I guess that touches some fundamental part of "vos release" which is not trivial: How to optimize the data propagation path for different network scenarios? How does that affect the integrity among different replicas?

In OpenAFS, all of the "smarts" are in the vos process. This process
has no knowledge of the network topology, which is why OpenAFS has
difficulties in network environments that lack full end-to-end
connectivity between all peers.

In theory, someone could write a volserver topology map that could be
provided to "vos" as input. It could be used in conjunction with the
volume site list from the location service to decide how to replicate
the volumes.
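
To make the idea concrete, here is a minimal sketch of how such a map
might be consulted. Nothing like this exists in OpenAFS: the map file
format (one "<from> <to> <cost>" line per server pair), the function name
closest_source and the host names are all hypothetical. Given a
destination server and the list of servers already holding a current
copy (as would be read from the location service), it picks the
lowest-cost source.

```shell
#!/bin/sh
# Hypothetical topology map: one line per server pair, "<from> <to> <cost>".
# closest_source MAP DEST CANDIDATE... prints the candidate with the lowest
# cost to DEST. Candidates missing from the map are ignored.
closest_source() {
    map_file=$1
    dst=$2
    shift 2
    for cand in "$@"; do
        # Emit "<cost> <candidate>" for each matching map entry.
        awk -v from="$cand" -v to="$dst" \
            '$1 == from && $2 == to { print $3, $1 }' "$map_file"
    done | sort -n | head -n 1 | awk '{ print $2 }'
}

# Usage example with a throwaway map file.
map=$(mktemp)
cat > "$map" <<'EOF'
fs1.sitea.example.com fs2.siteb.example.com 100
fs1.siteb.example.com fs2.siteb.example.com 1
EOF
closest_source "$map" fs2.siteb.example.com \
    fs1.sitea.example.com fs1.siteb.example.com   # prints fs1.siteb.example.com
rm -f "$map"
```

A wrapper could feed the chosen server into a dump/restore pipeline, but
deciding when an RO copy is current enough to act as a source is the hard
part and is not addressed here.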

The OpenAFS location service doesn't provide clients (cache managers)
any locality information and the Unix cache manager does not rank volume
sites based upon performance characteristics. How are you ensuring that
clients contact the local fileserver?

Jeffrey Altman

Ximeng Guan

2018-08-06 18:24:54 UTC

For the last question, we instruct the administrator of each client to use "fs setserverprefs" to set preferences. The server proximity is known a priori. Our IT department tries to arrange the IP addresses so that clients are in the same subnet as their closest servers, and different sites use different subnets. In cases where IP similarity does not fully represent network distance, "fs setserverprefs" is used to manually override the preference.
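
For reference, a minimal sketch of the kind of override meant here, again
as a dry run that prints the command rather than running it. The function
name plan_server_prefs, the host names and the rank values are
hypothetical; the only property relied on is that the cache manager
prefers the file server with the lower rank.

```shell
#!/bin/sh
# Dry-run sketch: print an "fs setserverprefs" command that ranks the local
# file server ahead of the remote one (lower rank = more preferred).
# Host names and rank values are illustrative placeholders.
plan_server_prefs() {
    local_fs=$1    # file server at the client's own site
    remote_fs=$2   # file server across the WAN
    echo "fs setserverprefs -servers $local_fs 20000 $remote_fs 40000"
}

plan_server_prefs fs1.siteb.example.com fs1.sitea.example.com
```

The resulting ranks can be inspected with "fs getserverprefs". Ranks set
this way live in the running cache manager, so they typically have to be
reapplied from a startup script after a client reboot.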
