Lease Management In A Distributed File System

Coordinating Client Write Requests with Lease Mechanisms In Golang

What is a Lease?

A lease is a contract where one party gives another party access to a resource for a specific period. After this period expires, the party loses access to the resource.

In computing, a lease is a time-limited contract granting specific rights to a resource, serving as an alternative to a lock for managing access.

Quite simple, right? 🫠 Let's take a look.

Scenario

A user wants to upload an image to a single server, which is a write request. Writing a file to a single server is fine as long as there is no race condition on that server. This is why locks are used for synchronization.

Let's add a second server and imagine what happens. Users can now write to either server using the same approach, so the original problem remains and a new one appears: how does a user retrieve a file without knowing which server holds it? One might argue that each server could return an ID for each written file, but what if we have several servers? Then the location of each file ID must be stored too. How do we coordinate writes from several clients? And if any server goes down, we are in trouble 🔥. No user would like that.

Can we make this better?

Yes. Let's assume we have three servers: Server A, Server B, and Server C. A client wants to write data to a file, so it sends a write request to Server A. Before Server A can perform the write, it must first check its lease status. If it currently holds the lease for the file, it proceeds with the write; otherwise, it acquires a lease from the lease manager, which is a separate server.

The lease manager grants Server A a lease for a specific duration (e.g., 30 seconds), allowing it to perform write operations exclusively on the file.
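To make the grant step concrete, here is a minimal sketch of a lease manager that hands out at most one unexpired lease per file. It is not the code from my repository: `LeaseManager`, `Acquire`, `defaultTTL`, and `epoch` are names invented for this illustration, and the clock is passed in explicitly so the behaviour is easy to follow.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// defaultTTL and epoch are illustrative values, not taken from the real system.
const defaultTTL = 30 * time.Second

var epoch = time.Date(2024, 1, 1, 0, 0, 0, 0, time.UTC)

// lease records which server may write to a file and until when.
type lease struct {
	holder string
	expire time.Time
}

// LeaseManager hands out at most one unexpired lease per file.
type LeaseManager struct {
	mu     sync.Mutex
	ttl    time.Duration
	leases map[string]lease // keyed by file name
}

func NewLeaseManager(ttl time.Duration) *LeaseManager {
	return &LeaseManager{ttl: ttl, leases: make(map[string]lease)}
}

// Acquire grants the lease on file to server unless a different server
// still holds an unexpired lease. It returns the expiry time of the
// current lease and whether server is the holder.
func (m *LeaseManager) Acquire(file, server string, now time.Time) (time.Time, bool) {
	m.mu.Lock()
	defer m.mu.Unlock()

	cur, held := m.leases[file]
	if held && now.Before(cur.expire) {
		// The lease is still valid: only its current holder may use it.
		return cur.expire, cur.holder == server
	}
	// No lease yet, or the old one has expired: grant a fresh one.
	l := lease{holder: server, expire: now.Add(m.ttl)}
	m.leases[file] = l
	return l.expire, true
}

func main() {
	m := NewLeaseManager(defaultTTL)
	_, okA := m.Acquire("image.png", "ServerA", epoch)
	_, okB := m.Acquire("image.png", "ServerB", epoch.Add(10*time.Second))
	_, okLater := m.Acquire("image.png", "ServerB", epoch.Add(31*time.Second))
	fmt.Println(okA, okB, okLater) // true false true
}
```

Server B is refused while Server A's lease is valid and only succeeds once the 30 seconds have passed, which is the exclusivity described above.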

Server A starts writing the data to its local storage. Concurrently, Server A replicates the write operation to Server B and Server C to ensure data redundancy.

Server A keeps track of the lease duration. If the lease is about to expire and the write operation is not yet completed, it must renew the lease from the lease manager. If the lease manager grants the renewal, Server A can continue the write operation. If not, Server A must abort the operation and inform the client about the lease expiration. While Server A aborts its operation, it notifies Server B and Server C about the lease expiration.
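The renew-or-abort logic can be sketched as follows. This is a simplified simulation under my own assumptions: time advances by a fixed amount per chunk, the renewal margin is 5 seconds, and `needsRenewal`, `renew`, and `writeAll` are illustrative helpers rather than functions from the actual code base.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// Illustrative values: a 30 second lease, renewed when under 5 seconds remain.
const (
	leaseTTL    = 30 * time.Second
	renewMargin = 5 * time.Second
)

var (
	epoch           = time.Date(2024, 1, 1, 0, 0, 0, 0, time.UTC)
	errLeaseExpired = errors.New("lease expired before the write completed")
)

type lease struct {
	holder string
	expire time.Time
}

// needsRenewal reports whether less than renewMargin remains on the lease.
func needsRenewal(l lease, now time.Time) bool {
	return l.expire.Sub(now) < renewMargin
}

// renew plays the lease manager's role: it extends the lease only if
// server still holds it and it has not already expired.
func renew(l lease, server string, now time.Time) (lease, bool) {
	if l.holder != server || !now.Before(l.expire) {
		return l, false
	}
	l.expire = now.Add(leaseTTL)
	return l, true
}

// writeAll writes chunks one at a time, renewing the lease when it is
// about to expire. It aborts with errLeaseExpired if a renewal is refused
// and returns how many chunks were written.
func writeAll(l lease, server string, chunks int, perChunk time.Duration, start time.Time) (int, error) {
	now := start
	for written := 0; written < chunks; written++ {
		if needsRenewal(l, now) {
			var ok bool
			if l, ok = renew(l, server, now); !ok {
				return written, errLeaseExpired
			}
		}
		now = now.Add(perChunk) // simulated time spent writing one chunk
	}
	return chunks, nil
}

func main() {
	l := lease{holder: "ServerA", expire: epoch.Add(leaseTTL)}

	n, err := writeAll(l, "ServerA", 5, 9*time.Second, epoch)
	fmt.Println(n, err) // 5 <nil>

	n, err = writeAll(l, "ServerA", 5, 40*time.Second, epoch)
	fmt.Println(n, err) // 1 lease expired before the write completed
}
```

In the first run the lease is renewed once, at the 27 second mark, and the write finishes. In the second run a single chunk takes longer than the lease itself, so the renewal arrives too late and the write is aborted.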

In case there is a failure on Server A, Server B or Server C can now request a lease from the lease manager to become the new primary server for the file. The client is informed about the new primary server (Server B) to direct its future write requests accordingly.
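One way to picture the failover step is the sketch below. It assumes a simple policy that I chose for illustration: the lease manager promotes the first secondary, and only after the old lease has expired, so that two primaries never hold the lease at the same time. The `failover` function and the trimmed-down `Lease` type are hypothetical.

```go
package main

import (
	"fmt"
	"time"
)

const leaseTTL = 30 * time.Second // illustrative lease duration

var epoch = time.Date(2024, 1, 1, 0, 0, 0, 0, time.UTC)

// Lease is a trimmed-down version of the lease type used in this article.
type Lease struct {
	Primary     string
	Secondaries []string
	Expire      time.Time
}

// failover promotes the first secondary to primary once the old lease has
// expired. It returns the old lease and false if the lease is still valid
// or there is no secondary to promote.
func failover(old Lease, now time.Time, ttl time.Duration) (Lease, bool) {
	if now.Before(old.Expire) || len(old.Secondaries) == 0 {
		return old, false
	}
	return Lease{
		Primary:     old.Secondaries[0],
		Secondaries: append([]string(nil), old.Secondaries[1:]...),
		Expire:      now.Add(ttl),
	}, true
}

func main() {
	old := Lease{
		Primary:     "ServerA",
		Secondaries: []string{"ServerB", "ServerC"},
		Expire:      epoch.Add(leaseTTL),
	}

	_, ok := failover(old, epoch.Add(10*time.Second), leaseTTL)
	fmt.Println(ok) // false: Server A's lease is still valid

	next, ok := failover(old, epoch.Add(31*time.Second), leaseTTL)
	fmt.Println(ok, next.Primary, next.Secondaries) // true ServerB [ServerC]
}
```

The failed primary is dropped from the new lease here; re-adding it as a secondary once it recovers is left out of the sketch.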

Let's see some snippet examples to better grasp the above scenario.

Code

Most of the code snippets in this article can be found in my distributed system repository on GitHub. Let's define our lease type to see how it is represented.

type Lease struct {
    Handle      ChunkHandle 
    Expire      time.Time
    InUse       bool
    Primary     ServerAddr 
    Secondaries []ServerAddr
}

func (ls *Lease) IsExpired(u time.Time) bool {
    return ls.Expire.Before(u)
}
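A small usage example shows how `IsExpired` behaves at the boundary. Because it uses `Before`, a lease is still considered valid at the exact instant of `Expire` and only expired after it. The underlying types of `ChunkHandle` and `ServerAddr`, as well as `leaseTTL` and `epoch`, are assumptions made so the example compiles on its own.

```go
package main

import (
	"fmt"
	"time"
)

// Assumed underlying types; the real definitions live in the repository.
type ChunkHandle int64
type ServerAddr string

const leaseTTL = 30 * time.Second // illustrative lease duration

var epoch = time.Date(2024, 1, 1, 0, 0, 0, 0, time.UTC)

type Lease struct {
	Handle      ChunkHandle
	Expire      time.Time
	InUse       bool
	Primary     ServerAddr
	Secondaries []ServerAddr
}

// IsExpired reports whether the lease expired strictly before u.
func (ls *Lease) IsExpired(u time.Time) bool {
	return ls.Expire.Before(u)
}

func main() {
	l := Lease{Handle: 1, Primary: "ServerA", Expire: epoch.Add(leaseTTL)}

	fmt.Println(l.IsExpired(epoch))                         // false: lease just granted
	fmt.Println(l.IsExpired(l.Expire))                      // false: exactly at the deadline
	fmt.Println(l.IsExpired(l.Expire.Add(time.Nanosecond))) // true: just past the deadline
}
```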

The master server in the code below acts as the lease manager and contains all the necessary information to grant a lease to a server performing a write. In this case, a server sends a heartbeat at intervals to the master server with a flag indicating whether the current lease should be extended for the requesting server.

func (ma *MasterServer) RPCHeartBeatHandler(args rpc_struct.HeartBeatArg, reply *rpc_struct.HeartBeatReply) error {
    firstHeartBeat := ma.chunkServerManager.HeartBeat(args.Address, args.MachineInfo, reply)
    if args.ExtendLease {
        var newLeases []*common.Lease
        for _, lease := range reply.LeaseExtensions {
            chk, err := ma.chunkServerManager.extendLease(
                    lease.Handle, lease.Primary)
            if err != nil {
                log.Err(err).Stack().Msg(err.Error())
                continue
            }

            newLeases = append(newLeases, &common.Lease{
                Expire:      chk.expire,
                Handle:      lease.Handle,
                InUse:       false,
                Primary:     lease.Primary,
                Secondaries: lease.Secondaries,
            })
        }

        reply.LeaseExtensions = newLeases
    }

    // other logic - not important to this article

    return nil
}

The ongoing write operation continues if the lease extension request is granted to the file server. In my implementation, I assume that once a write operation starts, it will be completed without being cancelled due to lease expiry.

So, how do we grant a lease to a server for an initial write request from a client, allowing them to push the data to the primary server for writing? Here is an example of an asynchronous request to a server that wants to perform a write with a granted lease.

go func() {
    var (
        args  rpc_struct.GrantLeaseInfoArgs
        reply rpc_struct.GrantLeaseInfoReply
    )
    args.Expire = lease.Expire
    args.Primary = lease.Primary
    args.Secondaries = lease.Secondaries
    args.Handle = lease.Handle
    // tell the primary server which lease it has been granted
    err := utils.CallRPCServer(string(lease.Primary), "ChunkServer.RPCGrantLeaseHandler", args, &reply)
    if err != nil {
        log.Err(err).Stack().Send()
        log.Warn().Msgf("could not grant lease to primary = %v", lease.Primary)
    }
}()

This snippet shows the lease manager informing the server that a lease was successfully granted. This allows the client to start the write operation within the allocated time, as shown in the code below:

func (cs *ChunkServer) RPCWriteChunkHandler(args rpc_struct.WriteChunkArgs, reply *rpc_struct.WriteChunkReply) error {
    data, ok := cs.downloadBuffer.Get(args.DownloadBufferId)
    if !ok {
        reply.ErrorCode = common.DownloadBufferMiss
        return fmt.Errorf(
            "could not locate %v in buffer (might have expired ...)",
            args.DownloadBufferId)
    }

    // calculate the end position of the write from the previous cursor position (offset)
    // and reject the write if it would exceed the maximum chunk size
    dataSize := bToMb(uint64(args.Offset) + uint64(len(data)))
    if dataSize > common.ChunkMaxSizeInMb {
        return fmt.Errorf("provided data size for write action [%v] is larger than the max allowed data size of %v mb",
            args.DownloadBufferId, common.ChunkMaxSizeInMb)
    }

    log.Info().Msgf("args.Replicas => %#v", args.Replicas)

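    // take the next granted lease off the queue; the write is only allowed while it is valid
    lease := cs.leases.PopFront()
    if lease == nil || lease.IsExpired(time.Now()) {
        return fmt.Errorf("could not acquire write lease / lease has expired")
    }

    if err := doWriteOperation(args, cs, data); err != nil {
        return err
    }
    // the lease was already removed from the queue above, so it is not popped again here
    return nil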
}

Once the write operation starts, the server can replicate the data to other servers. A hand-drawn diagram of this process is shown below:

There are several ways to implement lease management. This is just one simple example.

Benefits of Lease Management

The following are the benefits of having a lease management system in a distributed system:

  1. Consistency: It ensures that only one server holds the lease to perform write operations at any given time, maintaining consistency.

  2. Availability: It allows failover to backup servers if the primary server fails or the lease expires, ensuring high availability.

  3. Coordination: It coordinates write operations and replication effectively, reducing conflicts and data corruption.

Lease management is a crucial topic in distributed systems. How you implement it depends on your specific project. I hope this article gives insight into what you can achieve with leases.

I am Caleb, and you can reach me on LinkedIn or follow me on Twitter @Soundboax.