您的当前位置：首页 Causally consistent recovery of partially replicated logs

Causally consistent recovery of partially replicated logs

来源：华拓网

i t Causally Consistent Recovery of Partially Replicated Logs Kenneth Kenneth P. P. Birman* Kane November 1988 88-949 Department of Computer Science Cornell University Ithaca, NY 14853-7501 , 'This work was supported by DARPA (DoD) under ARPA order 6037, Contract N00140-87-C-04, also by a grant from the Siemens Corporation. The views, opinions and findings contained in this report are those of the authors and should not be construed as an official Department of Defense position. policy or decision. Causally Consistent Recovery of Partially Replicated Logs Kenneth P. Kane and Kenneth P. Birman Cornell University Computer Science Department Ithaca, New York 14853 USA November 28, 1988 Abet ract An algorithm is presented for the consistent recovery of replicated data in a client-server system. The algorithm is based on logging and is similar to the optimistic techniques that are well known in the literature. However, unlike in existing optimistic techniques, explicit dependency information is. not maintained. Instead, dependency in- formation is estimated from the ordering of messages found in servers’ logs. These dependency estimates can, in general, be expensive to compute. It is therefore shown how inexpensive estimates can be ap- plied when a system is well structured. 1 Introduction Object oriented distributed systems are becoming increasingly common. These systems provide users with tools for building abstract data objects. Such an object generally consists of routines for maintaining it along with an interface by which clients access it. Only the interface of an object is visible to a client; implementation details, such as replication and failure I luor Rosourco Allocation Implo- Implo- montat ion montat ion Intorf .GO Inrorfaco Figure 1: A portion of a distributed operating system. Depicted are two objects representing a name server and a resource allocation man- ager. Clients or processes in the system operate by first registering themselves with the naming service and then allocating resources under that name. recovery, are hidden within the object module. Figure 1 depicts such a sys tern. Failure recovery in these systems is often accomplished through the use of logging. By writing to a log file the sequence of updates that occurs to an object, the object’s state can be reconstructed after a failure. How- ever, because the states of objects may be related, consistency problems potentially arise if object logs are not coordinated. For example, in the system of figure 1 the state of the resource manager is dependent on the state of the name server; only registered clients may allocate resources. Suppose a failure causes a client registration record to be lost (not logged). If resource allocations are logged for this client, then the system may later recover into a state that reflects the client’s allocations without reflecting its registration. Transactions can be used to enforce consistency between logs. For ex- ample, the registration of a client name and its allocation of resources could 2 be grouped into a single transaction and committed as a unit, in order to ensure that they are recovered atomically. However, many applications do not require the full power of atomicity that transactions provide. Of- ten a weaker form of consistency, such as causal consistency, is sufficient to guarantee correctness [Lam78,BJ87a]. In loosely coupled systems such as ISIS [BJ87a], this weakening of consistency usually leads to improved performance and availability. This paper presents a log-based mechanism for the causally consistent recovery of replicated data objects. The problem of representing and main- taining causal dependency information about the updates on objects is not a simple one. Solutions to this problem have been devised for many different settings, including inter-process communication [BJ87b,PBS88], highly available distributed services [LL86], and optimistic failure recovery [SY85,J288]. Dependency information in these systems is maintained ex- plicitly: each object update is tagged with either an enumeration of the updates on which it is dependent or with a timestamp that reflects the update’s causal ordering. Unfortunately, it can be difficult or impossible to maintain explicit de- pendency information about updates when the set of object clients is ei- ther unknown or large and dynamically changing. The recovery algorithm presented in this paper avoids the need to maintain explicit dependency information by estimating such information from the ordering of updates in object servers’ logs. When an object server first recovers from a failure, it approximates the set of dependencies in the system from ordering infor- mation available in the logs of servers. This information is then used to ensure that only consistent object states are recovered. The presentation of the algorithm is divided into two parts. First, in sections 2 through 6, a recovery algorithm is derived based on explicit knowledge of the dependencies between object updates. In section 2, the formal system model is presented and in section 3 the notions of consis- tency and correctness are defined. Based on these definitions, section 4 outlines several consistency problems that arise through the use of logging 3 and presents a basic sketch of the recovery mechanism. The actual imple- mentation of the recovery algorithm is built on functions for consistently adding and deleting entries from logs. These functions are presented in section 5 and used in section 6 to describe the recovery algorithm. The second part of the presentation discusses methods for estimating dependency information from the ordering of updates in object logs. Sec- tion 7 presents several dependency estimates and describes how they can be used in the recovery algorithm in place of the values they approximate. In general, the estimates used can be expensive to compute. Because of this, section 8 describes a special class of systems in which inexpensive estimates can be used by the algorithm. The material presented in this paper is a summary of that in [Kan]. Much of the formalism and all of the proofs have been omitted for the purposes of brevity. 2 System Model 2.1 Partial Replication Our system model is a partially replicated variation of the client-server model of computation. A set of servers, denoted SERV, are used to maintain replicas of a set of data objects. Each server maintains replicas of several different objects. Data objects are not fully replicated: each object is replicated at only some of the servers. We let D denote the set of all data objects in the system and let S E RVA denote the subset of servers managing a replica of object A (A E D). Objects are accessed by a set of clients that may or may not differ from the set of servers. In order to access an object, A E D, a client broadcasts a request to all servers of A, that is, to all members of SERVA. Upon receiving such a request, each server of object A performs the requested operation on its local copy of A. We make no assumption about the relative ordering of client requests. Jobs Jobs Cornps Jobs Comps Comps job1 qj6 j& -PI -PI .; client ........ 2';. ......... ... .l .... ........ .,' 'client ......... 1 ';. .:'client ......... 2';, Figure 2: An example of a partially replicated printer service. In figure 2(a), both system clients are broadcasting job submissions. In fig- ure 2(b), the job submissions have completed with the second client broadcasting a notification of the completion of the first job. Servers may receive the same requests in differing orders, if the orders are mutually consistent and correct with respect to the application being implemented. It is the responsibility of clients to ensure that such correct orderings are perceived by the servers. To this end, clients may use a variety of broadcast mechanisms, each differing in the ordering properties it provides. As an example, consider figure 2. This figure depicts two states in the execution of a system maintaining information about a printer service. The system consists of three servers (f, 9, and h) replicating two data objects (Jobs and Comps). The object Jobs is a list of jobs that have been submitted for printing and is replicated at servers f and 9. The object Comps is a list of completed jobs and is replicated at servers g and h. Figure 2(a) depicts a state of the system in which two clients (client 1 and client 2) are submitting jobs for printing. Note that the job submissions will be received in different orders by the two servers of object 5 I Jobs. Figure 2(b) depicts a later state of the system after which both job submission broadcasts (jobl and jobl) have completed. In this state, client 2 is in the process of broadcasting a completion notification for jobl. 2.2 Logging In order to support recovery from failures, each server maintains a log of the requests that it receives. Definition 2.1 A log is a totally ordered set (L, -+L) of requests. Here, L is the set of requests received by a server and -'L is the order in which those requests were received. Only update requests are actually logged. Read only requests are omitted because they do not affect an object's state. Note that because servers may receive requests in different orders, they may also log requests in different orders. Definition 2.2 The projection of a log, (L, +L), onto an object, A E D, is the set of object A requests in the log. Formally, (L, -L) = { x E L I x is a request on object A } In order to decouple the execution speed of servers from the speed of logs, servers maintain their logs asynchronously. No coordination occurs between the logs of different servers. In addition, no coordination occurs between the state of a server and its log. The state of a log may often lag behind the state of its server. (This approach is orthogonal to that of write-ahead logging where the state of an object and its log are always synchronized [BHG87].) 2.3 Failures Servers fail by crashing [SSSS]. When a server crashes, it immediately ceases to receive, process, and log client requests. We will not address the problem 6 # I .'._.__..' I I I I I I I I 0 I I I I I I @ 0 I ....... 1 .'.client ......_..' 2 I time 0 time t, time I ta Figure 3: One possible execution of the printer service of figure 2. Depicted are two job submission broadcasts (job, and job2) along with one job completion notification (cmpl). Also depicted are two fail- ures. Server f fails at time tl and server g fails at time t2. In the diagram, horizontal tines represent process (server or client) ex- ecutions while diagonal arrows represent request message broad- casts. Dotted lines represent the logging of request messages by servers. The length ofa dotted line indicates the latency between the receipt of a request and its physical logging. of server partitions [DGMS85]. When a server is functioning, we assume that it can communicate with all other functioning servers. Figure 3 depicts a possible execution of the printer service shown in fig- ure 2. In the example, server f fails at time tl after receiving (and logging) jobl, but before receiving jobz. Server g fails at time tz after receiving (and logging) all three requests. Server h functions continuously through out the example, receiving and logging the job completion notification compl. Note that the final logs of servers f and g do not agree on the state of object Jobs. Not only do they contain different requests for the object, but they reflect different orders on those requests. 2.4 System State At the time a server recovers, the objects in the system can be divided into two categories. An active object is one for which some server is actively managing a replica. An inactive object is one for which all servers of the object have failed or are in the process of restoring their replicas. For the purpose of recovery, the state of the system can be summarized in the following manner: Definition 2.3 A state of the system is characterized by the following val- ues: For each data object, .4 E D: .4CTA The set of servers actively managing a replica of ob- ject A. RECA The set of servers in the process of restoring their replicas of object A. FALA The set of failed servers of object A. For each server, f E SERV: (L,, -,) The log of server f. 8 Figure 4: A state from the printer service execution given in figure 3. De- picted is the state of the system immediately after time t1. As an example, consider again the execution of figure 3. Figure 4 shows the state of this system immediately after time tl. 3 Causal Dependencies During the execution of a system clients can interact with one another.' These interactions often lead to data dependencies between the requests the clients issue. For example, in figure 2 the job completion notification campl is causally dependent* on the job submission jobl: a job cannot complete until afler it has been submitted. Causal dependencies restrict the set of correct request orderings that can be perceived by servers. A server should never receive two causally related requests out of causal order. Earlier it was stated that servers may receive requests in differing orders, provided that those orders are correct for the application. This can be stated more precisely by saying that servers may receive requests in any order consistent with causality. 'Clients can interact either directly, by sending messages to one another, or indirectly, through the objects managed by the servers. 2Many types of dependencies can exist between client requests. In this paper, however, we will focus on causal dependencres. 9 Figure 5: A request system representing the dependencies in the printer ser- vice. The system consists of three requests: two job submissions and a job completion notification. The only causal dependency in the system is the one between the completion notification of job1 and its submission. 3.1 Request Systems The causal dependency structure of an application can be summarized by means of a request system. Definition 3.1 A request system is a partially ordered set (R, +R) of re- quest s. Here, R is the set of all requests made by clients in the system and +R is a partial order that relates all pairs of causally dependent requests. The partial order +R may be interpreted as meaning that if two requests are related, z ) = addMS, (deleteNR, (Lj , *f)) Actually, the new state recovered for object A may not be totally con- sistent with the states of active objects. It is possible that an active object may have a dependency on a request that is not recovered for object A. This can happen, for example, if the dependent request was never logged or because the servers that did log it never recovered in time to take part in the ACTIVATE phase. When this problem of a missing dependent occurs, the ACTIVATE phase must abort the restoration of object -4. It must then wait for additional servers of object A to recover (hopefully with the miss- ing dependent present in one of their logs) before re-attempting to activate the object. 7 Dependency Estimation The implementations of the JOIN and ACTIVATE phases assume that servers have knowledge of = { (3.B I 3f ESERVA ~SERVB : (~..4,y.B E Lf A 3.B 2.A O.W. \\ A icono(y.B < x.A)) } 7.4 Compound Estimates Transitive dependencies are estimated by approximating the sequences of direct dependencies out of which they are built. In presenting these esti- mates, the following definition will be useful: Definition 7.5 A chain, H, is a sequence of related objects. Intuitively, a chain represents a sequence of objects along which a transitive dependency may occur. If a chain such as H exists, then it is possible for an object -4, request to be dependent on an object A1 request through a sequence of dependencies on requests on objects An-,, An-2, . . . , Az. However, there is no guarantee that such a transitive dependency exists. The existence of a chain only implies the potential for such a dependency. We let AB-CHAINS denote the set of all chains from object -4 to object B. The following definitions, based on chains, will also be useful: 28 Definition 7.6 A sub-chain of a chain, H, is any subsequence of its objects where 1 1. ml < m2 < . . . < mp 5 n. Definition 7.7 The AiAj sub-chain of a chain, H, is the sub-chain of objects from A, to A,: 7.4.1 Dependency Set We denote our compound estimate of DEPB(z.A), the object B dependents of request x.A, by the request set deps(2.A). This estimate, like the basic estimate, has the property that when defined it contains all of the object B dependents of request z.A, plus possibly a few extraneous requests. This estimate is built out of estimates of dependencies along individual chains. In order to estimate the object B dependents of request z.~, the dependents along each chain from A to B are separately estimated. These estimates are then combined to form a complete estimate of the dependency set. Specifically, let H denote any chain. We let deps(zn.An) denote our estimate of the object A1 dependents of request zn.An that occur along chain H. That is, dep'$(z,.A,) estimates the set of object A, requests that are related to z,.~, by a sequence of dependencies on the objects in chain H. 29 The estimate depz(zn.A,) can be formed in many ways. First, if there is a server that manages replicas of both objects Al and A,, then an estimate In can be obtained by simply applying the basic estimate dep\\,(z,.A,). general, however, the server sets of objects A1 and -4, will not overlap unless the objects are directly related. Alternately, an estimate can be formed by subdividing the problem. That is, an estimate can be formed by first choosing some object in the estimating the object A, dependents of z,.A,, and chain, A, (1 < i < n), then estimating the object AI dependents of the object A; dependents. Again, if the server sets of objects Al and A, overlap, and the server sets of objects A, and A, overlap, the basic estimates can be applied to solve each these sub-problems. That is, if the server sets overlap, an estimate can be formed directly along the sub-chain However, this is not likely to be the case unless the pairs of objects are directly related. If the server sets do not overlap, then each of the sub- problems will have to be further subdivided in a manner similar to the original problem. In general, the problem will have to be sub-divided until a sub-chain of H is found 1 < ml < m2 4 ... < mp < n in which each pair of adjacent objects have overlapping server sets. An estimate can then be formed along this sub-chain by first approximating the object A,,,, dependents of z,.~,; then approximating the object A,,-, dependents of the object A,, dependents; and similarly approximating the dependents for each successive object down the sub-chain. This process can be specified recursively in the following manner. Note that the estimate is extended to operate on sets of requests. That is, 30 depg(Q) denotes the set of object AI dependents of the object A,, requests in Q:' where 1 < i < n is chosen so that the estimates are defined. Note that there may be several choices of i for which the estimates are defined. Each will likely yield a slightly different approximation of the true dependency set. However, each is guaranteed to contain all of the true dependencies that occur along H. Because of this, an even more accurate estimate of the true dependency set (one with fewer extraneous requests) can be formed by intersecting the estimates from each choice of i. The estimate can thus be modified as follows: Note that the definitions of union and intersection must be extended to take into account the possibility of undefined approximations. This is done as follows: I if 32 : S, =I = ( usi O.W. t 'The length of a chain, H, is denoted by ~~This His ~the number ~. of objects in the chain. 31 The estimate of the complete set of object B dependents of request x..4 is formed by combining the estimates of dependency along each chain from B to A. Formally, deps(x.A) = HE B de& (2. A) A-CHAINS U 7.4.2 Request Ordering We denote our compound estimate of the predicate CON(2.A 4 y.B) by the predicate conW(z.A 3 y.B). Like the basic estimate, this predicate has the property that, when true, it is guaranteed that request y.B is not casually dependent on request 2.A. And, when false, the requests may or may not be related. As with the dependency set estimate, the request ordering estimate is built up from estimates of ordering along individual chains. For any AIA,- chain, H, we let cm~(zl .AI 4 z,.A,) denote our estimate of whether or not request z,.A, is causally dependent on request q.A1 along chain H. When the estimate is true, it is guaranteed that request z,.A, is not dependent on request zl.Al by a sequence of dependencies along the chain. The idea behind this estimate is to search the chain for an object, A,, at which any possible dependency path from 21 to x, is broken. That is, the chain is searched for an object, A,, such that none of the A; dependents of 2, are dependent on 21. The existence of such an object would imply that request 2, is not dependent on request 21 by a sequence of dependencies that include object A,. Because -4; is a part of chain H, this would in turn imply that the requests cannot be dependent along chain H. The estimate is formed by examining each object, A,, in the chain. For each such object, the dependents of request x,.~, are estimated. These dependent requests are then recursively tested to determine if any of them are dependent on zl.AI. The formal definition of this function is given below. Note that the estimate is extended to operate on sets of requests; that is, ~on~(z~..~~ 4 Q) denotes our approximation of whether or not any 32 of the object A, requests in Q are causally dependent, along chain H, on request zl.Al: cmWH(z1-Al 4 Q) = [ l

因篇幅问题不能全部显示，请点此查看更多更全内容

查看全文