Hierarchical Byzantine Agreement. Nearly all of the Byzantine fault tolerant agreement strategies published in the literature assume a flat peer-to-peer communication model. This assumption does not necessarily mean that all the parties are connected with point-to-point links. But, it does mean that the algorithms/mechanisms used to tolerate Byzantine faults attempt to establish agreement among all of the participants as a whole set. A contrasting approach is to first reach agreement within small subsets, for example pairs, of the participants and then reach agreement among the subsets. One benefit of the latter approach is greatly reduced bandwidth. The following paragraphs describe Hierarchical Byzantine Agreement and how this bandwidth is saved. To reach agreement in Byzantine fault scenarios, multiple rounds of communication exchange are required. According to the well-developed understanding of Byzantine fault tolerance, an additional round of exchange is needed for every additional Byzantine fault to be tolerated. In nearly all of the published algorithms, each of the rounds of exchange requires retransmitting all the data received in the previous round. A reduced representation of a data message (e.g. a hash) could be sent out during all but the first round of exchanges. However, using a reduced representation would lead to reduced coverage for failures. The extreme case makes this point obvious—if the hash was simply parity, half of all failures would not be detected. Note that this hash should not be confused with the digital signatures referenced above. These two mechanisms serve totally different purposes. The hash would be used to reduce the size of messages in subsequent rounds of communication; whereas, digital signatures are used to reduce the number of fault containment zones required. In hierarchical Byzantine agreement, each subset of receivers agree among themselves whether they have all received a message correctly. If they have received the message correctly, they need only send out a single status bit that indicates this during the next round of exchange. Thus, subsequent rounds of communication that would require entire messages to be exchanged in the well-known Byzantine algorithms are now reduced to a single status bit that indicates whether the entire message was received correctly or not. One could construct a hierarchical Byzantine agreement scheme that relaxes this all-or-nothing agreement, but that is outside the scope of this report. Another variation of hierarchical Byzantine agreement does not explicitly exchange status data. Instead, the behavior of the subset as a whole is governed by the (implicitly) shared subset status. For example, if the members of a subset don’t agree on the reception status of a message sent into the subset, the subset may not perform their normal function of relaying that message to other subsets. The next section of this report describes how TTEthernet uses this variation of hierarchical Byzantine agreement. The first known use of hierarchical Byzantine agreement in a fielded system was the ARINC 659 (SAFEbus) [29]. In this case, the subsets were self-checking pairs. The Honeywell-designed SAFEbus interface hardware contains mechanisms that allow the halves of a pair to agree on whether they both received a message correctly or not, and also to automatically transmit a message containing the status of this agreement for other pairs to see. The agreement within a pair is achieved by employing a syndrome exchange between the halves of the pair. This exchange is transferred via local short-run wiring within the pair. Using this local wiring for the exchange eliminates the need for this agreement traffic to be carried on the SAFEbus itself. The status exchange between pairs forms the upper layer of the hierarchical Byzantine agreement mechanism, which allows all the pairs within a SAFEbus system to reach agreement. In this case, the size of the status is a 32-bit word, the smallest message that SAFEbus can send. Because the SAFEbus self-checking pairs include self-checking drivers and paired media [30], its failure semantics are constrained to be the asymmetric benign fault class. As with all statements that limit faulty behavior to a particular class, there is a non-zero probability for occurrence of some faulty behavior outside this class [12]. Whenever such statements are made in support of arguing that a system meets dependability requirements, analysis has to be done to show that these probabilities are sufficiently small (i.e., the total of all probabilities for ignored failure modes added to other failure probabilities must not exceed system requirements)3. In the case of SAFEbus, a failure can “escape” if both halves of a self-checking pair fail in such a way as to produce identical outputs. Because the halves are independent, this escape would require two failures. It can be shown that the probability of two failures within one self-checking pair is well below the maximum allowed probability. Thus, the SAFEbus hierarchical Byzantine agreement saves bandwidth costs in two ways. First, the communication between halves of a pair uses a very inexpensive short-run private communi- 3an essential argument that many architecture analyses omit cation path that consumes no bandwidth on the main SAFEbus communication media. Second, the second round of exchange sends messages that consist of only a single word instead of fully replicating all of the data from the first round of exchange. Another benefit of hierarchical Byzantine agreement is its ability to tolerate some multiple Byzantine faults while using less hardware than most approaches reported in the literature. In the case of SAFEbus, multiple Byzantine faults can be tolerated as long as no two faults are located within the same pair, produce identical symptoms, and occur at the same time. Given the fault zone independence between the two halves of a pair, the probability of such a dual failure occurring can be calculated and found to be much less than the maximum allowed by system requirements. A SAFEbus system can tolerate any number of other Byzantine faults within pairs4 simply by adding another pair for each fault to be tolerated. The two benefits of hierarchical Byzantine agreement augment each other to give a surprising amount of cost saving. As an example, compare SAFEbus against the two-fault scenario given in Figure 14. In this scenario, SAFEbus would require only three self-checking pairs, two rounds of exchange, transmit bandwidth of 4x+8 bytes, and receive bandwidth of 8x+16 bytes5. For this tremendous savings, SAFEbus gives up only the ability to tolerate two simultaneous identical Byzantine faults within a pair.
Appears in 5 contracts
Samples: ntrs.nasa.gov, ntrs.nasa.gov, ntrs.nasa.gov