Handout 10

Balanced Trees*


Introduction

The term B-Trees is variously interpreted as Balanced Tree, Bayer Tree (for the author of the first definitive paper on the subject), Broad Trees, Brushy Trees or Boeing Trees. We will stick to the term Balanced, but you should remember that Bayer's name is sometimes used.


B Tree Attributes

B Tree of order m


Adding Values

The basic B-Tree consists of a page, containing keys, pointers, and data. There is one key associated with each record. B-Trees grow upwards, based on the need to index separate pages (blocks). When one block is too small to accommodate all records, it is split according to the following rules:

  1. The leftmost values are split to the left
  2. The rightmost are split to the right
  3. The center value is passed upwards

Consider the following set of numbers, inserted in the order in which they are listed:


a) N D W K
b) B
c) P R A
d) X
e) G Q L V;
f) C
g) E T
h) Y
i) U F Z
j) O.





Removing Values

Deletion is a good deal more difficult in cases where `underflow' occurs (i.e., a block ends up with too few keys).

  1. In the case of underflow, the first rule is to attempt to borrow from the left (by convention).
  2. If the first attempt fails by causing underflow in the adjacent block, the two blocks are combined, and the key at the next higher level is adjusted accordingly.
  3. If a top level key is deleted, its closest adjacent key is substituted and the remaining tree is adjusted for underflow.
  4. If necessary, one level is collapsed.
Assume deletion in the following sequence:
a) O
b) Z F U
c) Y
d) T
e) E
f) C
g) V L
h) Q
i) G X
j) A
k) R
l) P
m) B
n) K W D N

Note that the sequence of deletions is the exact opposite of the insertion sequence given above.




Conclusion

B-trees are extremely powerful structures for direct searches for random data. Since the greatest time delays are in disk accesses, a structure that minimizes the search as the file grows large is likely to be very helpful in on-line key searches. In many B-tree structures, because a good many values and pointers can be contained in a block, it is frequently possible to contain quite a large data base in three or four levels of indexing. Consider, for example, a block that had 99 data elements in a block (100 pointers). Three levels of indexing would provide for 1,000,000 values at the lowest level, plus an additional 110,000 at higher levels. It is not unreasonable to assume a structure that can include one hundred or more pointers in a single block.

Access time in a B tree is a logarithmic function of n/2. Experiments with the cost of constructing such files, using either sequential or random insertion shows that, with proper paging and buffering, disk accesses can be kept to a minimum. The use of this structure in a dynamic file is generally excellent.


Variations to the Balanced Tree file structure

The original B-tree was a structure in which each node contained keys, data, and pointers to lower blocks. This is usually referred to as just plain B-Tree structure. It is easiest to design in fixed length keys and records. It will give excellent direct search attributes for retrieving a single variable, but it does not lend itself ideally to the get next problem, because it is not constructed with back pointers, meaning that one has to return to the top of the tree in order to find a next record outside the given block.

This situation can be alleviated slightly by having a tree in which the B-Tree contains only keys and pointers, and the actual data are in `leaves' at the bottom of the tree (see diagram). This structure is sometimes called a B+ Tree. Since the upper part of the tree has only pointers and keys, it is easier to have more pointers in each block, and the height of the tree is compressed Accordingly, searches are a bit faster for this reason. However, get next still may require returning to the entry point.


The next improvement is to have the same structure as a B+ tree, but to add sequential pointers at the leaf level. This structure, often referred to as a B* Tree, is the most efficient of this family for get next searches, and it does not sacrifice anything for direct access searches. The principal price paid is that the sequential pointers between blocks must be maintained. In a dynamic file where blocks are created and destroyed, this updating may be a bit difficult, therefore causing a slight degradation in retrieval performance.

The greatest difficulty with a tree structure is the updating to make sure that the tree remains balanced. In worst cases, these balancing acts can be time-consuming and slow up an interactive system. Because B-Trees are used more and more, we are learning more about their efficient implementation. Buffering is a big help. Marking levels as requiring updating without actually doing the updating is also done often, usually in conjunction with a system ``write demon'' that does updating at intervals determined in part by user load and by the need to clear buffers. Buffers can be very helpful in shared files. A large number of buffers are often used in dedicated Database management systems (sometimes two hundred or more). Use of a Least Recently Used algorithm usually ensures that the head of the tree will remain in memory, thereby speeding up searches.

Storage efficiency is in part determined by the algorithm used to determine when to split a block, and, when it is split, how full to leave each component. Storage efficiency is also affected by the sequence in which a file is created. In some cases, the worst storage efficiency occurs when a file is filled sequentially in key order, since a split block is never revisited.

B-trees are widely used in operating system directories and other files, and in database systems in general. We will not have time to study them in greater detail than this series of presentations, but you should be aware of them, and read at least one of the references given above.

One important use of B-Trees is in the index file structure. In the earlier illustration on index files, we noted that the index files are maintained as normal sequential files. If, however, we maintain them as B-Trees, then the problem of updating is greatly relieved, and the Indexed file becomes one of the most powerful on-line strategies for multiply indexing a reference file. The illustration shows how more than one index can be used to classify a file.



* The source of this page is here.