1, then $ax \not\equiv 1 \bmod N$, no matter what $x$ might be, and therefore $a$ cannot have a multiplicative inverse modulo $N$. In fact, this is the only circumstance in which $a$ is not invertible. When $\gcd(a, N) = 1$ (we say $a$ and $N$ are relatively prime), the extended Euclid algorithm gives us integers $x$ and $y$ such that $ax + Ny = 1$, which means that $ax \equiv 1 \pmod N$. Thus $x$ is $a$'s sought inverse.

Example. Continuing with our previous example, suppose we wish to compute $11^{-1} \bmod 25$. Using the extended Euclid algorithm, we find that $15 \cdot 25 - 34 \cdot 11 = 1$. Reducing both sides modulo 25, we have $-34 \cdot 11 \equiv 1 \bmod 25$. So $-34 \equiv 16 \bmod 25$ is the inverse of $11 \bmod 25$.

Modular division theorem For any $a \bmod N$, $a$ has a multiplicative inverse modulo $N$ if and only if it is relatively prime to $N$. When this inverse exists, it can be found in time $O(n^3)$ (where as usual $n$ denotes the number of bits of $N$) by running the extended Euclid algorithm.

This resolves the issue of modular division: when working modulo $N$, we can divide by numbers relatively prime to $N$, and only by these. And to actually carry out the division, we multiply by the inverse.

Is your social security number a prime?

The numbers 7, 17, 19, 71, and 79 are primes, but how about 717-19-7179? Telling whether a reasonably large number is a prime seems tedious because there are far too many candidate factors to try. However, there are some clever tricks to speed up the process. For instance, you can omit even-valued candidates after you have eliminated the number 2. You can actually omit all candidates except those that are themselves primes.

In fact, a little further thought will convince you that you can proclaim $N$ a prime as soon as you have rejected all candidates up to $\sqrt{N}$, for if $N$ can indeed be factored as $N = K \cdot L$, then it is impossible for both factors to exceed $\sqrt{N}$.

We seem to be making progress! Perhaps by omitting more and more candidate factors, a truly efficient primality test can be discovered.
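The pruning tricks described so far amount to trial division. As a minimal sketch (the function name `is_prime_trial_division` is an illustrative choice, not something from the text), here is one way to write it in Python, skipping even candidates and stopping at $\sqrt{N}$:

```python
from math import isqrt


def is_prime_trial_division(n: int) -> bool:
    """Trial division: n is prime if no candidate in 2..sqrt(n) divides it."""
    if n < 2:
        return False
    if n % 2 == 0:
        return n == 2  # 2 is the only even prime
    # Even candidates are ruled out; odd ones up to floor(sqrt(n)) remain.
    for d in range(3, isqrt(n) + 1, 2):
        if n % d == 0:
            return False
    return True


if __name__ == "__main__":
    print([p for p in (7, 17, 19, 71, 79, 91) if is_prime_trial_division(p)])
```

The loop still runs about $\sqrt{N}/2$ times in the worst case, which is exponential in the number of bits of $N$.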
Unfortunately, there is no fast primality test down this road. The reason is that we have been trying to tell if a number is a prime by factoring it. And factoring is a hard problem!

Modern cryptography, as well as the balance of this chapter, is about the following important idea: factoring is hard and primality is easy. We cannot factor large numbers, but we can easily test huge numbers for primality! (Presumably, if a number is composite, such a test will detect this without finding a factor.)

1.3 Primality testing

Is there some litmus test that will tell us whether a number is prime without actually trying to factor the number? We place our hopes in a theorem from the year 1640.

34 Algorithms

Fermat's little theorem If $p$ is prime, then for every $1 \le a < p$, $a^{p-1} \equiv 1 \pmod p$.

[…] 10 where the exponentiation can be performed using fewer multiplications, by some other method.

1.17. Consider the problem of computing $x^y$ for given integers $x$ and $y$: we want the whole answer, not modulo a third integer. We know two algorithms for doing this: the iterative algorithm which performs $y - 1$ multiplications by $x$; and the recursive algorithm based on the binary expansion of $y$. Compare the time requirements of these two algorithms, assuming that the time to multiply an $n$-bit number by an $m$-bit number is $O(mn)$.

1.18. Compute $\gcd(210, 588)$ two different ways: by finding the factorization of each number, and by using Euclid's algorithm.

1.19. The Fibonacci numbers $F_0, F_1, \ldots$ are given by the recurrence $F_{n+1} = F_n + F_{n-1}$, $F_0 = 0$, $F_1 = 1$. Show that for any $n \ge 1$, $\gcd(F_{n+1}, F_n) = 1$.

1.20. Find the inverse of: 20 mod 79; 3 mod 62; 21 mod 91; 5 mod 23.

1.21. How many integers modulo $11^3$ have inverses? (Note: $11^3 = 1331$.)

1.22. Prove or disprove: If $a$ has an inverse modulo $b$, then $b$ has an inverse modulo $a$.

1.23. Show that if $a$ has a multiplicative inverse modulo $N$, then this inverse is unique (modulo $N$).

1.24. If $p$ is prime, how many elements of $\{0, 1, \ldots, p^n - 1\}$ have an inverse modulo $p^n$?

1.25.
Calculate $2^{125} \bmod 127$ using any method you choose. (Hint: 127 is prime.)

1.26. What is the least significant decimal digit of $17^{17^{17}}$? (Hint: For distinct primes $p, q$, and any $a \not\equiv 0 \pmod{pq}$, we proved the formula $a^{(p-1)(q-1)} \equiv 1 \pmod{pq}$ in Section 1.4.2.)

1.27. Consider an RSA key set with $p = 17$, $q = 23$, $N = 391$, and $e = 3$ (as in Figure 1.9). What value of $d$ should be used for the secret key? What is the encryption of the message $M = 41$?

1.28. In an RSA cryptosystem, $p = 7$ and $q = 11$ (as in Figure 1.9). Find appropriate exponents $d$ and $e$.

1.29. Let $[m]$ denote the set $\{0, 1, \ldots, m - 1\}$. For each of the following families of hash functions, say whether or not it is universal, and determine how many random bits are needed to choose a function from the family.
(a) $H = \{h_{a_1,a_2} : a_1, a_2 \in [m]\}$, where $m$ is a fixed prime and $h_{a_1,a_2}(x_1, x_2) = a_1 x_1 + a_2 x_2 \bmod m$. Notice that each of these functions has signature $h_{a_1,a_2} : [m]^2 \to [m]$, that is, it maps a pair of integers in $[m]$ to a single integer in $[m]$.
(b) $H$ is as before, except that now $m = 2^k$ is some fixed power of 2.
(c) $H$ is the set of all functions $f : [m] \to [m - 1]$.

1.30. The grade-school algorithm for multiplying two $n$-bit binary numbers $x$ and $y$ consists of adding together $n$ copies of $x$, each appropriately left-shifted. Each copy, when shifted, is at most $2n$ bits long.
In this problem, we will examine a scheme for adding $n$ binary numbers, each $m$ bits long, using a circuit or a parallel architecture. The main parameter of interest in this question is therefore the depth of the circuit, that is, the longest path from the input to the output of the circuit. This determines the total time taken for computing the function.
To add two $m$-bit binary numbers naively, we must wait for the carry bit from position $i - 1$ before we can figure out the $i$th bit of the answer. This leads to a circuit of depth $O(m)$. However, carry lookahead circuits (see wikipedia.com if you want to know more about this) can add in $O(\log m)$ depth.
(a) Assuming you have carry lookahead circuits for addition, show how to add $n$ numbers each $m$ bits long using a circuit of depth $O((\log n)(\log m))$.
(b) When adding three $m$-bit binary numbers $x + y + z$, there is a trick we can use to parallelize the process. Instead of carrying out the addition completely, we can re-express the result as the sum of just two binary numbers $r + s$, such that the $i$th bits of $r$ and $s$ can be computed independently of the other bits. Show how this can be done. (Hint: One of the numbers represents carry bits.)
(c) Show how to use the trick from the previous part to design a circuit of depth $O(\log n)$ for multiplying two $n$-bit numbers.

1.31. Consider the problem of computing $N! = 1 \cdot 2 \cdot 3 \cdots N$.
(a) If $N$ is an $n$-bit number, how many bits long is $N!$, approximately (in $\Theta(\cdot)$ form)?
(b) Give an algorithm to compute $N!$ and analyze its running time.

1.32. A positive integer $N$ is a power if it is of the form $q^k$, where $q, k$ are positive integers and $k > 1$.
(a) Give an efficient algorithm that takes as input a number $N$ and determines whether it is a square, that is, whether it can be written as $q^2$ for some positive integer $q$. What is the running time of your algorithm?
(b) Show that if $N = q^k$ (with $N$, $q$, and $k$ all positive integers), then either $k \le \log N$ or $N = 1$.
(c) Give an efficient algorithm for determining whether a positive integer $N$ is a power. Analyze its running time.

1.33. Give an efficient algorithm to compute the least common multiple of two $n$-bit numbers $x$ and $y$, that is, the smallest number divisible by both $x$ and $y$. What is the running time of your algorithm as a function of $n$?

1.34. On page 38, we claimed that since about a $1/n$ fraction of $n$-bit numbers are prime, on average it is sufficient to draw $O(n)$ random $n$-bit numbers before hitting a prime. We now justify this rigorously.
Suppose a particular coin has a probability $p$ of coming up heads. How many times must you toss it, on average, before it comes up heads?
(Hint: Method 1: start by showing that the correct expression is $\sum_{i=1}^{\infty} i(1-p)^{i-1}p$. Method 2: if $E$ is the average number of coin tosses, show that $E = 1 + (1 - p)E$.)

1.35. Wilson's theorem says that a number $N$ is prime if and only if
$$(N - 1)! \equiv -1 \pmod N.$$

S. Dasgupta, C.H. Papadimitriou, and U.V. Vazirani 51

(a) If $p$ is prime, then we know every number $1 \le x$

0 (in the multiplication algorithm, $a = 3$, $b = 2$, and $d = 1$). Their running time can therefore be captured by the equation $T(n) = aT(\lceil n/b \rceil) + O(n^d)$. We next derive a closed-form solution to this general recurrence so that we no longer have to solve it explicitly in each new instance.

Master theorem² If $T(n) = aT(\lceil n/b \rceil) + O(n^d)$ for some constants $a > 0$, $b > 1$, and $d \ge 0$, then
$$T(n) = \begin{cases} O(n^d) & \text{if } d > \log_b a \\ O(n^d \log n) & \text{if } d = \log_b a \\ O(n^{\log_b a}) & \text{if } d < \log_b a. \end{cases}$$

²There are even more general results of this type, but we will not be needing them.

Figure 2.3 Each problem of size $n$ is divided into $a$ subproblems of size $n/b$. [Recursion tree with branching factor $a$; subproblem sizes $n$, $n/b$, $n/b^2$, ..., 1 at successive levels; depth $\log_b n$; width $a^{\log_b n} = n^{\log_b a}$.]

This single theorem tells us the running times of most of the divide-and-conquer procedures we are likely to use.

Proof. To prove the claim, let's start by assuming for the sake of convenience that $n$ is a power of $b$. This will not influence the final bound in any important way (after all, $n$ is at most a multiplicative factor of $b$ away from some power of $b$; see Exercise 2.2), and it will allow us to ignore the rounding effect in $\lceil n/b \rceil$.

Next, notice that the size of the subproblems decreases by a factor of $b$ with each level of recursion, and therefore reaches the base case after $\log_b n$ levels. This is the height of the recursion tree. Its branching factor is $a$, so the $k$th level of the tree is made up of $a^k$ subproblems, each of size $n/b^k$ (Figure 2.3). The total work done at this level is
$$a^k \times O\left(\frac{n}{b^k}\right)^d = O(n^d) \times \left(\frac{a}{b^d}\right)^k.$$
As $k$ goes from 0 (the root) to $\log_b n$ (the leaves), these numbers form a geometric series with ratio $a/b^d$. Finding the sum of such a series in big-$O$ notation is easy (Exercise 0.2), and comes down to three cases.

1. The ratio is less than 1. Then the series is decreasing, and its sum is just given by its first term, $O(n^d)$.
2. The ratio is greater than 1.
The series is increasing and its sum is given by its last term, $O(n^{\log_b a})$:
$$n^d \left(\frac{a}{b^d}\right)^{\log_b n} = n^d \left(\frac{a^{\log_b n}}{(b^{\log_b n})^d}\right) = a^{\log_b n} = a^{(\log_a n)(\log_b a)} = n^{\log_b a}.$$
3. The ratio is exactly 1. In this case all $O(\log n)$ terms of the series are equal to $O(n^d)$.

These cases translate directly into the three contingencies in the theorem statement.

Binary search

The ultimate divide-and-conquer algorithm is, of course, binary search: to find a key $k$ in a large file containing keys $z[0, 1, \ldots, n-1]$ in sorted order, we first compare $k$ with $z[n/2]$, and depending on the result we recurse either on the first half of the file, $z[0, \ldots, n/2 - 1]$, or on the second half, $z[n/2, \ldots, n-1]$. The recurrence now is $T(n) = T(\lceil n/2 \rceil) + O(1)$, which is the case $a = 1$, $b = 2$, $d = 0$. Plugging into our master theorem we get the familiar solution: a running time of just $O(\log n)$.

2.3 Mergesort

The problem of sorting a list of numbers lends itself immediately to a divide-and-conquer strategy: split the list into two halves, recursively sort each half, and then merge the two sorted sublists.

    function mergesort(a[1...n])
    Input:  An array of numbers a[1...n]
    Output: A sorted version of this array

    if n > 1:
        return merge(mergesort(a[1...⌊n/2⌋]), mergesort(a[⌊n/2⌋+1...n]))
    else:
        return a

The correctness of this algorithm is self-evident, as long as a correct merge subroutine is specified. If we are given two sorted arrays $x[1 \ldots k]$ and $y[1 \ldots l]$, how do we efficiently merge them into a single sorted array $z[1 \ldots k + l]$? Well, the very first element of $z$ is either $x[1]$ or $y[1]$, whichever is smaller. The rest of $z[\cdot]$ can then be constructed recursively.

Figure 2.4 The sequence of merge operations in mergesort.
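The split-and-merge idea described so far can be sketched in Python. This is a hedged illustration rather than the book's code: lists are 0-indexed, and `merge` is written as a two-pointer loop instead of recursively, a substitution made here to avoid deep recursion and repeated slicing.

```python
def merge(x: list, y: list) -> list:
    """Merge two sorted lists into one sorted list in O(len(x) + len(y)) time."""
    z = []
    i = j = 0
    # Repeatedly take the smaller of the two front elements.
    while i < len(x) and j < len(y):
        if x[i] <= y[j]:
            z.append(x[i])
            i += 1
        else:
            z.append(y[j])
            j += 1
    # At most one of these tails is nonempty.
    z.extend(x[i:])
    z.extend(y[j:])
    return z


def mergesort(a: list) -> list:
    """Sort a list by recursively sorting each half and merging the results."""
    if len(a) <= 1:
        return a
    mid = len(a) // 2
    return merge(mergesort(a[:mid]), mergesort(a[mid:]))


if __name__ == "__main__":
    print(mergesort([10, 2, 5, 3, 7, 13, 1, 6]))
```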
    function merge(x[1...k], y[1...l])
    if k = 0: return y[1...l]
    if l = 0: return x[1...k]
    if x[1] ≤ y[1]:
        return x[1] ∘ merge(x[2...k], y[1...l])
    else:
        return y[1] ∘ merge(x[1...k], y[2...l])

Here ∘ denotes concatenation. This merge procedure does a constant amount of work per recursive call (provided the required array space is allocated in advance), for a total running time of $O(k + l)$. Thus merges are linear, and the overall time taken by mergesort is
$$T(n) = 2T(n/2) + O(n),$$
or $O(n \log n)$.

Looking back at the mergesort algorithm, we see that all the real work is done in merging, which doesn't start until the recursion gets down to singleton arrays. The singletons are merged in pairs, to yield arrays with two elements. Then pairs of these 2-tuples are merged, producing 4-tuples, and so on. Figure 2.4 shows an example.

This viewpoint also suggests how mergesort might be made iterative. At any given moment, there is a set of "active" arrays (initially, the singletons), which are merged in pairs to give the next batch of active arrays. These arrays can be organized in a queue, and processed by repeatedly removing two arrays from the front of the queue, merging them, and putting the result at the end of the queue.

In the following pseudocode, the primitive operation inject adds an element to the end of the queue while eject removes and returns the element at the front of the queue.

    function iterative-mergesort(a[1...n])
    Input: elements a_1, a_2, ..., a_n to be sorted

    Q = [ ] (empty queue)
    for i = 1 to n:
        inject(Q, [a_i])
    while |Q| > 1:
        inject(Q, merge(eject(Q), eject(Q)))
    return eject(Q)

An n log n lower bound for sorting

Sorting algorithms can be depicted as trees. The one in the following figure sorts an array of three elements, $a_1, a_2, a_3$. It starts by comparing $a_1$ to $a_2$ and, if the first is larger, compares it with $a_3$; otherwise it compares $a_2$ and $a_3$. And so on.
Eventually we end up at a leaf, and this leaf is labeled with the true order of the three elements as a permutation of 1, 2, 3. For example, if $a_2$ […] 0. There are many ways to see this. The easiest is to notice that $n! \ge (n/2)^{(n/2)}$ because $n! = 1 \cdot 2 \cdots n$ contains at least $n/2$ factors larger than $n/2$; and to then take logs of both sides. Another is to recall Stirling's formula
$$n! \approx \sqrt{\pi\left(2n + \tfrac{1}{3}\right)} \cdot n^n \cdot e^{-n}.$$
Either way, we have established that any comparison tree that sorts $n$ elements must make, in the worst case, $\Omega(n \log n)$ comparisons, and hence mergesort is optimal!

Well, there is some fine print: this neat argument applies only to algorithms that use comparisons. Is it conceivable that there are alternative sorting strategies, perhaps using sophisticated numerical manipulations, that work in linear time? The answer is yes, under certain exceptional circumstances: the canonical such example is when the elements to be sorted are integers that lie in a small range (Exercise 2.20).

2.4 Medians

The median of a list of numbers is its 50th percentile: half the numbers are bigger than it, and half are smaller. For instance, the median of [45, 1, 10, 30, 25] is 25, since this is the middle element when the numbers are arranged in order. If the list has even length, there are two choices for what the middle element could be, in which case we pick the smaller of the two, say.

The purpose of the median is to summarize a set of numbers by a single, typical value. The mean, or average, is also very commonly used for this, but the median is in a sense more typical of the data: it is always one of the data values, unlike the mean, and it is less sensitive to outliers. For instance, the median of a list of a hundred 1's is (rightly) 1, as is the mean. However, if just one of these numbers gets accidentally corrupted to 10,000, the mean shoots up above 100, while the median is unaffected.

Computing the median of $n$ numbers is easy: just sort them.
The drawback is that this takes $O(n \log n)$ time, whereas we would ideally like something linear. We have reason to be hopeful, because sorting is doing far more work than we really need: we just want the middle element and don't care about the relative ordering of the rest of them.

When looking for a recursive solution, it is paradoxically often easier to work with a more general version of the problem, for the simple reason that this gives a more powerful step to recurse upon. In our case, the generalization we will consider is selection.

SELECTION
Input: A list of numbers $S$; an integer $k$
Output: The $k$th smallest element of $S$

For instance, if $k = 1$, the minimum of $S$ is sought, whereas if $k = \lfloor |S|/2 \rfloor$, it is the median.

A randomized divide-and-conquer algorithm for selection

Here's a divide-and-conquer approach to selection. For any number $v$, imagine splitting list $S$ into three categories: elements smaller than $v$, those equal to $v$ (there might be duplicates), and those greater than $v$. Call these $S_L$, $S_v$, and $S_R$ respectively. For instance, if the array

    S   : 2 36 5 21 8 13 11 20 5 4 1

is split on $v = 5$, the three subarrays generated are

    S_L : 2 4 1
    S_v : 5 5
    S_R : 36 21 8 13 11 20

The search can instantly be narrowed down to one of these sublists. If we want, say, the eighth-smallest element of $S$, we know it must be the third-smallest element of $S_R$ since $|S_L| + |S_v| = 5$. That is, selection($S$, 8) = selection($S_R$, 3). More generally, by checking $k$ against the sizes of the subarrays, we can quickly determine which of them holds the desired element:
$$\text{selection}(S, k) = \begin{cases} \text{selection}(S_L, k) & \text{if } k \le |S_L| \\ v & \text{if } |S_L| < k \le |S_L| + |S_v| \\ \text{selection}(S_R, k - |S_L| - |S_v|) & \text{if } k > |S_L| + |S_v|. \end{cases}$$
The three sublists $S_L$, $S_v$, and $S_R$ can be computed from $S$ in linear time; in fact, this computation can even be done in place, that is, without allocating new memory (Exercise 2.15). We then recurse on the appropriate sublist.
The effect of the split is thus to shrink the number of elements from $|S|$ to at most $\max\{|S_L|, |S_R|\}$.

Our divide-and-conquer algorithm for selection is now fully specified, except for the crucial detail of how to choose $v$. It should be picked quickly, and it should shrink the array substantially, the ideal situation being $|S_L|, |S_R| \approx \frac{1}{2}|S|$. If we could always guarantee this situation, we would get a running time of
$$T(n) = T(n/2) + O(n),$$
which is linear as desired. But this requires picking $v$ to be the median, which is our ultimate goal! Instead, we follow a much simpler alternative: we pick $v$ randomly from $S$.

Efficiency analysis

Naturally, the running time of our algorithm depends on the random choices of $v$. It is possible that due to persistent bad luck we keep picking $v$ to be the largest element of the array (or the smallest element), and thereby shrink the array by only one element each time. In the earlier example, we might first pick $v = 36$, then $v = 21$, and so on. This worst-case scenario would force our selection algorithm to perform
$$n + (n - 1) + (n - 2) + \cdots + \frac{n}{2} = \Theta(n^2)$$
operations (when computing the median), but it is extremely unlikely to occur. Equally unlikely is the best possible case we discussed before, in which each randomly chosen $v$ just happens to split the array perfectly in half, resulting in a running time of $O(n)$. Where, in this spectrum from $O(n)$ to $\Theta(n^2)$, does the average running time lie? Fortunately, it lies very close to the best-case time.

To distinguish between lucky and unlucky choices of $v$, we will call $v$ good if it lies within the 25th to 75th percentile of the array that it is chosen from. We like these choices of $v$ because they ensure that the sublists $S_L$ and $S_R$ have size at most three-fourths that of $S$ (do you see why?), so that the array shrinks substantially. Fortunately, good $v$'s are abundant: half the elements of any list must fall between the 25th to 75th percentile!
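The selection procedure with a randomly chosen $v$ can be sketched as follows. This is an illustrative version under stated assumptions: it builds the three sublists with list comprehensions (so it is not in place), and it loops instead of recursing, which is a convenience chosen here rather than something the text prescribes.

```python
import random


def selection(S: list, k: int):
    """Return the k-th smallest element of S (k = 1 gives the minimum)."""
    if not 1 <= k <= len(S):
        raise ValueError("k must satisfy 1 <= k <= len(S)")
    while True:
        v = random.choice(S)  # random splitter
        S_L = [x for x in S if x < v]
        S_v = [x for x in S if x == v]
        S_R = [x for x in S if x > v]
        if k <= len(S_L):
            S = S_L  # answer lies among the smaller elements
        elif k <= len(S_L) + len(S_v):
            return v  # answer equals the splitter
        else:
            k -= len(S_L) + len(S_v)  # discard S_L and S_v
            S = S_R


if __name__ == "__main__":
    S = [2, 36, 5, 21, 8, 13, 11, 20, 5, 4, 1]
    print(selection(S, 8))  # eighth-smallest element of the example array
```

Whatever random choices are made, the returned value is the same; only the running time varies.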
Given that a randomly chosen $v$ has a 50% chance of being good, how many $v$'s do we need to pick on average before getting a good one? Here's a more familiar reformulation (see also Exercise 1.34):

Lemma On average a fair coin needs to be tossed two times before a "heads" is seen.

Proof. Let $E$ be the expected number of tosses before a heads is seen. We certainly need at least one toss, and if it's heads, we're done. If it's tails (which occurs with probability 1/2), we need to repeat. Hence $E = 1 + \frac{1}{2}E$, which works out to $E = 2$.

Therefore, after two split operations on average, the array will shrink to at most three-fourths of its size. Letting $T(n)$ be the expected running time on an array of size $n$, we get
$$T(n) \le T(3n/4) + O(n).$$
This follows by taking expected values of both sides of the following statement:

    Time taken on an array of size n
        ≤ (time taken on an array of size 3n/4) + (time to reduce array size to ≤ 3n/4),

and, for the right-hand side, using the familiar property that the expectation of the sum is the sum of the expectations.

From this recurrence we conclude that $T(n) = O(n)$: on any input, our algorithm returns the correct answer after a linear number of steps, on the average.

The Unix sort command

Comparing the algorithms for sorting and median-finding we notice that, beyond the common divide-and-conquer philosophy and structure, they are exact opposites. Mergesort splits the array in two in the most convenient way (first half, second half), without any regard to the magnitudes of the elements in each half; but then it works hard to put the sorted subarrays together. In contrast, the median algorithm is careful about its splitting (smaller numbers first, then the larger ones), but its work ends with the recursive call.

Quicksort is a sorting algorithm that splits the array in exactly the same way as the median algorithm; and once the subarrays are sorted, by two recursive calls, there is nothing more to do.
Its worst-case performance is $\Theta(n^2)$, like that of median-finding. But it can be proved (Exercise 2.24) that its average case is $O(n \log n)$; furthermore, empirically it outperforms other sorting algorithms. This has made quicksort a favorite in many applications; for instance, it is the basis of the code by which really enormous files are sorted.

2.5 Matrix multiplication

The product of two $n \times n$ matrices $X$ and $Y$ is a third $n \times n$ matrix $Z = XY$, with $(i, j)$th entry
$$Z_{ij} = \sum_{k=1}^{n} X_{ik} Y_{kj}.$$
To make it more visual, $Z_{ij}$ is the dot product of the $i$th row of $X$ with the $j$th column of $Y$:

[Figure: row $i$ of $X$ times column $j$ of $Y$ gives entry $(i, j)$ of $Z$.]

In general, $XY$ is not the same as $YX$; matrix multiplication is not commutative.

The preceding formula implies an $O(n^3)$ algorithm for matrix multiplication: there are $n^2$ entries to be computed, and each takes $O(n)$ time. For quite a while, this was widely believed to be the best running time possible, and it was even proved that in certain models of computation no algorithm could do better. It was therefore a source of great excitement when in 1969, the German mathematician Volker Strassen announced a significantly more efficient algorithm, based upon divide-and-conquer.

Matrix multiplication is particularly easy to break into subproblems, because it can be performed blockwise. To see what this means, carve $X$ into four $n/2 \times n/2$ blocks, and also $Y$:
$$X = \begin{bmatrix} A & B \\ C & D \end{bmatrix}, \qquad Y = \begin{bmatrix} E & F \\ G & H \end{bmatrix}.$$
Then their product can be expressed in terms of these blocks and is exactly as if the blocks were single elements (Exercise 2.11).
$$XY = \begin{bmatrix} A & B \\ C & D \end{bmatrix} \begin{bmatrix} E & F \\ G & H \end{bmatrix} = \begin{bmatrix} AE + BG & AF + BH \\ CE + DG & CF + DH \end{bmatrix}$$
We now have a divide-and-conquer strategy: to compute the size-$n$ product $XY$, recursively compute eight size-$n/2$ products $AE, BG, AF, BH, CE, DG, CF, DH$, and then do a few $O(n^2)$-time additions. The total running time is described by the recurrence relation
$$T(n) = 8T(n/2) + O(n^2).$$
This comes out to an unimpressive $O(n^3)$, the same as for the default algorithm.
But the efficiency can be further improved, and as with integer multiplication, the key is some clever algebra. It turns out $XY$ can be computed from just seven $n/2 \times n/2$ subproblems, via a decomposition so tricky and intricate that one wonders how Strassen was ever able to discover it!
$$XY = \begin{bmatrix} P_5 + P_4 - P_2 + P_6 & P_1 + P_2 \\ P_3 + P_4 & P_1 + P_5 - P_3 - P_7 \end{bmatrix}$$
where
$$\begin{aligned} P_1 &= A(F - H) & P_5 &= (A + D)(E + H) \\ P_2 &= (A + B)H & P_6 &= (B - D)(G + H) \\ P_3 &= (C + D)E & P_7 &= (A - C)(E + F) \\ P_4 &= D(G - E) \end{aligned}$$
The new running time is
$$T(n) = 7T(n/2) + O(n^2),$$
which by the master theorem works out to $O(n^{\log_2 7}) \approx O(n^{2.81})$.

2.6 The fast Fourier transform

We have so far seen how divide-and-conquer gives fast algorithms for multiplying integers and matrices; our next target is polynomials. The product of two degree-$d$ polynomials is a polynomial of degree $2d$, for example:
$$(1 + 2x + 3x^2) \cdot (2 + x + 4x^2) = 2 + 5x + 12x^2 + 11x^3 + 12x^4.$$
More generally, if $A(x) = a_0 + a_1 x + \cdots + a_d x^d$ and $B(x) = b_0 + b_1 x + \cdots + b_d x^d$, their product $C(x) = A(x) \cdot B(x) = c_0 + c_1 x + \cdots + c_{2d} x^{2d}$ has coefficients
$$c_k = a_0 b_k + a_1 b_{k-1} + \cdots + a_k b_0 = \sum_{i=0}^{k} a_i b_{k-i}$$
(for $i > d$, take $a_i$ and $b_i$ to be zero). Computing $c_k$ from this formula takes $O(k)$ steps, and finding all $2d + 1$ coefficients would therefore seem to require $\Theta(d^2)$ time. Can we possibly multiply polynomials faster than this?

The solution we will develop, the fast Fourier transform, has revolutionized (indeed, defined) the field of signal processing (see the following box). Because of its huge importance, and its wealth of insights from different fields of study, we will approach it a little more leisurely than usual. The reader who wants just the core algorithm can skip directly to Section 2.6.4.

Why multiply polynomials?

For one thing, it turns out that the fastest algorithms we have for multiplying integers rely heavily on polynomial multiplication; after all, polynomials and binary integers are quite similar: just replace the variable $x$ by the base 2, and watch out for carries.
But perhaps more importantly, multiplying polynomials is crucial for signal processing.

A signal is any quantity that is a function of time (as in Figure (a)) or of position. It might, for instance, capture a human voice by measuring fluctuations in air pressure close to the speaker's mouth, or alternatively, the pattern of stars in the night sky, by measuring brightness as a function of angle.

[Figure: (a) a signal $a(t)$ as a function of time $t$; (b) the signal sampled at discrete times; (c) the unit impulse $\delta(t)$.]

In order to extract information from a signal, we need to first digitize it by sampling (Figure (b)) and, then, to put it through a system that will transform it in some way. The output is called the response of the system:

    signal → SYSTEM → response

An important class of systems are those that are linear (the response to the sum of two signals is just the sum of their individual responses) and time invariant (shifting the input signal by time $t$ produces the same output, also shifted by $t$). Any system with these properties is completely characterized by its response to the simplest possible input signal: the unit impulse $\delta(t)$, consisting solely of a "jerk" at $t = 0$ (Figure (c)). To see this, first consider the close relative $\delta(t - i)$, a shifted impulse in which the jerk occurs at time $i$. Any signal $a(t)$ can be expressed as a linear combination of these, letting $\delta(t - i)$ pick out its behavior at time $i$,
$$a(t) = \sum_{i=0}^{T-1} a(i)\,\delta(t - i)$$
(if the signal consists of $T$ samples).
By linearity, the system response to input $a(t)$ is determined by the responses to the various $\delta(t - i)$. And by time invariance, these are in turn just shifted copies of the impulse response $b(t)$, the response to $\delta(t)$. In other words, the output of the system at time $k$ is
$$c(k) = \sum_{i=0}^{k} a(i)\,b(k - i),$$
exactly the formula for polynomial multiplication!

2.6.1 An alternative representation of polynomials

To arrive at a fast algorithm for polynomial multiplication we take inspiration from an important property of polynomials.

Fact A degree-$d$ polynomial is uniquely characterized by its values at any $d + 1$ distinct points.

A familiar instance of this is that "any two points determine a line." We will later see why the more general statement is true (page 76), but for the time being it gives us an alternative representation of polynomials. Fix any distinct points $x_0, \ldots, x_d$. We can specify a degree-$d$ polynomial $A(x) = a_0 + a_1 x + \cdots + a_d x^d$ by either one of the following:

1. Its coefficients $a_0, a_1, \ldots, a_d$
2. The values $A(x_0), A(x_1), \ldots, A(x_d)$

Of these two representations, the second is the more attractive for polynomial multiplication. Since the product $C(x)$ has degree $2d$, it is completely determined by its value at any $2d + 1$ points. And its value at any given point $z$ is easy enough to figure out, just $A(z)$ times $B(z)$. Thus polynomial multiplication takes linear time in the value representation.

The problem is that we expect the input polynomials, and also their product, to be specified by coefficients. So we need to first translate from coefficients to values (which is just a matter of evaluating the polynomial at the chosen points), then multiply in the value representation, and finally translate back to coefficients, a process called interpolation.

    Coefficient representation          Evaluation →          Value representation
    a_0, a_1, ..., a_d                 ← Interpolation        A(x_0), A(x_1), ..., A(x_d)

Figure 2.5 presents the resulting algorithm.
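To make the value representation concrete, the small Python sketch below checks on the earlier example that multiplying values pointwise agrees with evaluating the product polynomial. The helper names `evaluate` and `multiply_coefficients`, the choice of Horner's rule for evaluation, and the sample points are illustrative assumptions, not part of the text.

```python
def evaluate(coeffs, x):
    """Evaluate a_0 + a_1*x + ... + a_d*x^d using Horner's rule."""
    result = 0
    for a in reversed(coeffs):
        result = result * x + a
    return result


def multiply_coefficients(a, b):
    """Quadratic-time product: c_k is the sum of a_i * b_(k-i)."""
    c = [0] * (len(a) + len(b) - 1)
    for i, ai in enumerate(a):
        for j, bj in enumerate(b):
            c[i + j] += ai * bj
    return c


A = [1, 2, 3]  # 1 + 2x + 3x^2
B = [2, 1, 4]  # 2 + x + 4x^2
C = multiply_coefficients(A, B)

# Any 2d + 1 = 5 distinct points pin down the degree-4 product.
points = [0, 1, 2, 3, 4]
values_of_C = [evaluate(A, x) * evaluate(B, x) for x in points]

if __name__ == "__main__":
    print(C)
    print(values_of_C == [evaluate(C, x) for x in points])
```

The pointwise products take one multiplication per point; the cost that remains is in moving between the two representations.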
The equivalence of the two polynomial representations makes it clear that this high-level approach is correct, but how efficient is it? Certainly the selection step and the $n$ multiplications are no trouble at all, just linear time.³ But (leaving aside interpolation, about which we know even less) how about evaluation? Evaluating a polynomial of degree $d \le n$ at a single point takes $O(n)$ steps (Exercise 2.29), and so the baseline for $n$ points is $\Theta(n^2)$. We'll now see that the fast Fourier transform (FFT) does it in just $O(n \log n)$ time, for a particularly clever choice of $x_0, \ldots, x_{n-1}$ in which the computations required by the individual points overlap with one another and can be shared.

³In a typical setting for polynomial multiplication, the coefficients of the polynomials are real numbers and, moreover, are small enough that basic arithmetic operations (adding and multiplying) take unit time. We will assume this to be the case without any great loss of generality; in particular, the time bounds we obtain are easily adjustable to situations with larger numbers.

Figure 2.5 Polynomial multiplication

    Input:  Coefficients of two polynomials, A(x) and B(x), of degree d
    Output: Their product C = A · B

    Selection
        Pick some points x_0, x_1, ..., x_{n-1}, where n ≥ 2d + 1
    Evaluation
        Compute A(x_0), A(x_1), ..., A(x_{n-1}) and B(x_0), B(x_1), ..., B(x_{n-1})
    Multiplication
        Compute C(x_k) = A(x_k)B(x_k) for all k = 0, ..., n - 1
    Interpolation
        Recover C(x) = c_0 + c_1 x + ... + c_{2d} x^{2d}

2.6.2 Evaluation by divide-and-conquer

Here's an idea for how to pick the $n$ points at which to evaluate a polynomial $A(x)$ of degree $\le n - 1$. If we choose them to be positive-negative pairs, that is,
$$\pm x_0, \pm x_1, \ldots, \pm x_{n/2-1},$$
then the computations required for each $A(x_i)$ and $A(-x_i)$ overlap a lot, because the even powers of $x_i$ coincide with those of $-x_i$.
To investigate this, we need to split $A(x)$ into its odd and even powers, for instance
$$3 + 4x + 6x^2 + 2x^3 + x^4 + 10x^5 = (3 + 6x^2 + x^4) + x(4 + 2x^2 + 10x^4).$$
Notice that the terms in parentheses are polynomials in $x^2$. More generally,
$$A(x) = A_e(x^2) + xA_o(x^2),$$
where $A_e(\cdot)$, with the even-numbered coefficients, and $A_o(\cdot)$, with the odd-numbered coefficients, are polynomials of degree $\le n/2 - 1$ (assume for convenience that $n$ is even). Given paired points $\pm x_i$, the calculations needed for $A(x_i)$ can be recycled toward computing $A(-x_i)$:
$$\begin{aligned} A(x_i) &= A_e(x_i^2) + x_i A_o(x_i^2) \\ A(-x_i) &= A_e(x_i^2) - x_i A_o(x_i^2). \end{aligned}$$
In other words, evaluating $A(x)$ at $n$ paired points $\pm x_0, \ldots, \pm x_{n/2-1}$ reduces to evaluating $A_e(x)$ and $A_o(x)$ (which each have half the degree of $A(x)$) at just $n/2$ points, $x_0^2, \ldots, x_{n/2-1}^2$.

    Evaluate:               A(x), degree ≤ n - 1
                            at: +x_0, -x_0, +x_1, -x_1, ..., +x_{n/2-1}, -x_{n/2-1}
    Equivalently, evaluate: A_e(x) and A_o(x), degree ≤ n/2 - 1
                            at: x_0^2, x_1^2, ..., x_{n/2-1}^2

The original problem of size $n$ is in this way recast as two subproblems of size $n/2$, followed by some linear-time arithmetic. If we could recurse, we would get a divide-and-conquer procedure with running time
$$T(n) = 2T(n/2) + O(n),$$
which is $O(n \log n)$, exactly what we want.

But we have a problem: The plus-minus trick only works at the top level of the recursion. To recurse at the next level, we need the $n/2$ evaluation points $x_0^2, x_1^2, \ldots, x_{n/2-1}^2$ to be themselves plus-minus pairs. But how can a square be negative? The task seems impossible! Unless, of course, we use complex numbers.

Fine, but which complex numbers? To figure this out, let us "reverse engineer" the process. At the very bottom of the recursion, we have a single point. This point might as well be 1, in which case the level above it must consist of its square roots, $\pm\sqrt{1} = \pm 1$.

[Figure: tree of square roots; the bottom level is $+1$, the level above it is $+1, -1$, and the level above that is $+1, -1, +i, -i$, and so on.]

The next level up then has $\pm\sqrt{+1} = \pm 1$ as well as the complex numbers $\pm\sqrt{-1} = \pm i$, where $i$ is the imaginary unit. By continuing in this manner, we eventually reach the initial set of $n$ points.
Perhaps you have already guessed what they are: the complex $n$th roots of unity, that is, the $n$ complex solutions to the equation $z^n = 1$.

Figure 2.6 is a pictorial review of some basic facts about complex numbers. The third panel of this figure introduces the $n$th roots of unity: the complex numbers $1, \omega, \omega^2, \ldots, \omega^{n-1}$, where $\omega = e^{2\pi i/n}$. If $n$ is even,
1. The $n$th roots are plus-minus paired, $\omega^{n/2+j} = -\omega^j$.
2. Squaring them produces the $(n/2)$nd roots of unity.
Therefore, if we start with these numbers for some $n$ that is a power of 2, then at successive levels of recursion we will have the $(n/2^k)$th roots of unity, for $k = 0, 1, 2, 3, \ldots$. All these sets of numbers are plus-minus paired, and so our divide-and-conquer, as shown in the last panel, works perfectly. The resulting algorithm is the fast Fourier transform (Figure 2.7).

Figure 2.6 The complex roots of unity are ideal for our divide-and-conquer scheme.

The complex plane. $z = a + bi$ is plotted at position $(a, b)$.
Polar coordinates: rewrite as $z = r(\cos\theta + i\sin\theta) = re^{i\theta}$, denoted $(r, \theta)$.
  length $r = \sqrt{a^2 + b^2}$.
  angle $\theta \in [0, 2\pi)$: $\cos\theta = a/r$, $\sin\theta = b/r$.
  $\theta$ can always be reduced modulo $2\pi$.
Examples:
  Number          $-1$         $i$            $5 + 5i$
  Polar coords    $(1, \pi)$   $(1, \pi/2)$   $(5\sqrt{2}, \pi/4)$

Multiplying is easy in polar coordinates. Multiply the lengths and add the angles:
\[ (r_1, \theta_1) \times (r_2, \theta_2) = (r_1 r_2, \theta_1 + \theta_2). \]
For any $z = (r, \theta)$,
  $-z = (r, \theta + \pi)$ since $-1 = (1, \pi)$.
  If $z$ is on the unit circle (i.e., $r = 1$), then $z^n = (1, n\theta)$.

The $n$th complex roots of unity. Solutions to the equation $z^n = 1$. By the multiplication rule: solutions are $z = (1, \theta)$, for $\theta$ a multiple of $2\pi/n$ (shown here for $n = 16$).
For even $n$:
  These numbers are plus-minus paired: $-(1, \theta) = (1, \theta + \pi)$.
  Their squares are the $(n/2)$nd roots of unity, shown here with boxes around them.

Divide-and-conquer step. Evaluate $A(x)$ at $n$th roots of unity (paired); divide and conquer; evaluate $A_e(x)$, $A_o(x)$ at $(n/2)$nd roots (still paired). ($n$ is a power of 2.)

Figure 2.7 The fast Fourier transform (polynomial formulation)

function FFT($A$, $\omega$)
Input: Coefficient representation of a polynomial $A(x)$ of degree $\le n-1$, where $n$ is a power of 2
       $\omega$, an $n$th root of unity
Output: Value representation $A(\omega^0), \ldots, A(\omega^{n-1})$

if $\omega = 1$: return $A(1)$
express $A(x)$ in the form $A_e(x^2) + x A_o(x^2)$
call FFT($A_e$, $\omega^2$) to evaluate $A_e$ at even powers of $\omega$
call FFT($A_o$, $\omega^2$) to evaluate $A_o$ at even powers of $\omega$
for $j = 0$ to $n-1$:
    compute $A(\omega^j) = A_e(\omega^{2j}) + \omega^j A_o(\omega^{2j})$
return $A(\omega^0), \ldots, A(\omega^{n-1})$

2.6.3 Interpolation
Let's take stock of where we are. We first developed a high-level scheme for multiplying polynomials (Figure 2.5), based on the observation that polynomials can be represented in two ways, in terms of their coefficients or in terms of their values at a selected set of points.

[Figure: evaluation maps the coefficient representation $a_0, a_1, \ldots, a_{n-1}$ to the value representation $A(x_0), A(x_1), \ldots, A(x_{n-1})$; interpolation maps it back.]

The value representation makes it trivial to multiply polynomials, but we cannot ignore the coefficient representation since it is the form in which the input and output of our overall algorithm are specified.

So we designed the FFT, a way to move from coefficients to values in time just $O(n \log n)$, when the points $\{x_i\}$ are complex $n$th roots of unity $(1, \omega, \omega^2, \ldots, \omega^{n-1})$.
\[ \langle\text{values}\rangle = \mathrm{FFT}(\langle\text{coefficients}\rangle, \omega). \]
The last remaining piece of the puzzle is the inverse operation, interpolation. It will turn out, amazingly, that
\[ \langle\text{coefficients}\rangle = \frac{1}{n}\,\mathrm{FFT}(\langle\text{values}\rangle, \omega^{-1}). \]
Interpolation is thus solved in the most simple and elegant way we could possibly have hoped for: using the same FFT algorithm, but called with $\omega^{-1}$ in place of $\omega$! This might seem like a miraculous coincidence, but it will make a lot more sense when we recast our polynomial operations in the language of linear algebra.
Meanwhile, our $O(n \log n)$ polynomial multiplication algorithm (Figure 2.5) is now fully specified.

A matrix reformulation
To get a clearer view of interpolation, let's explicitly set down the relationship between our two representations for a polynomial $A(x)$ of degree $\le n-1$. They are both vectors of $n$ numbers, and one is a linear transformation of the other:
\[
\begin{bmatrix} A(x_0) \\ A(x_1) \\ \vdots \\ A(x_{n-1}) \end{bmatrix}
=
\begin{bmatrix}
1 & x_0 & x_0^2 & \cdots & x_0^{n-1} \\
1 & x_1 & x_1^2 & \cdots & x_1^{n-1} \\
 & & \vdots & & \\
1 & x_{n-1} & x_{n-1}^2 & \cdots & x_{n-1}^{n-1}
\end{bmatrix}
\begin{bmatrix} a_0 \\ a_1 \\ \vdots \\ a_{n-1} \end{bmatrix}.
\]
Call the matrix in the middle $M$. Its specialized format (a Vandermonde matrix) gives it many remarkable properties, of which the following is particularly relevant to us.

If $x_0, \ldots, x_{n-1}$ are distinct numbers, then $M$ is invertible.

The existence of $M^{-1}$ allows us to invert the preceding matrix equation so as to express coefficients in terms of values. In brief,

Evaluation is multiplication by $M$, while interpolation is multiplication by $M^{-1}$.

This reformulation of our polynomial operations reveals their essential nature more clearly. Among other things, it finally justifies an assumption we have been making throughout, that $A(x)$ is uniquely characterized by its values at any $n$ points; in fact, we now have an explicit formula that will give us the coefficients of $A(x)$ in this situation. Vandermonde matrices also have the distinction of being quicker to invert than more general matrices, in $O(n^2)$ time instead of $O(n^3)$. However, using this for interpolation would still not be fast enough for us, so once again we turn to our special choice of points: the complex roots of unity.

Interpolation resolved
In linear algebra terms, the FFT multiplies an arbitrary $n$-dimensional vector (which we have been calling the coefficient representation) by the $n \times n$ matrix
\[
M_n(\omega) =
\begin{bmatrix}
1 & 1 & 1 & \cdots & 1 \\
1 & \omega & \omega^2 & \cdots & \omega^{n-1} \\
1 & \omega^2 & \omega^4 & \cdots & \omega^{2(n-1)} \\
 & & \vdots & & \\
1 & \omega^j & \omega^{2j} & \cdots & \omega^{(n-1)j} \\
 & & \vdots & & \\
1 & \omega^{n-1} & \omega^{2(n-1)} & \cdots & \omega^{(n-1)(n-1)}
\end{bmatrix}
\begin{matrix}
\leftarrow \text{row for } \omega^0 = 1 \\
\leftarrow \omega \\
\leftarrow \omega^2 \\
\vdots \\
\leftarrow \omega^j \\
\vdots \\
\leftarrow \omega^{n-1}
\end{matrix}
\]
where $\omega$ is a complex $n$th root of unity, and $n$ is a power of 2. Notice how simple this matrix is to describe: its $(j, k)$th entry (starting row- and column-count at zero) is $\omega^{jk}$.

Multiplication by $M = M_n(\omega)$ maps the $k$th coordinate axis (the vector with all zeros except for a 1 at position $k$) onto the $k$th column of $M$. Now here's the crucial observation, which we'll prove shortly: the columns of $M$ are orthogonal (at right angles) to each other. Therefore they can be thought of as the axes of an alternative coordinate system, which is often called the Fourier basis. The effect of multiplying a vector by $M$ is to rotate it from the standard basis, with the usual set of axes, into the Fourier basis, which is defined by the columns of $M$ (Figure 2.8). The FFT is thus a change of basis, a rigid rotation. The inverse of $M$ is the opposite rotation, from the Fourier basis back into the standard basis. When we write out the orthogonality condition precisely, we will be able to read off this inverse transformation with ease:

Inversion formula: $M_n(\omega)^{-1} = \frac{1}{n} M_n(\omega^{-1})$.

But $\omega^{-1}$ is also an $n$th root of unity, and so interpolation (or equivalently, multiplication by $M_n(\omega)^{-1}$) is itself just an FFT operation, but with $\omega$ replaced by $\omega^{-1}$.

Figure 2.8 The FFT takes points in the standard coordinate system, whose axes are shown here as $x_1, x_2, x_3$, and rotates them into the Fourier basis, whose axes are the columns of $M_n(\omega)$, shown here as $f_1, f_2, f_3$. For instance, points in direction $x_1$ get mapped into direction $f_1$.

Now let's get into the details. Take $\omega$ to be $e^{2\pi i/n}$ for convenience, and think of the columns of $M$ as vectors in $\mathbb{C}^n$. Recall that the angle between two vectors $u = (u_0, \ldots, u_{n-1})$ and $v = (v_0, \ldots, v_{n-1})$ in $\mathbb{C}^n$ is just a scaling factor times their inner product
\[ u \cdot v^* = u_0 v_0^* + u_1 v_1^* + \cdots + u_{n-1} v_{n-1}^*, \]
where $z^*$ denotes the complex conjugate⁴ of $z$.
This quantity is maximized when the vectors lie in the same direction and is zero when the vectors are orthogonal to each other.

The fundamental observation we need is the following.

Lemma The columns of matrix $M$ are orthogonal to each other.

Proof. Take the inner product of any columns $j$ and $k$ of matrix $M$,
\[ 1 + \omega^{j-k} + \omega^{2(j-k)} + \cdots + \omega^{(n-1)(j-k)}. \]
This is a geometric series with first term 1, last term $\omega^{(n-1)(j-k)}$, and ratio $\omega^{(j-k)}$. Therefore it evaluates to $(1 - \omega^{n(j-k)})/(1 - \omega^{(j-k)})$, which is 0 except when $j = k$ (the numerator vanishes because $\omega^n = 1$, while the denominator is nonzero whenever $j \ne k$), in which case all terms are 1 and the sum is $n$.

⁴ The complex conjugate of a complex number $z = re^{i\theta}$ is $z^* = re^{-i\theta}$. The complex conjugate of a vector (or matrix) is obtained by taking the complex conjugates of all its entries.

The orthogonality property can be summarized in the single equation
\[ MM^* = nI, \]
since $(MM^*)_{ij}$ is the inner product of the $i$th and $j$th columns of $M$ (do you see why?). This immediately implies $M^{-1} = (1/n)M^*$: we have an inversion formula! But is it the same formula we earlier claimed? Let's see: the $(j, k)$th entry of $M^*$ is the complex conjugate of the corresponding entry of $M$, in other words $\omega^{-jk}$. Whereupon $M^* = M_n(\omega^{-1})$, and we're done.

And now we can finally step back and view the whole affair geometrically. The task we need to perform, polynomial multiplication, is a lot easier in the Fourier basis than in the standard basis. Therefore, we first rotate vectors into the Fourier basis (evaluation), then perform the task (multiplication), and finally rotate back (interpolation). The initial vectors are coefficient representations, while their rotated counterparts are value representations. To efficiently switch between these, back and forth, is the province of the FFT.

2.6.4 A closer look at the fast Fourier transform
Now that our efficient scheme for polynomial multiplication is fully realized, let's home in more closely on the core subroutine that makes it all possible, the fast Fourier transform.
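The definitive FFT algorithm
The FFT takes as input a vector $a = (a_0, \ldots, a_{n-1})$ and a complex number $\omega$ whose powers $1, \omega, \omega^2, \ldots, \omega^{n-1}$ are the complex $n$th roots of unity. It multiplies vector $a$ by the $n \times n$ matrix $M_n(\omega)$, which has $(j, k)$th entry (starting row- and column-count at zero) $\omega^{jk}$. The potential for using divide-and-conquer in this matrix-vector multiplication becomes apparent when $M$'s columns are segregated into evens and odds:
\[
M_n(\omega)\,a
= \Bigl[\;\omega^{jk}\;\Bigr]
\begin{bmatrix} a_0 \\ a_1 \\ a_2 \\ \vdots \\ a_{n-1} \end{bmatrix}
= \Bigl[\;\underbrace{\omega^{2jk}}_{\text{even columns } 2k} \;\Bigm|\; \underbrace{\omega^{j}\cdot\omega^{2jk}}_{\text{odd columns } 2k+1}\;\Bigr]
\begin{bmatrix} a_0 \\ a_2 \\ \vdots \\ a_{n-2} \\ a_1 \\ a_3 \\ \vdots \\ a_{n-1} \end{bmatrix}
= \begin{bmatrix}
\omega^{2jk} & \omega^{j}\cdot\omega^{2jk} \\
\omega^{2jk} & -\omega^{j}\cdot\omega^{2jk}
\end{bmatrix}
\begin{bmatrix} a_0 \\ a_2 \\ \vdots \\ a_{n-2} \\ a_1 \\ a_3 \\ \vdots \\ a_{n-1} \end{bmatrix}
\]
Each bracketed expression gives the entry in row $j$. In the last matrix, the top blocks show row $j$ and the bottom blocks show row $j + n/2$, for $0 \le j < n/2$, with column $2k$ on the left and column $2k+1$ on the right.

In the second step, we have simplified entries in the bottom half of the matrix using $\omega^{n/2} = -1$ and $\omega^n = 1$. Notice that the top left $n/2 \times n/2$ submatrix is $M_{n/2}(\omega^2)$, as is the one on the bottom left. And the top and bottom right submatrices are almost the same as $M_{n/2}(\omega^2)$, but with their $j$th rows multiplied through by $\omega^j$ and $-\omega^j$, respectively. Therefore the final product is the vector whose entries, writing $M_{n/2}$ for $M_{n/2}(\omega^2)$, are
\begin{align*}
\text{row } j: &\quad \text{row } j \text{ of } M_{n/2}\begin{bmatrix} a_0 \\ a_2 \\ \vdots \\ a_{n-2} \end{bmatrix} + \omega^j \cdot \text{row } j \text{ of } M_{n/2}\begin{bmatrix} a_1 \\ a_3 \\ \vdots \\ a_{n-1} \end{bmatrix} \\
\text{row } j + n/2: &\quad \text{row } j \text{ of } M_{n/2}\begin{bmatrix} a_0 \\ a_2 \\ \vdots \\ a_{n-2} \end{bmatrix} - \omega^j \cdot \text{row } j \text{ of } M_{n/2}\begin{bmatrix} a_1 \\ a_3 \\ \vdots \\ a_{n-1} \end{bmatrix}
\end{align*}

Figure 2.9 The fast Fourier transform

function FFT($a$, $\omega$)
Input: An array $a = (a_0, a_1, \ldots, a_{n-1})$, for $n$ a power of 2
       A primitive $n$th root of unity, $\omega$
Output: $M_n(\omega)\,a$

if $\omega = 1$: return $a$
$(s_0, s_1, \ldots, s_{n/2-1})$ = FFT($(a_0, a_2, \ldots, a_{n-2})$, $\omega^2$)
$(s'_0, s'_1, \ldots, s'_{n/2-1})$ = FFT($(a_1, a_3, \ldots, a_{n-1})$, $\omega^2$)
for $j = 0$ to $n/2 - 1$:
    $r_j = s_j + \omega^j s'_j$
    $r_{j+n/2} = s_j - \omega^j s'_j$
return $(r_0, r_1, \ldots, r_{n-1})$

In short, the product of $M_n(\omega)$ with vector $(a_0, \ldots, a_{n-1})$, a size-$n$ problem, can be expressed in terms of two size-$n/2$ problems: the product of $M_{n/2}(\omega^2)$ with $(a_0, a_2, \ldots, a_{n-2})$ and with $(a_1, a_3, \ldots, a_{n-1})$. This divide-and-conquer strategy leads to the definitive FFT algorithm of Figure 2.9, whose running time is $T(n) = 2T(n/2) + O(n) = O(n \log n)$.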
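As a rough illustration of Figure 2.9, here is a minimal Python sketch, not a definitive implementation. The names `fft`, `inverse_fft`, and `multiply` are our own. The recursion stops when `len(a) == 1` rather than when $\omega = 1$, because comparing floating-point complex numbers for equality is unreliable. The inverse relies on the inversion formula $M_n(\omega)^{-1} = \frac{1}{n} M_n(\omega^{-1})$ of Section 2.6.3, and `multiply` assumes integer coefficients so that rounding the final result is safe.

```python
import cmath


def fft(a, omega):
    """Return M_n(omega) * a, following Figure 2.9.

    len(a) must be a power of 2 and omega a primitive len(a)-th root of unity.
    """
    n = len(a)
    if n == 1:  # stands in for the test "omega = 1" in the figure
        return list(a)
    s = fft(a[0::2], omega * omega)        # even-indexed coefficients
    s_prime = fft(a[1::2], omega * omega)  # odd-indexed coefficients
    r = [0j] * n
    w = 1 + 0j  # holds omega**j
    for j in range(n // 2):
        r[j] = s[j] + w * s_prime[j]
        r[j + n // 2] = s[j] - w * s_prime[j]
        w *= omega
    return r


def inverse_fft(values):
    """Interpolate: (1/n) * FFT(values, omega^-1), with omega = e^(2*pi*i/n)."""
    n = len(values)
    omega_inv = cmath.exp(-2j * cmath.pi / n)
    return [v / n for v in fft(values, omega_inv)]


def multiply(p, q):
    """Multiply two integer polynomials given as coefficient lists, lowest degree first."""
    size = len(p) + len(q) - 1  # number of coefficients in the product
    n = 1
    while n < size:             # smallest power of 2 that is >= size
        n *= 2
    omega = cmath.exp(2j * cmath.pi / n)
    pv = fft(list(p) + [0] * (n - len(p)), omega)  # evaluation
    qv = fft(list(q) + [0] * (n - len(q)), omega)
    cv = [x * y for x, y in zip(pv, qv)]           # pointwise multiplication
    coeffs = inverse_fft(cv)                       # interpolation
    return [round(c.real) for c in coeffs[:size]]
```

For example, `multiply([1, 2, 3], [4, 5])` computes $(1 + 2x + 3x^2)(4 + 5x)$ and returns `[4, 13, 22, 15]`.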
The fast Fourier transform unraveled
Throughout all our discussions so far, the fast Fourier transform has remained tightly cocooned within a divide-and-conquer formalism. To fully expose its structure, we now unravel the recursion.

The divide-and-conquer step of the FFT can be drawn as a very simple circuit. Here is how a problem of size $n$ is reduced to two subproblems of size $n/2$ (for clarity, one pair of outputs $(j, j + n/2)$ is singled out):

[Figure: the circuit $\mathrm{FFT}_n$ (input: $a_0, \ldots, a_{n-1}$; output: $r_0, \ldots, r_{n-1}$). The inputs $a_0, a_2, \ldots, a_{n-2}$ feed one $\mathrm{FFT}_{n/2}$ box and $a_1, a_3, \ldots, a_{n-1}$ feed another; their outputs are combined into $r_j$ and $r_{j+n/2}$ through wires weighted $j$ and $j + n/2$.]

We're using a particular shorthand: the edges are wires carrying complex numbers from left to right. A weight of $j$ means "multiply the number on this wire by $\omega^j$." And when two wires come into a junction from the left, the numbers they are carrying get added up. So the two outputs depicted are executing the commands
\begin{align*}
r_j &= s_j + \omega^j s'_j \\
r_{j+n/2} &= s_j - \omega^j s'_j
\end{align*}
from the FFT algorithm (Figure 2.9), via a pattern of wires known as a butterfly.

Unraveling the FFT circuit completely for $n = 8$ elements, we get Figure 2.10. Notice the following.

1. For $n$ inputs there are $\log_2 n$ levels, each with $n$ nodes, for a total of $n \log n$ operations.

2. The inputs are arranged in a peculiar order: 0, 4, 2, 6, 1, 5, 3, 7. Why? Recall that at the top level of recursion, we first bring up the even coefficients of the input and then move on to the odd ones. Then at the next level, the even coefficients of this first group (which therefore are multiples of 4, or equivalently, have zero as their two least significant bits) are brought up, and so on. To put it otherwise, the inputs are arranged by increasing last bit of the binary representation of their index, resolving ties by looking at the next more significant bit(s). The resulting order in binary, 000, 100, 010, 110, 001, 101, 011, 111, is the same as the natural one, 000, 001, 010, 011, 100, 101, 110, 111, except the bits are mirrored!

3. There is a unique path between each input $a_j$ and each output $A(\omega^k)$. This path is most easily described using the binary representations of $j$ and $k$ (shown in Figure 2.10 for convenience). There are two edges out of each node, one going up (the 0-edge) and one going down (the 1-edge). To get to $A(\omega^k)$ from any input node, simply follow the edges specified in the bit representation of $k$, starting from the rightmost bit. (Can you similarly specify the path in the reverse direction?)

4. On the path between $a_j$ and $A(\omega^k)$, the labels add up to $jk \bmod 8$. Since $\omega^8 = 1$, this means that the contribution of input $a_j$ to output $A(\omega^k)$ is $a_j \omega^{jk}$, and therefore the circuit computes correctly the values of polynomial $A(x)$.

5. And finally, notice that the FFT circuit is a natural for parallel computation and direct implementation in hardware.

Figure 2.10 The fast Fourier transform circuit.
[Circuit diagram for $n = 8$: the inputs, from top to bottom, are $a_0, a_4, a_2, a_6, a_1, a_5, a_3, a_7$ (indices 000, 100, 010, 110, 001, 101, 011, 111); the outputs are $A(\omega^0), A(\omega^1), \ldots, A(\omega^7)$ (indices 000 through 111); the edges carry integer weights.]

The slow spread of a fast algorithm
In 1963, during a meeting of President Kennedy's scientific advisors, John Tukey, a mathematician from Princeton, explained to IBM's Dick Garwin a fast method for computing Fourier transforms. Garwin listened carefully, because he was at the time working on ways to detect nuclear explosions from seismographic data, and Fourier transforms were the bottleneck of his method. When he went back to IBM, he asked John Cooley to implement Tukey's algorithm; they decided that a paper should be published so that the idea could not be patented.

Tukey was not very keen to write a paper on the subject, so Cooley took the initiative. And this is how one of the most famous and most cited scientific papers was published in 1965, co-authored by Cooley and Tukey.
The reason Tukey was reluctant to publish the FFT was not secretiveness or pursuit of profit via patents. He just felt that this was a simple observation that was probably already known. This was typical of the period: back then (and for some time later) algorithms were considered second-class mathematical objects, devoid of depth and elegance, and unworthy of serious attention.

But Tukey was right about one thing: it was later discovered that British engineers had used the FFT for hand calculations during the late 1930s. And, to end this chapter with the same great mathematician who started it, a paper by Gauss in the early 1800s on (what else?) interpolation contained essentially the same idea in it! Gauss's paper had remained a secret for so long because it was protected by an old-fashioned cryptographic technique: like most scientific papers of its era, it was written in Latin.

Exercises

2.1. Use the divide-and-conquer integer multiplication algorithm to multiply the two binary integers 10011011 and 10111010.

2.2. Show that for any positive integer $n$ and any base $b$, there must be some power of $b$ lying in the range $[n, bn]$.

2.3. Section 2.2 describes a method for solving recurrence relations which is based on analyzing the recursion tree and deriving a formula for the work done at each level. Another (closely related) method is to expand out the recurrence a few times, until a pattern emerges. For instance, let's start with the familiar $T(n) = 2T(n/2) + O(n)$. Think of $O(n)$ as being $\le cn$ for some constant $c$, so: $T(n) \le 2T(n/2) + cn$. By repeatedly applying this rule, we can bound $T(n)$ in terms of $T(n/2)$, then $T(n/4)$, then $T(n/8)$, and so on, at each step getting closer to the value of $T(\cdot)$ we do know, namely $T(1) = O(1)$.
\begin{align*}
T(n) &\le 2T(n/2) + cn \\
&\le 2[2T(n/4) + cn/2] + cn = 4T(n/4) + 2cn \\
&\le 4[2T(n/8) + cn/4] + 2cn = 8T(n/8) + 3cn \\
&\le 8[2T(n/16) + cn/8] + 3cn = 16T(n/16) + 4cn \\
&\;\;\vdots
\end{align*}
A pattern is emerging... the general term is
\[ T(n) \le 2^k T(n/2^k) + kcn. \]
Plugging in $k = \log_2 n$, we get $T(n) \le nT(1) + cn\log_2 n = O(n \log n)$.
(a) Do the same thing for the recurrence $T(n) = 3T(n/2) + O(n)$. What is the general $k$th term in this case? And what value of $k$ should be plugged in to get the answer?
(b) Now try the recurrence $T(n) = T(n-1) + O(1)$, a case which is not covered by the master theorem. Can you solve this too?

2.4. Suppose you are choosing between the following three algorithms:
• Algorithm A solves problems by dividing them into five subproblems of half the size, recursively solving each subproblem, and then combining the solutions in linear time.
• Algorithm B solves problems of size $n$ by recursively solving two subproblems of size $n-1$ and then combining the solutions in constant time.
• Algorithm C solves problems of size $n$ by dividing them into nine subproblems of size $n/3$, recursively solving each subproblem, and then combining the solutions in $O(n^2)$ time.
What are the running times of each of these algorithms (in big-$O$ notation), and which would you choose?

2.5. Solve the following recurrence relations and give a $\Theta$ bound for each of them.
(a) $T(n) = 2T(n/3) + 1$
(b) $T(n) = 5T(n/4) + n$
(c) $T(n) = 7T(n/7) + n$
(d) $T(n) = 9T(n/3) + n^2$
(e) $T(n) = 8T(n/2) + n^3$
(f) $T(n) = 49T(n/25) + n^{3/2}\log n$
(g) $T(n) = T(n-1) + 2$
(h) $T(n) = T(n-1) + n^c$, where $c \ge 1$ is a constant
(i) $T(n) = T(n-1) + c^n$, where $c > 1$ is some constant
(j) $T(n) = 2T(n-1) + 1$
(k) $T(n) = T(\sqrt{n}) + 1$

2.6. A linear, time-invariant system has the following impulse response:
[Figure: plot of the impulse response $b(t)$ against $t$; the axis labels are $t_0$ and $1/t_0$.]
(a) Describe in words the effect of this system.
(b) What is the corresponding polynomial?

2.7. What is the sum of the $n$th roots of unity? What is their product if $n$ is odd? If $n$ is even?

2.8. Practice with the fast Fourier transform.
(a) What is the FFT of $(1, 0, 0, 0)$?
What is the appropriate value of $\omega$ in this case? And of which sequence is $(1, 0, 0, 0)$ the FFT?
(b) Repeat for $(1, 0, 1, -1)$.

2.9. Practice with polynomial multiplication by FFT.
(a) Suppose that you want to multiply the two polynomials $x + 1$ and $x^2 + 1$ using the FFT. Choose an appropriate power of two, find the FFT of the two sequences, multiply the results componentwise, and compute the inverse FFT to get the final result.
(b) Repeat for the pair of polynomials $1 + x + 2x^2$ and $2 + 3x$.

2.10. Find the unique polynomial of degree 4 that takes on values $p(1) = 2$, $p(2) = 1$, $p(3) = 0$, $p(4) = 4$, and $p(5) = 0$. Write your answer in the coefficient representation.

2.11. In justifying our matrix multiplication algorithm (Section 2.5), we claimed the following blockwise property: if $X$ and $Y$ are $n \times n$ matrices, and
\[ X = \begin{bmatrix} A & B \\ C & D \end{bmatrix}, \quad Y = \begin{bmatrix} E & F \\ G & H \end{bmatrix}, \]
where $A$, $B$, $C$, $D$, $E$, $F$, $G$, and $H$ are $n/2 \times n/2$ submatrices, then the product $XY$ can be expressed in terms of these blocks:
\[ XY = \begin{bmatrix} A & B \\ C & D \end{bmatrix} \begin{bmatrix} E & F \\ G & H \end{bmatrix} = \begin{bmatrix} AE + BG & AF + BH \\ CE + DG & CF + DH \end{bmatrix}. \]
Prove this property.

2.12. How many lines, as a function of $n$ (in $\Theta(\cdot)$ form), does the following program print? Write a recurrence and solve it. You may assume $n$ is a power of 2.

function f(n)
    if n > 1:
        print_line("still going")
        f(n/2)
        f(n/2)

2.13. A binary tree is full if all of its vertices have either zero or two children. Let $B_n$ denote the number of full binary trees with $n$ vertices.
(a) By drawing out all full binary trees with 3, 5, or 7 vertices, determine the exact values of $B_3$, $B_5$, and $B_7$. Why have we left out even numbers of vertices, like $B_4$?
(b) For general $n$, derive a recurrence relation for $B_n$.
(c) Show by induction that $B_n$ is $\Omega(2^n)$.

2.14. You are given an array of $n$ elements, and you notice that some of the elements are duplicates; that is, they appear more than once in the array. Show how to remove all duplicates from the array in time $O(n \log n)$.

2.15. In our median-finding algorithm (Section 2.4), a basic primitive is the split operation, which takes as input an array $S$ and a value $v$ and then divides $S$ into three sets: the elements less than $v$, the elements equal to $v$, and the elements greater than $v$. Show how to implement this split operation in place, that is, without allocating new memory.

2.16. You are given an infinite array $A[\cdot]$ in which the first $n$ cells contain integers in sorted order and the rest of the cells are filled with $\infty$. You are not given the value of $n$. Describe an algorithm that takes an integer $x$ as input and finds a position in the array containing $x$, if such a position exists, in $O(\log n)$ time. (If you are disturbed by the fact that the array $A$ has infinite length, assume instead that it is of length $n$, but that you don't know this length, and that the implementation of the array data type in your programming language returns the error message $\infty$ whenever elements $A[i]$ with $i > n$ are accessed.)

2.17. Given a sorted array of distinct integers $A[1, \ldots, n]$, you want to find out whether there is an index $i$ for which $A[i] = i$. Give a divide-and-conquer algorithm that runs in time $O(\log n)$.

2.18. Consider the task of searching a sorted array $A[1 \ldots n]$ for a given element $x$: a task we usually perform by binary search in time $O(\log n)$. Show that any algorithm that accesses the array only via comparisons (that is, by asking questions of the form "is $A[i] \le z$?"), must take $\Omega(\log n)$ steps.

2.19. A $k$-way merge operation. Suppose you have $k$ sorted arrays, each with $n$ elements, and you want to combine them into a single sorted array of $kn$ elements.
(a) Here's one strategy: Using the merge procedure from Section 2.3, merge the first two arrays, then merge in the third, then merge in the fourth, and so on. What is the time complexity of this algorithm, in terms of $k$ and $n$?
(b) Give a more efficient solution to this problem, using divide-and-conquer.

2.20. Show that any array of integers $x[1 \ldots n]$ can be sorted in $O(n + M)$ time, where
\[ M = \max_i x_i - \min_i x_i. \]
For small $M$, this is linear time: why doesn't the $\Omega(n \log n)$ lower bound apply in this case?

2.21. Mean and median. One of the most basic tasks in statistics is to summarize a set of observations $\{x_1, x_2, \ldots, x_n\} \subseteq \mathbb{R}$ by a single number. Two popular choices for this summary statistic are:
• The median, which we'll call $\mu_1$
• The mean, which we'll call $\mu_2$
(a) Show that the median is the value of $\mu$ that minimizes the function
\[ \sum_i |x_i - \mu|. \]
You can assume for simplicity that $n$ is odd. (Hint: Show that for any $\mu \ne \mu_1$, the function decreases if you move $\mu$ either slightly to the left or slightly to the right.)
(b) Show that the mean is the value of $\mu$ that minimizes the function
\[ \sum_i (x_i - \mu)^2. \]
One way to do this is by calculus. Another method is to prove that for any $\mu \in \mathbb{R}$,
\[ \sum_i (x_i - \mu)^2 = \sum_i (x_i - \mu_2)^2 + n(\mu - \mu_2)^2. \]
Notice how the function for $\mu_2$ penalizes points that are far from $\mu$ much more heavily than the function for $\mu_1$. Thus $\mu_2$ tries much harder to be close to all the observations. This might sound like a good thing at some level, but it is statistically undesirable because just a few outliers can severely throw off the estimate of $\mu_2$. It is therefore sometimes said that $\mu_1$ is a more robust estimator than $\mu_2$. Worse than either of them, however, is $\mu_\infty$, the value of $\mu$ that minimizes the function
\[ \max_i |x_i - \mu|. \]
(c) Show that $\mu_\infty$ can be computed in $O(n)$ time (assuming the numbers $x_i$ are small enough that basic arithmetic operations on them take unit time).

2.22. You are given two sorted lists of size $m$ and $n$. Give an $O(\log m + \log n)$ time algorithm for computing the $k$th smallest element in the union of the two lists.

2.23. An array $A[1 \ldots n]$ is said to have a majority element if more than half of its entries are the same. Given an array, the task is to design an efficient algorithm to tell whether the array has a majority element, and, if so, to find that element.
The elements of the array are not necessarily from some ordered domain like the integers, and so there can be no comparisons of the form "is $A[i] > A[j]$?". (Think of the array elements as GIF files, say.) However you can answer questions of the form: "is $A[i] = A[j]$?" in constant time.
(a) Show how to solve this problem in $O(n \log n)$ time. (Hint: Split the array $A$ into two arrays $A_1$ and $A_2$ of half the size. Does knowing the majority elements of $A_1$ and $A_2$ help you figure out the majority element of $A$? If so, you can use a divide-and-conquer approach.)
(b) Can you give a linear-time algorithm? (Hint: Here's another divide-and-conquer approach:
• Pair up the elements of $A$ arbitrarily, to get $n/2$ pairs
• Look at each pair: if the two elements are different, discard both of them; if they are the same, keep just one of them
Show that after this procedure there are at most $n/2$ elements left, and that they have a majority element if and only if $A$ does.)

2.24. On page 66 there is a high-level description of the quicksort algorithm.
(a) Write down the pseudocode for quicksort.
(b) Show that its worst-case running time on an array of size $n$ is $\Theta(n^2)$.
(c) Show that its expected running time satisfies the recurrence relation
\[ T(n) \le O(n) + \frac{1}{n}\sum_{i=1}^{n-1} \bigl(T(i) + T(n-i)\bigr). \]
Then, show that the solution to this recurrence is $O(n \log n)$.

2.25. In Section 2.1 we described an algorithm that multiplies two $n$-bit binary integers $x$ and $y$ in time $n^a$, where $a = \log_2 3$. Call this procedure fastmultiply($x$, $y$).
(a) We want to convert the decimal integer $10^n$ (a 1 followed by $n$ zeros) into binary. Here is the algorithm (assume $n$ is a power of 2):

function pwr2bin(n)
if n = 1: return $1010_2$
else:
    z = ???
    return fastmultiply(z, z)

Fill in the missing details. Then give a recurrence relation for the running time of the algorithm, and solve the recurrence.
(b) Next, we want to convert any decimal integer $x$ with $n$ digits (where $n$ is a power of 2) into binary. The algorithm is the following:

function dec2bin(x)
if n = 1: return binary[x]
else:
    split x into two decimal numbers $x_L$, $x_R$ with n/2 digits each
    return ???

Here binary[$\cdot$] is a vector that contains the binary representation of all one-digit integers. That is, binary[0] = $0_2$, binary[1] = $1_2$, up to binary[9] = $1001_2$. Assume that a lookup in binary takes $O(1)$ time.
Fill in the missing details. Once again, give a recurrence for the running time of the algorithm, and solve it.

2.26. Professor F. Lake tells his class that it is asymptotically faster to square an $n$-bit integer than to multiply two $n$-bit integers. Should they believe him?

2.27. The square of a matrix $A$ is its product with itself, $AA$.
(a) Show that five multiplications are sufficient to compute the square of a $2 \times 2$ matrix.
(b) What is wrong with the following algorithm for computing the square of an $n \times n$ matrix?
"Use a divide-and-conquer approach as in Strassen's algorithm, except that instead of getting 7 subproblems of size $n/2$, we now get 5 subproblems of size $n/2$ thanks to part (a). Using the same analysis as in Strassen's algorithm, we can conclude that the algorithm runs in time $O(n^{\log_2 5})$."
(c) In fact, squaring matrices is no easier than matrix multiplication. In this part, you will show that if $n \times n$ matrices can be squared in time $S(n) = O(n^c)$, then any two $n \times n$ matrices can be multiplied in time $O(n^c)$.
i. Given two $n \times n$ matrices $A$ and $B$, show that the matrix $AB + BA$ can be computed in time $3S(n) + O(n^2)$.
ii. Given two $n \times n$ matrices $X$ and $Y$, define the $2n \times 2n$ matrices $A$ and $B$ as follows:
\[ A = \begin{bmatrix} X & 0 \\ 0 & 0 \end{bmatrix} \quad\text{and}\quad B = \begin{bmatrix} 0 & Y \\ 0 & 0 \end{bmatrix}. \]
What is $AB + BA$, in terms of $X$ and $Y$?
iii. Using (i) and (ii), argue that the product $XY$ can be computed in time $3S(2n) + O(n^2)$. Conclude that matrix multiplication takes time $O(n^c)$.

2.28. The Hadamard matrices $H_0, H_1, H_2, \ldots$ are defined as follows:
• $H_0$ is the $1 \times 1$ matrix $[1]$
• For $k > 0$, $H_k$ is the $2^k \times 2^k$ matrix
\[ H_k = \begin{bmatrix} H_{k-1} & H_{k-1} \\ H_{k-1} & -H_{k-1} \end{bmatrix} \]
Show that if $v$ is a column vector of length $n = 2^k$, then the matrix-vector product $H_k v$ can be calculated using $O(n \log n)$ operations. Assume that all the numbers involved are small enough that basic arithmetic operations like addition and multiplication take unit time.

2.29. Suppose we want to evaluate the polynomial $p(x) = a_0 + a_1 x + a_2 x^2 + \cdots + a_n x^n$ at point $x$.
(a) Show that the following simple routine, known as Horner's rule, does the job and leaves the answer in $z$.

z = $a_n$
for i = n − 1 downto 0:
    z = zx + $a_i$

(b) How many additions and multiplications does this routine use, as a function of $n$? Can you find a polynomial for which an alternative method is substantially better?

2.30. This problem illustrates how to do the Fourier Transform (FT) in modular arithmetic, for example, modulo 7.
(a) There is a number $\omega$ such that all the powers $\omega, \omega^2, \ldots, \omega^6$ are distinct (modulo 7). Find this $\omega$, and show that $\omega + \omega^2 + \cdots + \omega^6 = 0$. (Interestingly, for any prime modulus there is such a number.)
(b) Using the matrix form of the FT, produce the transform of the sequence $(0, 1, 1, 1, 5, 2)$ modulo 7; that is, multiply this vector by the matrix $M_6(\omega)$, for the value of $\omega$ you found earlier. In the matrix multiplication, all calculations should be performed modulo 7.
(c) Write down the matrix necessary to perform the inverse FT. Show that multiplying by this matrix returns the original sequence. (Again all arithmetic should be performed modulo 7.)
(d) Now show how to multiply the polynomials $x^2 + x + 1$ and $x^3 + 2x - 1$ using the FT modulo 7.

2.31. In Section 1.2.3, we studied Euclid's algorithm for computing the greatest common divisor (gcd) of two positive integers: the largest integer which divides them both. Here we will look at an alternative algorithm based on divide-and-conquer.
(a) Show that the following rule is true.
\[
\gcd(a, b) =
\begin{cases}
2\gcd(a/2, b/2) & \text{if } a, b \text{ are even} \\
\gcd(a, b/2) & \text{if } a \text{ is odd, } b \text{ is even} \\
\gcd((a-b)/2, b) & \text{if } a, b \text{ are odd}
\end{cases}
\]
(b) Give an efficient divide-and-conquer algorithm for greatest common divisor.
(c) How does the efficiency of your algorithm compare to Euclid's algorithm if $a$ and $b$ are $n$-bit integers? (In particular, since $n$ might be large you cannot assume that basic arithmetic operations like addition take constant time.)

2.32. In this problem we will develop a divide-and-conquer algorithm for the following geometric task.

CLOSEST PAIR
Input: A set of points in the plane, $\{p_1 = (x_1, y_1), p_2 = (x_2, y_2), \ldots, p_n = (x_n, y_n)\}$
Output: The closest pair of points: that is, the pair $p_i \ne p_j$ for which the distance between $p_i$ and $p_j$, that is,
\[ \sqrt{(x_i - x_j)^2 + (y_i - y_j)^2}, \]
is minimized.

For simplicity, assume that $n$ is a power of two, and that all the $x$-coordinates $x_i$ are distinct, as are the $y$-coordinates.
Here's a high-level overview of the algorithm:
• Find a value $x$ for which exactly half the points have $x_i \le x$. On this basis, split the points into two groups, $L$ and $R$.
• Recursively find the closest pair in $L$ and in $R$. Say these pairs are $p_L, q_L \in L$ and $p_R, q_R \in R$, with distances $d_L$ and $d_R$ respectively. Let $d$ be the smaller of these two distances.
• It remains to be seen whether there is a point in $L$ and a point in $R$ that are less than distance $d$ apart from each other. To this end, discard all points with $x_i < x - d$ or $x_i > x + d$ and sort the remaining points by $y$-coordinate.
• Now, go through this sorted list, and for each point, compute its distance to the seven subsequent points in the list. Let $p_M, q_M$ be the closest pair found in this way.
• The answer is one of the three pairs $\{p_L, q_L\}$, $\{p_R, q_R\}$, $\{p_M, q_M\}$, whichever is closest.
(a) In order to prove the correctness of this algorithm, start by showing the following property: any square of size $d \times d$ in the plane contains at most four points of $L$.
(b) Now show that the algorithm is correct.
The only case which needs careful consideration is when the closest pair is split between $L$ and $R$.

(c) Write down the pseudocode for the algorithm, and show that its running time is given by the recurrence:

$$T(n) = 2T(n/2) + O(n \log n).$$

Show that the solution to this recurrence is $O(n \log^2 n)$.

(d) Can you bring the running time down to $O(n \log n)$?

Chapter 3
Decompositions of graphs

3.1 Why graphs?

A wide range of problems can be expressed with clarity and precision in the concise pictorial language of graphs. For instance, consider the task of coloring a political map. What is the minimum number of colors needed, with the obvious restriction that neighboring countries should have different colors? One of the difficulties in attacking this problem is that the map itself, even a stripped-down version like Figure 3.1(a), is usually cluttered with irrelevant information: intricate boundaries, border posts where three or more countries meet, open seas, and meandering rivers. Such distractions are absent from the mathematical object of Figure 3.1(b), a graph with one vertex for each country (1 is Brazil, 11 is Argentina) and edges between neighbors. It contains exactly the information needed for coloring, and nothing more. The precise goal is now to assign a color to each vertex so that no edge has endpoints of the same color.

Graph coloring is not the exclusive domain of map designers. Suppose a university needs to schedule examinations for all its classes and wants to use the fewest time slots possible. The only constraint is that two exams cannot be scheduled concurrently if some student will be taking both of them. To express this problem as a graph, use one vertex for each exam and put an edge between two vertices if there is a conflict, that is, if there is somebody taking both endpoint exams. Think of each time slot as having its own color. Then, assigning time slots is exactly the same as coloring this graph!
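As a rough illustration of the exam-scheduling formulation, the sketch below builds the conflict edges from an enrollment table. The function name `conflict_graph`, the student names, and the exam names are all hypothetical, invented for this example. The code only constructs the graph; it does not attempt to color it.

```python
from itertools import combinations


def conflict_graph(enrollments):
    """Return the set of conflict edges between exams.

    `enrollments` maps each student to the set of exams they take
    (a made-up input format). Two exams are joined by an edge when
    at least one student takes both. Each undirected edge is stored
    once, as a tuple with its endpoints in sorted order.
    """
    edges = set()
    for exams in enrollments.values():
        for a, b in combinations(sorted(exams), 2):
            edges.add((a, b))
    return edges


# Hypothetical data: three students, three exams.
enrollments = {
    "alice": {"algebra", "biology"},
    "bob": {"biology", "chemistry"},
    "carol": {"algebra"},
}
edges = conflict_graph(enrollments)
```

Here algebra and chemistry share no student, so they could occupy the same time slot (the same color).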
Some basic operations on graphs arise with such frequency, and in such a diversity of contexts, that a lot of effort has gone into finding efficient procedures for them. This chapter is devoted to some of the most fundamental of these algorithms: those that uncover the basic connectivity structure of a graph.

Formally, a graph is specified by a set of vertices (also called nodes) $V$ and by edges $E$ between select pairs of vertices. In the map example, $V = \{1, 2, 3, \ldots, 13\}$ and $E$ includes, among many other edges, $\{1, 2\}$, $\{9, 11\}$, and $\{7, 13\}$. Here an edge between $x$ and $y$ specifically means “$x$ shares a border with $y$.” This is a symmetric relation (it implies also that $y$ shares a border with $x$), and we denote it using set notation, $e = \{x, y\}$. Such edges are undirected and are part of an undirected graph.

[Figure 3.1: (a) A map and (b) its graph.]

Sometimes graphs depict relations that do not have this reciprocity, in which case it is necessary to use edges with directions on them. There can be directed edges $e$ from $x$ to $y$ (written $e = (x, y)$), or from $y$ to $x$ (written $(y, x)$), or both. A particularly enormous example of a directed graph is the graph of all links in the World Wide Web. It has a vertex for each site on the Internet, and a directed edge $(u, v)$ whenever site $u$ has a link to site $v$: in total, billions of nodes and edges! Understanding even the most basic connectivity properties of the Web is of great economic and social interest. Although the size of this problem is daunting, we will soon see that a lot of valuable information about the structure of a graph can, happily, be determined in just linear time.

3.1.1 How is a graph represented?

We can represent a graph by an adjacency matrix; if there are $n = |V|$ vertices $v_1, \ldots, v_n$, this is an $n \times n$ array whose $(i, j)$th entry is

$$a_{ij} = \begin{cases} 1 & \text{if there is an edge from } v_i \text{ to } v_j \\ 0 & \text{otherwise.} \end{cases}$$

For undirected graphs, the matrix is symmetric since an edge $\{u, v\}$ can be taken in either direction.
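A minimal sketch of the adjacency-matrix definition, assuming vertices are numbered 0 to n-1 (the text numbers them from 1) and using a plain list of lists. The helper name `adjacency_matrix` and the sample edges are illustrative, not from the text.

```python
def adjacency_matrix(n, edges, directed=False):
    """Build an n x n 0/1 matrix with a[i][j] = 1 iff there is an edge i -> j.

    For an undirected graph each edge {i, j} is entered in both
    directions, which is what makes the matrix symmetric.
    """
    a = [[0] * n for _ in range(n)]
    for i, j in edges:
        a[i][j] = 1
        if not directed:
            a[j][i] = 1
    return a


# A made-up undirected path on four vertices: 0 - 1 - 2 - 3.
m = adjacency_matrix(4, [(0, 1), (1, 2), (2, 3)])
has_edge_1_2 = m[1][2] == 1  # a single array lookup
```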
The biggest convenience of this format is that the presence of a particular edge can be checked in constant time, with just one memory access. On the other hand the matrix takes up $O(n^2)$ space, which is wasteful if the graph does not have very many edges.

An alternative representation, with size proportional to the number of edges, is the adjacency list. It consists of $|V|$ linked lists, one per vertex. The linked list for vertex $u$ holds the names of vertices to which $u$ has an outgoing edge, that is, vertices $v$ for which $(u, v) \in E$. Therefore, each edge appears in exactly one of the linked lists if the graph is directed or two of the lists if the graph is undirected. Either way, the total size of the data structure is $O(|E|)$. Checking for a particular edge $(u, v)$ is no longer constant time, because it requires sifting through $u$’s adjacency list. But it is easy to iterate through all neighbors of a vertex (by running down the corresponding linked list), and, as we shall soon see, this turns out to be a very useful operation in graph algorithms. Again, for undirected graphs, this representation has a symmetry of sorts: $v$ is in $u$’s adjacency list if and only if $u$ is in $v$’s adjacency list.

How big is your graph?

Which of the two representations, adjacency matrix or adjacency list, is better? Well, it depends on the relationship between $|V|$, the number of nodes in the graph, and $|E|$, the number of edges. $|E|$ can be as small as $|V|$ (if it gets much smaller, then the graph degenerates; for example, it has isolated vertices), or as large as $|V|^2$ (when all possible edges are present). When $|E|$ is close to the upper limit of this range, we call the graph dense. At the other extreme, if $|E|$ is close to $|V|$, the graph is sparse. As we shall see in this chapter and the next two chapters, exactly where $|E|$ lies in this range is usually a crucial factor in selecting the right graph algorithm.
Or, for that matter, in selecting the graph representation. If it is the World Wide Web graph that we wish to store in computer memory, we should think twice before using an adjacency matrix: at the time of writing, search engines know of about eight billion vertices of this graph, and hence the adjacency matrix would take up dozens of millions of terabits. Again at the time we write these lines, it is not clear that there is enough computer memory in the whole world to achieve this. (And waiting a few years until there is enough memory is unwise: the Web will grow too and will probably grow faster.)

With adjacency lists, representing the World Wide Web becomes feasible: there are only a few dozen billion hyperlinks in the Web, and each will occupy a few bytes in the adjacency list. You can carry a device that stores the result, a terabyte or two, in your pocket (it may soon fit in your earring, but by that time the Web will have grown too).

The reason why adjacency lists are so much more effective in the case of the World Wide Web is that the Web is very sparse: the average Web page has hyperlinks to only about half a dozen other pages, out of the billions of possibilities.

3.2 Depth-first search in undirected graphs

3.2.1 Exploring mazes

Depth-first search is a surprisingly versatile linear-time procedure that reveals a wealth of information about a graph. The most basic question it addresses is,

    What parts of the graph are reachable from a given vertex?

[Figure 3.2: Exploring a graph is rather like navigating a maze.]

To understand this task, try putting yourself in the position of a computer that has just been given a new graph, say in the form of an adjacency list. This representation offers just one basic operation: finding the neighbors of a vertex. With only this primitive, the reachability problem is rather like exploring a labyrinth (Figure 3.2).
You start walking from a fixed place and whenever you arrive at any junction (vertex) there are a variety of passages (edges) you can follow. A careless choice of passages might lead you around in circles or might cause you to overlook some accessible part of the maze. Clearly, you need to record some intermediate information during exploration.

This classic challenge has amused people for centuries. Everybody knows that all you need to explore a labyrinth is a ball of string and a piece of chalk. The chalk prevents looping, by marking the junctions you have already visited. The string always takes you back to the starting place, enabling you to return to passages that you previously saw but did not yet investigate.

How can we simulate these two primitives, chalk and string, on a computer? The chalk marks are easy: for each vertex, maintain a Boolean variable indicating whether it has been visited already. As for the ball of string, the correct cyberanalog is a stack. After all, the exact role of the string is to offer two primitive operations: unwind to get to a new junction (the stack equivalent is to push the new vertex) and rewind to return to the previous junction (pop the stack).

Instead of explicitly maintaining a stack, we will do so implicitly via recursion (which is implemented using a stack of activation records). The resulting algorithm is shown in Figure 3.3.¹ The previsit and postvisit procedures are optional, meant for performing operations on a vertex when it is first discovered and also when it is being left for the last time. We will soon see some creative uses for them.

¹ As with many of our graph algorithms, this one applies to both undirected and directed graphs. In such cases, we adopt the directed notation for edges, $(x, y)$. If the graph is undirected, then each of its edges should be thought of as existing in both directions: $(x, y)$ and $(y, x)$.

Figure 3.3 Finding all nodes reachable from a particular node.

    procedure explore(G, v)
    Input:  G = (V, E) is a graph; v ∈ V
    Output: visited(u) is set to true for all nodes u reachable from v

    visited(v) = true
    previsit(v)
    for each edge (v, u) ∈ E:
        if not visited(u): explore(u)
    postvisit(v)

More immediately, we need to confirm that explore always works correctly. It certainly does not venture too far, because it only moves from nodes to their neighbors and can therefore never jump to a region that is not reachable from $v$. But does it find all vertices reachable from $v$? Well, if there is some $u$ that it misses, choose any path from $v$ to $u$, and look at the last vertex on that path that the procedure actually visited. Call this node $z$, and let $w$ be the node immediately after it on the same path.

[Figure: a path from $v$ to $u$ that passes through the consecutive nodes $z$ and $w$.]

So $z$ was visited but $w$ was not. This is a contradiction: while the explore procedure was at node $z$, it would have noticed $w$ and moved on to it.

Incidentally, this pattern of reasoning arises often in the study of graphs and is in essence a streamlined induction. A more formal inductive proof would start by framing a hypothesis, such as “for any $k \ge 0$, all nodes within $k$ hops from $v$ get visited.” The base case is as usual trivial, since $v$ is certainly visited. And the general case (showing that if all nodes $k$ hops away are visited, then so are all nodes $k + 1$ hops away) is precisely the same point we just argued.

Figure 3.4 shows the result of running explore on our earlier example graph, starting at node $A$, and breaking ties in alphabetical order whenever there is a choice of nodes to visit. The solid edges are those that were actually traversed, each of which was elicited by a call to explore and led to the discovery of a new vertex. For instance, while $B$ was being visited, the edge $B - E$ was noticed and, since $E$ was as yet unknown, was traversed via a call to explore($E$). These solid edges form a tree (a connected graph with no cycles) and are therefore called tree edges.
The dotted edges were ignored because they led back to familiar terrain, to vertices previously visited. They are called back edges.

[Figure 3.4: The result of explore(A) on the graph of Figure 3.2.]

Figure 3.5 Depth-first search.

    procedure dfs(G)

    for all v ∈ V:
        visited(v) = false
    for all v ∈ V:
        if not visited(v): explore(v)

3.2.2 Depth-first search

The explore procedure visits only the portion of the graph reachable from its starting point. To examine the rest of the graph, we need to restart the procedure elsewhere, at some vertex that has not yet been visited. The algorithm of Figure 3.5, called depth-first search (DFS), does this repeatedly until the entire graph has been traversed.

The first step in analyzing the running time of DFS is to observe that each vertex is explore’d just once, thanks to the visited array (the chalk marks). During the exploration of a vertex, there are the following steps:

1. Some fixed amount of work: marking the spot as visited, and the pre/postvisit.
2. A loop in which adjacent edges are scanned, to see if they lead somewhere new.

This loop takes a different amount of time for each vertex, so let’s consider all vertices together. The total work done in step 1 is then $O(|V|)$. In step 2, over the course of the entire DFS, each edge $\{x, y\} \in E$ is examined exactly twice, once during explore($x$) and once during explore($y$). The overall time for step 2 is therefore $O(|E|)$ and so the depth-first search has a running time of $O(|V| + |E|)$, linear in the size of its input. This is as efficient as we could possibly hope for, since it takes this long even just to read the adjacency list.

[Figure 3.6: (a) A 12-node graph. (b) DFS search forest.]
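The procedures of Figures 3.3 and 3.5 translate almost line for line into Python. The sketch below assumes the graph is given as a dictionary mapping each vertex to a list of its neighbors (one possible adjacency-list encoding), keeps the chalk marks in a set, and takes previsit and postvisit as optional callbacks. The dictionary format, the callback parameters, the returned list of starting points, and the sample graph are illustrative choices, not part of the text.

```python
def explore(graph, v, visited, previsit=None, postvisit=None):
    """Visit every vertex reachable from v that has not been visited yet."""
    visited.add(v)                      # the chalk mark
    if previsit is not None:
        previsit(v)
    for u in graph[v]:                  # scan the edges (v, u)
        if u not in visited:
            explore(graph, u, visited, previsit, postvisit)
    if postvisit is not None:
        postvisit(v)


def dfs(graph, previsit=None, postvisit=None):
    """Run explore from every unvisited vertex; return the starting points used."""
    visited = set()
    roots = []
    for v in graph:
        if v not in visited:
            roots.append(v)
            explore(graph, v, visited, previsit, postvisit)
    return roots


# A made-up undirected graph with three connected pieces.
graph = {
    "A": ["B", "C"], "B": ["A", "C"], "C": ["A", "B"],
    "D": ["E"], "E": ["D"],
    "F": [],
}
order = []
roots = dfs(graph, previsit=order.append)
```

Because the recursion plays the role of the string, very long paths can exceed Python's default recursion limit; an explicit stack would avoid that at the cost of some clarity.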
Figure 3.6 shows the outcome of depth-first search on a 12-node graph, once again breaking ties alphabetically (ignore the pairs of numbers for the time being). The outer loop of DFS calls explore three times, on $A$, $C$, and finally $F$. As a result, there are three trees, each rooted at one of these starting points. Together they constitute a forest.

3.2.3 Connectivity in undirected graphs

An undirected graph is connected if there is a path between any pair of vertices. The graph of Figure 3.6 is not connected because, for instance, there is no path from $A$ to $K$. However, it does have three disjoint connected regions, corresponding to the following sets of vertices:

    {A, B, E, I, J}    {C, D, G, H, K, L}    {F}

These regions are called connected components: each of them is a subgraph that is internally connected but has no edges to the remaining vertices. When explore is started at a particular vertex, it identifies precisely the connected component containing that vertex. And each time the DFS outer loop calls explore, a new connected component is picked out.

Thus depth-first search is trivially adapted to check if a graph is connected and, more generally, to assign each node $v$ an integer ccnum[$v$] identifying the connected component to which it belongs. All it takes is

    procedure previsit(v)
    ccnum[v] = cc

where cc needs to be initialized to zero and to be incremented each time the DFS procedure calls explore.

3.2.4 Previsit and postvisit orderings

We have seen how depth-first search (a few unassuming lines of code) is able to uncover the connectivity structure of an undirected graph in just linear time. But it is far more versatile than this. In order to stretch it further, we will collect a little more information during the exploration process: for each node, we will note down the times of two important events, the moment of first discovery (corresponding to previsit) and that of final departure (postvisit).
Figure 3.6 shows these numbers for our earlier example, in which there are 24 events. The fifth event is the discovery of $I$. The 21st event consists of leaving $D$ behind for good.

One way to generate arrays pre and post with these numbers is to define a simple counter clock, initially set to 1, which gets updated as follows.

    procedure previsit(v)
    pre[v] = clock
    clock = clock + 1

    procedure postvisit(v)
    post[v] = clock
    clock = clock + 1

These timings will soon take on larger significance. Meanwhile, you might have noticed from Figure 3.6 that:

Property For any nodes $u$ and $v$, the two intervals [pre($u$), post($u$)] and [pre($v$), post($v$)] are either disjoint or one is contained within the other.

Why? Because [pre($u$), post($u$)] is essentially the time during which vertex $u$ was on the stack. The last-in, first-out behavior of a stack explains the rest.

3.3 Depth-first search in directed graphs

3.3.1 Types of edges

Our depth-first search algorithm can be run verbatim on directed graphs, taking care to traverse edges only in their prescribed directions. Figure 3.7 shows an example and the search tree that results when vertices are considered in lexicographic order.

[Figure 3.7: DFS on a directed graph.]

In further analyzing the directed case, it helps to have terminology for important relationships between nodes of a tree. $A$ is the root of the search tree; everything else is its descendant. Similarly, $E$ has descendants $F$, $G$, and $H$, and conversely, is an ancestor of these three nodes. The family analogy is carried further: $C$ is the parent of $D$, which is its child.

For undirected graphs we distinguished between tree edges and nontree edges. In the directed case, there is a slightly more elaborate taxonomy:

Tree edges are actually part of the DFS forest.

Forward edges lead from a node to a nonchild descendant in the DFS tree.

Back edges lead to an ancestor in the DFS tree.
Cross edges lead to neither descendant nor ancestor; they therefore lead to a node that has already been completely explored (that is, already postvisited).

[Diagram of edge types: a DFS tree on nodes A, B, C, D with one example each of a tree, forward, back, and cross edge.]

Figure 3.7 has two forward edges, two back edges, and two cross edges. Can you spot them?

Ancestor and descendant relationships, as well as edge types, can be read off directly from pre and post numbers. Because of the depth-first exploration strategy, vertex $u$ is an ancestor of vertex $v$ exactly in those cases where $u$ is discovered first and $v$ is discovered during explore($u$). This is to say pre($u$) < pre($v$) < post($v$) < post($u$), which we can depict pictorially as two nested intervals:

    [u  [v  ]v  ]u

Here [x marks pre(x) and ]x marks post(x). The case of descendants is symmetric, since $u$ is a descendant of $v$ if and only if $v$ is an ancestor of $u$. And since edge categories are based entirely on ancestor-descendant relationships, it follows that they, too, can be read off from pre and post numbers. Here is a summary of the various possibilities for an edge $(u, v)$:

    pre/post ordering for (u, v)    Edge type
    [u  [v  ]v  ]u                  Tree/forward
    [v  [u  ]u  ]v                  Back
    [v  ]v  [u  ]u                  Cross

You can confirm each of these characterizations by consulting the diagram of edge types. Do you see why no other orderings are possible?

3.3.2 Directed acyclic graphs

A cycle in a directed graph is a circular path $v_0 \to v_1 \to v_2 \to \cdots \to v_k \to v_0$. Figure 3.7 has quite a few of them, for example, $B \to E \to F \to B$. A graph without cycles is acyclic. It turns out we can test for acyclicity in linear time, with a single depth-first search.

Property A directed graph has a cycle if and only if its depth-first search reveals a back edge.

Proof. One direction is quite easy: if $(u, v)$ is a back edge, then there is a cycle consisting of this edge together with the path from $v$ to $u$ in the search tree.

Conversely, if the graph has a cycle $v_0 \to v_1 \to \cdots \to v_k \to v_0$, look at the first node on this cycle to be discovered (the node with the lowest pre number). Suppose it is $v_i$.
All the other $v_j$ on the cycle are reachable from it and will therefore be its descendants in the search tree. In particular, the edge $v_{i-1} \to v_i$ (or $v_k \to v_0$ if $i = 0$) leads from a node to its ancestor and is thus by definition a back edge.

Directed acyclic graphs, or dags for short, come up all the time. They are good for modeling relations like causalities, hierarchies, and temporal dependencies. For example, suppose that you need to perform many tasks, but some of them cannot begin until certain others are completed (you have to wake up before you can get out of bed; you have to be out of bed, but not yet dressed, to take a shower; and so on). The question then is, what is a valid order in which to perform the tasks?

Such constraints are conveniently represented by a directed graph in which each task is a node, and there is an edge from $u$ to $v$ if $u$ is a precondition for $v$. In other words, before performing a task, all the tasks pointing to it must be completed. If this graph has a cycle, there is no hope: no ordering can possibly work. If on the other hand the graph is a dag, we would like if possible to linearize (or topologically sort) it, to order the vertices one after the other in such a way that each edge goes from an earlier vertex to a later vertex, so that all precedence constraints are satisfied. In Figure 3.8, for instance, one valid ordering is $B, A, D, C, E, F$. (Can you spot the other three?)

[Figure 3.8: A directed acyclic graph with one source, two sinks, and four possible linearizations.]

What types of dags can be linearized? Simple: All of them. And once again depth-first search tells us exactly how to do it: simply perform tasks in decreasing order of their post numbers. After all, the only edges $(u, v)$ in a graph for which post($u$) [...]

[...] 1, define $p^k(v) = p^{k-1}(p(v))$ and $p^1(v) = p(v)$ (so $p^k(v)$ is the $k$th ancestor of $v$). Each vertex $v$ of the tree has an associated non-negative integer label $l(v)$.
Give a linear-time algorithm to update the labels of all the vertices in $T$ according to the following rule: $l_{\text{new}}(v) = l(p^{l(v)}(v))$.

3.21. Give a linear-time algorithm to find an odd-length cycle in a directed graph. (Hint: First solve this problem under the assumption that the graph is strongly connected.)

3.22. Give an efficient algorithm which takes as input a directed graph $G = (V, E)$, and determines whether or not there is a vertex $s \in V$ from which all other vertices are reachable.

3.23. Give an efficient algorithm that takes as input a directed acyclic graph $G = (V, E)$, and two vertices $s, t \in V$, and outputs the number of different directed paths from $s$ to $t$ in $G$.

3.24. Give a linear-time algorithm for the following task.
Input: A directed acyclic graph $G$
Question: Does $G$ contain a directed path that touches every vertex exactly once?

3.25. You are given a directed graph in which each node $u \in V$ has an associated price $p_u$ which is a positive integer. Define the array cost as follows: for each $u \in V$,

    cost[u] = price of the cheapest node reachable from u (including u itself).

For instance, in the graph below (with prices shown for each vertex), the cost values of the nodes $A, B, C, D, E, F$ are 2, 1, 4, 1, 4, 5, respectively.

[Figure: a directed graph on vertices A, B, C, D, E, F with a price on each vertex.]

Your goal is to design an algorithm that fills in the entire cost array (i.e., for all vertices).

(a) Give a linear-time algorithm that works for directed acyclic graphs. (Hint: Handle the vertices in a particular order.)

(b) Extend this to a linear-time algorithm that works for all directed graphs. (Hint: Recall the “two-tiered” structure of directed graphs.)

3.26. An Eulerian tour in an undirected graph is a cycle that is allowed to pass through each vertex multiple times, but must use each edge exactly once. This simple concept was used by Euler in 1736 to solve the famous Königsberg bridge problem, which launched the field of graph theory.
The city of Königsberg (now called Kaliningrad, in western Russia) is the meeting point of two rivers with a small island in the middle. There are seven bridges across the rivers, and a popular recreational question of the time was to determine whether it is possible to perform a tour in which each bridge is crossed exactly once. Euler formulated the relevant information as a graph with four nodes (denoting land masses) and seven edges (denoting bridges), as shown here.

[Figure: the Königsberg bridge graph, with nodes Northern bank, Southern bank, Small island, and Big island.]

Notice an unusual feature of this problem: multiple edges between certain pairs of nodes.

(a) Show that an undirected graph has an Eulerian tour if and only if all its vertices have even degree. Conclude that there is no Eulerian tour of the Königsberg bridges.

(b) An Eulerian path is a path which uses each edge exactly once. Can you give a similar if-and-only-if characterization of which undirected graphs have Eulerian paths?

(c) Can you give an analog of part (a) for directed graphs?

3.27. Two paths in a graph are called edge-disjoint if they have no edges in common. Show that in any undirected graph, it is possible to pair up the vertices of odd degree and find paths between each such pair so that all these paths are edge-disjoint.

3.28. In the 2SAT problem, you are given a set of clauses, where each clause is the disjunction (OR) of two literals (a literal is a Boolean variable or the negation of a Boolean variable). You are looking for a way to assign a value true or false to each of the variables so that all clauses are satisfied, that is, there is at least one true literal in each clause. For example, here’s an instance of 2SAT:

$$(x_1 \vee \overline{x_2}) \wedge (\overline{x_1} \vee \overline{x_3}) \wedge (x_1 \vee x_2) \wedge (\overline{x_3} \vee x_4) \wedge (\overline{x_1} \vee x_4).$$

This instance has a satisfying assignment: set $x_1$, $x_2$, $x_3$, and $x_4$ to true, false, false, and true, respectively.

(a) Are there other satisfying truth assignments of this 2SAT formula? If so, find them all.
(b) Give an instance of 2SAT with four variables, and with no satisfying assignment.

The purpose of this problem is to lead you to a way of solving 2SAT efficiently by reducing it to the problem of finding the strongly connected components of a directed graph. Given an instance $I$ of 2SAT with $n$ variables and $m$ clauses, construct a directed graph $G_I = (V, E)$ as follows.

$G_I$ has $2n$ nodes, one for each variable and its negation.

$G_I$ has $2m$ edges: for each clause $(\alpha \vee \beta)$ of $I$ (where $\alpha, \beta$ are literals), $G_I$ has an edge from the negation of $\alpha$ to $\beta$, and one from the negation of $\beta$ to $\alpha$.

Note that the clause $(\alpha \vee \beta)$ is equivalent to either of the implications $\overline{\alpha} \Rightarrow \beta$ or $\overline{\beta} \Rightarrow \alpha$. In this sense, $G_I$ records all implications in $I$.

(c) Carry out this construction for the instance of 2SAT given above, and for the instance you constructed in (b).

(d) Show that if $G_I$ has a strongly connected component containing both $x$ and $\overline{x}$ for some variable $x$, then $I$ has no satisfying assignment.

(e) Now show the converse of (d): namely, that if none of $G_I$’s strongly connected components contain both a literal and its negation, then the instance $I$ must be satisfiable. (Hint: Assign values to the variables as follows: repeatedly pick a sink strongly connected component of $G_I$. Assign value true to all literals in the sink, assign false to their negations, and delete all of these. Show that this ends up discovering a satisfying assignment.)

(f) Conclude that there is a linear-time algorithm for solving 2SAT.

3.29. Let $S$ be a finite set. A binary relation on $S$ is simply a collection $R$ of ordered pairs $(x, y) \in S \times S$. For instance, $S$ might be a set of people, and each such pair $(x, y) \in R$ might mean “$x$ knows $y$.”
An equivalence relation is a binary relation which satisfies three properties:

Reflexivity: $(x, x) \in R$ for all $x \in S$
Symmetry: if $(x, y) \in R$ then $(y, x) \in R$
Transitivity: if $(x, y) \in R$ and $(y, z) \in R$ then $(x, z) \in R$

For instance, the binary relation “has the same birthday as” is an equivalence relation, whereas “is the father of” is not, since it violates all three properties.

Show that an equivalence relation partitions set $S$ into disjoint groups $S_1, S_2, \ldots, S_k$ (in other words, $S = S_1 \cup S_2 \cup \cdots \cup S_k$ and $S_i \cap S_j = \emptyset$ for all $i \ne j$) such that:

Any two members of a group are related, that is, $(x, y) \in R$ for any $x, y \in S_i$, for any $i$.

Members of different groups are not related, that is, for all $i \ne j$, for all $x \in S_i$ and $y \in S_j$, we have $(x, y) \notin R$.

(Hint: Represent an equivalence relation by an undirected graph.)

3.30. On page 102, we defined the binary relation “connected” on the set of vertices of a directed graph. Show that this is an equivalence relation (see Exercise 3.29), and conclude that it partitions the vertices into disjoint strongly connected components.

3.31. Biconnected components. Let $G = (V, E)$ be an undirected graph. For any two edges $e, e' \in E$, we’ll say $e \sim e'$ if either $e = e'$ or there is a (simple) cycle containing both $e$ and $e'$.

(a) Show that $\sim$ is an equivalence relation (recall Exercise 3.29) on the edges.

The equivalence classes into which this relation partitions the edges are called the biconnected components of $G$. A bridge is an edge which is in a biconnected component all by itself. A separating vertex is a vertex whose removal disconnects the graph.

(b) Partition the edges of the graph below into biconnected components, and identify the bridges and separating vertices.

[Figure: an undirected graph on vertices A through O.]

Not only do biconnected components partition the edges of the graph, they also almost partition the vertices in the following sense.

(c) Associate with each biconnected component all the vertices that are endpoints of its edges.
Show that the vertices corresponding to two different biconnected components are either disjoint or intersect in a single separating vertex.

(d) Collapse each biconnected component into a single meta-node, and retain individual nodes for each separating vertex. (So there are edges between each component-node and its separating vertices.) Show that the resulting graph is a tree.

DFS can be used to identify the biconnected components, bridges, and separating vertices of a graph in linear time.

(e) Show that the root of the DFS tree is a separating vertex if and only if it has more than one child in the tree.

(f) Show that a non-root vertex $v$ of the DFS tree is a separating vertex if and only if it has a child $v'$ none of whose descendants (including itself) has a backedge to a proper ancestor of $v$.

(g) For each vertex $u$ define:

$$\mathrm{low}(u) = \min \begin{cases} \mathrm{pre}(u) \\ \mathrm{pre}(w) & \text{where } (v, w) \text{ is a backedge for some descendant } v \text{ of } u \end{cases}$$

Show that the entire array of low values can be computed in linear time.

(h) Show how to compute all separating vertices, bridges, and biconnected components of a graph in linear time. (Hint: Use low to identify separating vertices, and run another DFS with an extra stack of edges to remove biconnected components one at a time.)

Chapter 4
Paths in graphs

4.1 Distances

Depth-first search readily identifies all the vertices of a graph that can be reached from a designated starting point. It also finds explicit paths to these vertices, summarized in its search tree (Figure 4.1). However, these paths might not be the most economical ones possible. In the figure, vertex $C$ is reachable from $S$ by traversing just one edge, while the DFS tree shows a path of length 3. This chapter is about algorithms for finding shortest paths in graphs.

Path lengths allow us to talk quantitatively about the extent to which different vertices of a graph are separated from each other:

    The distance between two nodes is the length of the shortest path between them.
To get a concrete feel for this notion, consider a physical realization of a graph that has a ball for each vertex and a piece of string for each edge. If you lift the ball for vertex $s$ high enough, the other balls that get pulled up along with it are precisely the vertices reachable from $s$. And to find their distances from $s$, you need only measure how far below $s$ they hang.

[Figure 4.1: (a) A simple graph and (b) its depth-first search tree.]

[Figure 4.2: A physical model of a graph.]

In Figure 4.2 for example, vertex $B$ is at distance 2 from $S$, and there are two shortest paths to it. When $S$ is held up, the strings along each of these paths become taut. On the other hand, edge $(D, E)$ plays no role in any shortest path and therefore remains slack.

4.2 Breadth-first search

In Figure 4.2, the lifting of $s$ partitions the graph into layers: $s$ itself, the nodes at distance 1 from it, the nodes at distance 2 from it, and so on. A convenient way to compute distances from $s$ to the other vertices is to proceed layer by layer. Once we have picked out the nodes at distance $0, 1, 2, \ldots, d$, the ones at $d + 1$ are easily determined: they are precisely the as-yet-unseen nodes that are adjacent to the layer at distance $d$. This suggests an iterative algorithm in which two layers are active at any given time: some layer $d$, which has been fully identified, and $d + 1$, which is being discovered by scanning the neighbors of layer $d$.

Breadth-first search (BFS) directly implements this simple reasoning (Figure 4.3). Initially the queue $Q$ consists only of $s$, the one node at distance 0. And for each subsequent distance $d = 1, 2, 3, \ldots$, there is a point in time at which $Q$ contains all the nodes at distance $d$ and nothing else. As these nodes are processed (ejected off the front of the queue), their as-yet-unseen neighbors are injected into the end of the queue.
Let's try out this algorithm on our earlier example (Figure 4.1) to confirm that it does the right thing. If $S$ is the starting point and the nodes are ordered alphabetically, they get visited in the sequence shown in Figure 4.4. The breadth-first search tree, on the right, contains the edges through which each node is initially discovered. Unlike the DFS tree we saw earlier, it has the property that all its paths from $S$ are the shortest possible. It is therefore a shortest-path tree.

Correctness and efficiency

We have developed the basic intuition behind breadth-first search. In order to check that the algorithm works correctly, we need to make sure that it faithfully executes this intuition. What we expect, precisely, is that

    For each $d = 0, 1, 2, \ldots$, there is a moment at which (1) all nodes at distance $\le d$ from $s$ have their distances correctly set; (2) all other nodes have their distances set to $\infty$; and (3) the queue contains exactly the nodes at distance $d$.

S. Dasgupta, C.H. Papadimitriou, and U.V. Vazirani 117

Figure 4.3 Breadth-first search.

    procedure bfs(G, s)
    Input:  Graph G = (V, E), directed or undirected; vertex s ∈ V
    Output: For all vertices u reachable from s, dist(u) is set
            to the distance from s to u.

    for all u ∈ V:
        dist(u) = ∞

    dist(s) = 0
    Q = [s]  (queue containing just s)
    while Q is not empty:
        u = eject(Q)
        for all edges (u, v) ∈ E:
            if dist(v) = ∞:
                inject(Q, v)
                dist(v) = dist(u) + 1

This has been phrased with an inductive argument in mind. We have already discussed both the base case and the inductive step. Can you fill in the details?

The overall running time of this algorithm is linear, $O(|V| + |E|)$, for exactly the same reasons as depth-first search. Each vertex is put on the queue exactly once, when it is first encountered, so there are $2|V|$ queue operations. The rest of the work is done in the algorithm's innermost loop. Over the course of execution, this loop looks at each edge once (in directed graphs) or twice (in undirected graphs), and therefore takes $O(|E|)$ time.
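As a concrete companion to the pseudocode of Figure 4.3, here is a minimal Python sketch. It assumes the graph is given as a dictionary mapping each vertex to a list of its neighbors; that representation, and the use of `collections.deque` as the queue, are choices made here for illustration.

```python
from collections import deque
from math import inf


def bfs(graph, s):
    """Return a dict of distances (counted in edges) from s.

    graph: dict mapping every vertex to a list of its neighbors
           (an assumed representation). Unreachable vertices keep distance inf.
    """
    dist = {u: inf for u in graph}
    dist[s] = 0
    queue = deque([s])               # queue containing just s
    while queue:
        u = queue.popleft()          # eject from the front
        for v in graph[u]:
            if dist[v] == inf:       # v has not been seen yet
                dist[v] = dist[u] + 1
                queue.append(v)      # inject at the end
    return dist


# A small hypothetical graph; "Z" is deliberately unreachable from "S".
example = {
    "S": ["A", "C", "D", "E"],
    "A": ["B", "S"],
    "B": ["A", "C"],
    "C": ["B", "S"],
    "D": ["E", "S"],
    "E": ["D", "S"],
    "Z": [],
}
print(bfs(example, "S"))
```

Each vertex enters and leaves the deque at most once, and each adjacency list is scanned once, which mirrors the $O(|V| + |E|)$ accounting above.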
Now that we have both BFS and DFS before us: how do their exploration styles compare? Depth-first search makes deep incursions into a graph, retreating only when it runs out of new nodes to visit. This strategy gives it the wonderful, subtle, and extremely useful properties we saw in Chapter 3. But it also means that DFS can end up taking a long and convoluted route to a vertex that is actually very close by, as in Figure 4.1. Breadth-first search makes sure to visit vertices in increasing order of their distance from the starting point. This is a broader, shallower search, rather like the propagation of a wave upon water. And it is achieved using almost exactly the same code as DFS but with a queue in place of a stack.

Also notice one stylistic difference from DFS: since we are only interested in distances from $s$, we do not restart the search in other connected components. Nodes not reachable from $s$ are simply ignored.

Figure 4.4 The result of breadth-first search on the graph of Figure 4.1.

    Order of visitation    Queue contents after processing node
                           [S]
    S                      [A C D E]
    A                      [C D E B]
    C                      [D E B]
    D                      [E B]
    E                      [B]
    B                      [ ]

Figure 4.5 Edge lengths often matter.
[Figure: a highway map with labeled edge lengths connecting San Francisco, Sacramento, Reno, Bakersfield, Los Angeles, and Las Vegas.]

4.3 Lengths on edges

Breadth-first search treats all edges as having the same length. This is rarely true in applications where shortest paths are to be found. For instance, suppose you are driving from San Francisco to Las Vegas, and want to find the quickest route. Figure 4.5 shows the major highways you might conceivably use. Picking the right combination of them is a shortest-path problem in which the length of each edge (each stretch of highway) is important. For the remainder of this chapter, we will deal with this more general scenario, annotating every edge $e \in E$ with a length $l_e$. If $e = (u, v)$, we will sometimes also write $l(u, v)$ or $l_{uv}$.

These $l_e$'s do not have to correspond to physical lengths.
They could denote time (driving time between cities) or money (cost of taking a bus), or any other quantity that we would like to conserve. In fact, there are cases in which we need to use negative lengths, but we will briefly overlook this particular complication.

Figure 4.6 Breaking edges into unit-length pieces.

4.4 Dijkstra's algorithm

4.4.1 An adaptation of breadth-first search

Breadth-first search finds shortest paths in any graph whose edges have unit length. Can we adapt it to a more general graph $G = (V, E)$ whose edge lengths $l_e$ are positive integers?

A more convenient graph

Here is a simple trick for converting $G$ into something BFS can handle: break $G$'s long edges into unit-length pieces, by introducing "dummy" nodes. Figure 4.6 shows an example of this transformation. To construct the new graph $G'$,

    For any edge $e = (u, v)$ of $E$, replace it by $l_e$ edges of length 1, by adding $l_e - 1$ dummy nodes between $u$ and $v$.

Graph $G'$ contains all the vertices $V$ that interest us, and the distances between them are exactly the same as in $G$. Most importantly, the edges of $G'$ all have unit length. Therefore, we can compute distances in $G$ by running BFS on $G'$.

Alarm clocks

If efficiency were not an issue, we could stop here. But when $G$ has very long edges, the $G'$ it engenders is thickly populated with dummy nodes, and the BFS spends most of its time diligently computing distances to these nodes that we don't care about at all.

To see this more concretely, consider the graphs $G$ and $G'$ of Figure 4.7, and imagine that the BFS, started at node $s$ of $G'$, advances by one unit of distance per minute. For the first 99 minutes it tediously progresses along $S - A$ and $S - B$, an endless desert of dummy nodes. Is there some way we can snooze through these boring phases and have an alarm wake us up whenever something interesting is happening, specifically, whenever one of the real nodes (from the original graph $G$) is reached?
We do this by setting two alarms at the outset, one for node $A$, set to go off at time $T = 100$, and one for $B$, at time $T = 200$. These are estimated times of arrival, based upon the edges currently being traversed. We doze off and awake at $T = 100$ to find $A$ has been discovered. At this point, the estimated time of arrival for $B$ is adjusted to $T = 150$ and we change its alarm accordingly.

More generally, at any given moment the breadth-first search is advancing along certain edges of $G$, and there is an alarm for every endpoint node toward which it is moving, set to go off at the estimated time of arrival at that node. Some of these might be overestimates because BFS may later find shortcuts, as a result of future arrivals elsewhere. In the preceding example, a quicker route to $B$ was revealed upon arrival at $A$. However, nothing interesting can possibly happen before an alarm goes off. The sounding of the next alarm must therefore signal the arrival of the wavefront to a real node $u \in V$ by BFS. At that point, BFS might also start advancing along some new edges out of $u$, and alarms need to be set for their endpoints.

The following "alarm clock algorithm" faithfully simulates the execution of BFS on $G'$.

    • Set an alarm clock for node $s$ at time 0.
    • Repeat until there are no more alarms:
      Say the next alarm goes off at time $T$, for node $u$. Then:
        - The distance from $s$ to $u$ is $T$.
        - For each neighbor $v$ of $u$ in $G$:
            * If there is no alarm yet for $v$, set one for time $T + l(u, v)$.
            * If $v$'s alarm is set for later than $T + l(u, v)$, then reset it to this earlier time.

Dijkstra's algorithm. The alarm clock algorithm computes distances in any graph with positive integral edge lengths. It is almost ready for use, except that we need to somehow implement the system of alarms.
The right data structure for this job is a priority queue (usually implemented via a heap), which maintains a set of elements (nodes) with associated numeric key values (alarm times) and supports the following operations:

    Insert. Add a new element to the set.
    Decrease-key. Accommodate the decrease in key value of a particular element.¹
    Delete-min. Return the element with the smallest key, and remove it from the set.
    Make-queue. Build a priority queue out of the given elements, with the given key values. (In many implementations, this is significantly faster than inserting the elements one by one.)

¹ The name decrease-key is standard but is a little misleading: the priority queue typically does not itself change key values. What this procedure really does is to notify the queue that a certain key value has been decreased.

Figure 4.7 BFS on $G'$ is mostly uneventful. The dotted lines show some early wavefronts.
[Figure: graph $G$ on nodes $S$, $A$, $B$ with edge lengths 100, 200, and 50, alongside the corresponding $G'$.]

The first two let us set alarms, and the third tells us which alarm is next to go off. Putting this all together, we get Dijkstra's algorithm (Figure 4.8).

In the code, dist(u) refers to the current alarm clock setting for node $u$. A value of $\infty$ means the alarm hasn't so far been set. There is also a special array, prev, that holds one crucial piece of information for each node $u$: the identity of the node immediately before it on the shortest path from $s$ to $u$. By following these back-pointers, we can easily reconstruct shortest paths, and so this array is a compact summary of all the paths found. A full example of the algorithm's operation, along with the final shortest-path tree, is shown in Figure 4.9.

In summary, we can think of Dijkstra's algorithm as just BFS, except it uses a priority queue instead of a regular queue, so as to prioritize nodes in a way that takes edge lengths into account.
This viewpoint gives a concrete appreciation of how and why the algorithm works, but there is a more direct, more abstract derivation that doesn't depend upon BFS at all. We now start from scratch with this complementary interpretation.

Figure 4.8 Dijkstra's shortest-path algorithm.

    procedure dijkstra(G, l, s)
    Input:  Graph G = (V, E), directed or undirected;
            positive edge lengths {l_e : e ∈ E}; vertex s ∈ V
    Output: For all vertices u reachable from s, dist(u) is set
            to the distance from s to u.

    for all u ∈ V:
        dist(u) = ∞
        prev(u) = nil
    dist(s) = 0

    H = makequeue(V)  (using dist-values as keys)
    while H is not empty:
        u = deletemin(H)
        for all edges (u, v) ∈ E:
            if dist(v) > dist(u) + l(u, v):
                dist(v) = dist(u) + l(u, v)
                prev(v) = u
                decreasekey(H, v)

Figure 4.9 A complete run of Dijkstra's algorithm, with node $A$ as the starting point. Also shown are the associated dist values and the final shortest-path tree.

    Stage    A    B    C    D    E
    1        0    4    2    ∞    ∞
    2        0    3    2    6    7
    3        0    3    2    5    6
    4        0    3    2    5    6

Figure 4.10 Single-edge extensions of known shortest paths.
[Figure: the known region $R$ containing $s$ and $u$, with an edge from $u$ to a node $v$ outside $R$.]

4.4.2 An alternative derivation

Here's a plan for computing shortest paths: expand outward from the starting point $s$, steadily growing the region of the graph to which distances and shortest paths are known. This growth should be orderly, first incorporating the closest nodes and then moving on to those further away. More precisely, when the "known region" is some subset of vertices $R$ that includes $s$, the next addition to it should be the node outside $R$ that is closest to $s$.
Let us call this node $v$; the question is: how do we identify it?

To answer, consider $u$, the node just before $v$ in the shortest path from $s$ to $v$:

$$s \leadsto u \to v$$

Since we are assuming that all edge lengths are positive, $u$ must be closer to $s$ than $v$ is. This means that $u$ is in $R$; otherwise it would contradict $v$'s status as the closest node to $s$ outside $R$. So, the shortest path from $s$ to $v$ is simply a known shortest path extended by a single edge.

But there will typically be many single-edge extensions of the currently known shortest paths (Figure 4.10); which of these identifies $v$? The answer is, the shortest of these extended paths. Because, if an even shorter single-edge-extended path existed, this would once more contradict $v$'s status as the node outside $R$ closest to $s$. So, it's easy to find $v$: it is the node outside $R$ for which the smallest value of $\mathrm{distance}(s, u) + l(u, v)$ is attained, as $u$ ranges over $R$. In other words, try all single-edge extensions of the currently known shortest paths, find the shortest such extended path, and proclaim its endpoint to be the next node of $R$.

We now have an algorithm for growing $R$ by looking at extensions of the current set of shortest paths. Some extra efficiency comes from noticing that on any given iteration, the only new extensions are those involving the node most recently added to region $R$. All other extensions will have been assessed previously and do not need to be recomputed. In the following pseudocode, dist(v) is the length of the currently shortest single-edge-extended path leading to $v$; it is $\infty$ for nodes not adjacent to $R$.

    Initialize dist(s) to 0, other dist(·) values to ∞
    R = { }  (the "known region")
    while R ≠ V:
        Pick the node v ∉ R with smallest dist(·)
        Add v to R
        for all edges (v, z) ∈ E:
            if dist(z) > dist(v) + l(v, z):
                dist(z) = dist(v) + l(v, z)

Incorporating priority queue operations gives us back Dijkstra's algorithm (Figure 4.8).
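For readers who want to run the algorithm, here is a minimal Python sketch under stated assumptions. The graph is assumed to be a dictionary mapping each vertex to a list of `(neighbor, length)` pairs with positive lengths. Python's standard `heapq` module offers no decreasekey operation, so this sketch swaps in a common workaround: it pushes a fresh entry whenever a dist value drops and skips stale entries when they surface. That differs from the makequeue/decreasekey formulation of Figure 4.8, although the distances computed are the same. The helper `path_to` is a hypothetical name introduced here to follow the prev back-pointers.

```python
import heapq
from math import inf


def dijkstra(graph, s):
    """Return (dist, prev) for source s.

    graph: dict u -> list of (v, length) pairs, lengths > 0 (assumed format).
    Vertices must be mutually comparable (e.g. all strings), because heap
    entries are (distance, vertex) tuples and ties fall back to the vertex.
    """
    dist = {u: inf for u in graph}
    prev = {u: None for u in graph}
    dist[s] = 0
    heap = [(0, s)]
    known = set()                        # the "known region" R
    while heap:
        d, u = heapq.heappop(heap)       # plays the role of deletemin
        if u in known:
            continue                     # stale entry left over from an earlier push
        known.add(u)
        for v, length in graph[u]:
            if dist[v] > d + length:
                dist[v] = d + length
                prev[v] = u
                heapq.heappush(heap, (dist[v], v))  # stands in for decreasekey
    return dist, prev


def path_to(prev, t):
    """Follow back-pointers from t to the source (returns [t] if t was never reached)."""
    path = []
    while t is not None:
        path.append(t)
        t = prev[t]
    return path[::-1]


# The three-node example from the alarm clock discussion: S-A 100, S-B 200, A-B 50.
g = {
    "S": [("A", 100), ("B", 200)],
    "A": [("S", 100), ("B", 50)],
    "B": [("S", 200), ("A", 50)],
}
dist, prev = dijkstra(g, "S")
print(dist)               # {'S': 0, 'A': 100, 'B': 150}
print(path_to(prev, "B")) # ['S', 'A', 'B']
```

With lazy deletion the heap can hold up to one entry per edge relaxation, so the bound is $O((|V| + |E|) \log |E|)$, which is $O((|V| + |E|) \log |V|)$ when there are no parallel edges, since then $|E| \le |V|^2$.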
To justify this algorithm formally, we would use a proof by induction, as with breadth-first search. Here's an appropriate inductive hypothesis.

    At the end of each iteration of the while loop, the following conditions hold: (1) there is a value $d$ such that all nodes in $R$ are at distance $\le d$ from $s$ and all nodes outside $R$ are at distance $\ge d$ from $s$, and (2) for every node $u$, the value dist(u) is the length of the shortest path from $s$ to $u$ whose intermediate nodes are constrained to be in $R$ (if no such path exists, the value is $\infty$).

The base case is straightforward (with $d = 0$), and the details of the inductive step can be filled in from the preceding discussion.

4.4.3 Running time

At the level of abstraction of Figure 4.8, Dijkstra's algorithm is structurally identical to breadth-first search. However, it is slower because the priority queue primitives are computationally more demanding than the constant-time eject's and inject's of BFS. Since makequeue takes at most as long as $|V|$ insert operations, we get a total of $|V|$ deletemin and $|V| + |E|$ insert/decreasekey operations. The time needed for these varies by implementation; for instance, a binary heap gives an overall running time of $O((|V| + |E|) \log |V|)$.

Which heap is best?

The running time of Dijkstra's algorithm depends heavily on the priority queue implementation used. Here are the typical choices.

    Implementation    deletemin                    insert/decreasekey         |V| × deletemin + (|V| + |E|) × insert
    Array             $O(|V|)$                     $O(1)$                     $O(|V|^2)$
    Binary heap       $O(\log |V|)$                $O(\log |V|)$              $O((|V| + |E|) \log |V|)$
    d-ary heap        $O(d \log |V| / \log d)$     $O(\log |V| / \log d)$     $O((|V| \cdot d + |E|) \log |V| / \log d)$
    Fibonacci heap    $O(\log |V|)$                $O(1)$ (amortized)         $O(|V| \log |V| + |E|)$

So for instance, even a naive array implementation gives a respectable time complexity of $O(|V|^2)$, whereas with a binary heap we get $O((|V| + |E|) \log |V|)$. Which is preferable?

This depends on whether the graph is sparse (has few edges) or dense (has lots of them).
For all graphs, $|E|$ is less than $|V|^2$. If it is $\Omega(|V|^2)$, then clearly the array implementation is the faster. On the other hand, the binary heap becomes preferable as soon as $|E|$ dips below $|V|^2 / \log |V|$.

The d-ary heap is a generalization of the binary heap (which corresponds to $d = 2$) and leads to a running time that is a function of $d$. The optimal choice is $d \approx |E| / |V|$; in other words, to optimize we must set the degree of the heap to be equal to the average degree of the graph. This works well for both sparse and dense graphs. For very sparse graphs, in which $|E| = O(|V|)$, the running time is $O(|V| \log |V|)$, as good as with a binary heap. For dense graphs, $|E| = \Omega(|V|^2)$ and the running time is $O(|V|^2)$, as good as with a linked list. Finally, for graphs with intermediate density $|E| = |V|^{1+\delta}$, the running time is $O(|E|)$, linear!

The last line in the table gives running times using a sophisticated data structure called a Fibonacci heap. Although its efficiency is impressive, this data structure requires considerably more work to implement than the others, and this tends to dampen its appeal in practice. We will say little about it except to mention a curious feature of its time bounds. Its insert operations take varying amounts of time but are guaranteed to average $O(1)$ over the course of the algorithm. In such situations (one of which we shall encounter in Chapter 5) we say that the amortized cost of heap insert's is $O(1)$.

4.5 Priority queue implementations

4.5.1 Array

The simplest implementation of a priority queue is as an unordered array of key values for all potential elements (the vertices of the graph, in the case of Dijkstra's algorithm). Initially, these values are set to $\infty$.

An insert or decreasekey is fast, because it just involves adjusting a key value, an $O(1)$ operation. To deletemin, on the other hand, requires a linear-time scan of the list.
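To make the trade-off tangible, here is a rough sketch of Dijkstra's algorithm on top of this array-style priority queue. The `dist` dictionary doubles as the array of keys, so decreasekey is a plain assignment and deletemin is a linear scan. The function name `dijkstra_array` and the `(neighbor, length)` adjacency format are assumptions made for this illustration.

```python
from math import inf


def dijkstra_array(graph, s):
    """Dijkstra's algorithm with an unordered-array priority queue.

    graph: dict u -> list of (v, length) pairs, lengths > 0 (assumed format).
    Runs in O(|V|^2) time: |V| linear scans plus O(1) work per edge.
    """
    dist = {u: inf for u in graph}       # key values, initially all infinity
    dist[s] = 0
    remaining = set(graph)               # elements still in the queue
    while remaining:
        u = min(remaining, key=dist.get) # deletemin: linear-time scan
        remaining.remove(u)
        if dist[u] == inf:
            break                        # everything left is unreachable from s
        for v, length in graph[u]:
            if dist[v] > dist[u] + length:
                dist[v] = dist[u] + length   # decreasekey: O(1) adjustment
    return dist


g = {
    "S": [("A", 100), ("B", 200)],
    "A": [("S", 100), ("B", 50)],
    "B": [("S", 200), ("A", 50)],
}
print(dijkstra_array(g, "S"))  # {'S': 0, 'A': 100, 'B': 150}
```

On a dense graph the $|V|$ scans cost no more than reading the edges, which is why this naive version holds its own against a binary heap there.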
4.5.2 Binary heap

Here elements are stored in a complete binary tree, namely, a binary tree in which each level is filled in from left to right, and must be full before the next level is started. In addition, a special ordering constraint is enforced: the key value of any node of the tree is less than or equal to that of its children. In particular, therefore, the root always contains the smallest element. See Figure 4.11(a) for an example.

To insert, place the new element at the bottom of the tree (in the first available position), and let it "bubble up." That is, if it is smaller than its parent, swap the two and repeat (Figure 4.11(b)–(d)). The number of swaps is at most the height of the tree, which is $\lfloor \log_2 n \rfloor$ when there are $n$ elements. A decreasekey is similar, except that the element is already in the tree, so we let it bubble up from its current position.

To deletemin, return the root value. To then remove this element from the heap, take the last node in the tree (in the rightmost position in the bottom row) and place it at the root. Let it "sift down": if it is bigger than either child, swap it with the smaller child and repeat (Figure 4.11(e)–(g)). Again this takes $O(\log n)$ time.

The regularity of a complete binary tree makes it easy to represent using an array. The tree nodes have a natural ordering: row by row, starting at the root and moving left to right within each row. If there are $n$ nodes, this ordering specifies their positions $1, 2, \ldots, n$ within the array. Moving up and down the tree is easily simulated on the array, using the fact that node number $j$ has parent $\lfloor j/2 \rfloor$ and children $2j$ and $2j + 1$ (Exercise 4.16).

4.5.3 d-ary heap

A d-ary heap is identical to a binary heap, except that nodes have $d$ children instead of just two. This reduces the height of a tree with $n$ elements to $\Theta(\log_d n) = \Theta((\log n)/(\log d))$. Inserts are therefore speeded up by a factor of $\Theta(\log d)$. Deletemin operations, however, take a little longer, namely $O(d \log_d n)$ (do you see why?).
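One way to see where these bounds come from is to write the two operations out. The sketch below is an illustration only: the class name `DaryHeap` is hypothetical, it stores bare keys rather than (element, key) pairs, and it uses Python's 0-based indexing, so the parent of position $j$ is $\lfloor (j-1)/d \rfloor$ and its children are $dj + 1, \ldots, dj + d$ rather than the 1-based formulas used in the text. Setting `d = 2` gives the binary heap of Section 4.5.2.

```python
class DaryHeap:
    """A d-ary min-heap of keys, stored level by level in a 0-based list."""

    def __init__(self, d=2):
        self.d = d
        self.a = []

    def insert(self, key):
        """Place key at the bottom and bubble it up: one comparison per level."""
        a, d = self.a, self.d
        a.append(key)
        i = len(a) - 1
        while i > 0 and a[(i - 1) // d] > key:
            a[i] = a[(i - 1) // d]       # pull the larger parent down
            i = (i - 1) // d
        a[i] = key

    def deletemin(self):
        """Remove and return the smallest key (None if the heap is empty)."""
        a, d = self.a, self.d
        if not a:
            return None
        root = a[0]
        last = a.pop()                   # last node in the bottom row
        if a:
            i, n = 0, len(a)
            while True:
                first = d * i + 1        # index of the first child of i
                if first >= n:
                    break                # i is a leaf
                # Scanning up to d children per level is the extra factor of d.
                c = min(range(first, min(first + d, n)), key=a.__getitem__)
                if a[c] >= last:
                    break
                a[i] = a[c]              # move the smallest child up
                i = c
            a[i] = last
        return root


heap = DaryHeap(d=3)
for k in [10, 3, 7, 5]:
    heap.insert(k)
print([heap.deletemin() for _ in range(4)])  # [3, 5, 7, 10]
```

Bubbling up does one comparison per level, while sifting down must inspect all children of a node at each level, which is the asymmetry between the insert and deletemin bounds.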
The array representation of a binary heap is easily extended to the d-ary case. This time, node number $j$ has parent $\lceil (j - 1)/d \rceil$ and children $\{(j - 1)d + 2, \ldots, \min\{n, (j - 1)d + d + 1\}\}$ (Exercise 4.16).

Figure 4.11 (a) A binary heap with 10 elements. Only the key values are shown. (b)–(d) The intermediate "bubble-up" steps in inserting an element with key 7. (e)–(g) The "sift-down" steps in a delete-min operation.

Figure 4.12 Dijkstra's algorithm will not work if there are negative edges.
[Figure: a graph on nodes $S$, $A$, $B$ with edge lengths 3, 4, and −2.]

4.6 Shortest paths in the presence of negative edges

4.6.1 Negative edges

Dijkstra's algorithm works in part because the shortest path from the starting point $s$ to any node $v$ must pass exclusively through nodes that are closer than $v$. This no longer holds when edge lengths can be negative. In Figure 4.12, the shortest path from $S$ to $A$ passes through $B$, a node that is further away!

What needs to be changed in order to accommodate this new complication? To answer this, let's take a particular high-level view of Dijkstra's algorithm. A crucial invariant is that the dist values it maintains are always either overestimates or exactly correct. They start off at $\infty$, and the only way they ever change is by updating along an edge:

    procedure update((u, v) ∈ E)
    dist(v) = min{dist(v), dist(u) + l(u, v)}

This update operation is simply an expression of the fact that the distance to $v$ cannot possibly be more than the distance to $u$, plus $l(u, v)$. It has the following properties.

1. It gives the correct distance to $v$ in the particular case where $u$ is the second-last node in the shortest path to $v$, and dist(u) is correctly set.

2. It will never make dist(v) too small, and in this sense it is safe.
For instance, a slew of extraneous update's can't hurt.

This operation is extremely useful: it is harmless, and if used carefully, will correctly set distances. In fact, Dijkstra's algorithm can be thought of simply as a sequence of update's. We know this particular sequence doesn't work with negative edges, but is there some other sequence that does? To get a sense of the properties this sequence must possess, let's pick a node $t$ and look at the shortest path to it from $s$.

$$s \to u_1 \to u_2 \to u_3 \to \cdots \to u_k \to t$$

This path can have at most $|V| - 1$ edges (do you see why?). If the sequence of updates performed includes $(s, u_1), (u_1, u_2), (u_2, u_3), \ldots, (u_k, t)$, in that order (though not necessarily consecutively), then by the first property the distance to $t$ will be correctly computed. It doesn't matter what other updates occur on these edges, or what happens in the rest of the graph, because updates are safe.

But still, if we don't know all the shortest paths beforehand, how can we be sure to update the right edges in the right order? Here is an easy solution: simply update all the edges, $|V| - 1$ times! The resulting $O(|V| \cdot |E|)$ procedure is called the Bellman-Ford algorithm and is shown in Figure 4.13, with an example run in Figure 4.14.

Figure 4.13 The Bellman-Ford algorithm for single-source shortest paths in general graphs.

    procedure shortest-paths(G, l, s)
    Input:  Directed graph G = (V, E);
            edge lengths {l_e : e ∈ E} with no negative cycles;
            vertex s ∈ V
    Output: For all vertices u reachable from s, dist(u) is set
            to the distance from s to u.

    for all u ∈ V:
        dist(u) = ∞
        prev(u) = nil

    dist(s) = 0
    repeat |V| − 1 times:
        for all e ∈ E:
            update(e)

A note about implementation: for many graphs, the maximum number of edges in any shortest path is substantially less than $|V| - 1$, with the result that fewer rounds of updates are needed.
Therefore, it makes sense to add an extra check to the shortest-path algorithm, to make it terminate immediately after any round in which no update occurred.

4.6.2 Negative cycles

If the length of edge $(E, B)$ in Figure 4.14 were changed to $-4$, the graph would have a negative cycle $A \to E \to B \to A$. In such situations, it doesn't make sense to even ask about shortest paths. There is a path of length 2 from $A$ to $E$. But going round the cycle, there's also a path of length 1, and going round multiple times, we find paths of lengths $0, -1, -2$, and so on.

The shortest-path problem is ill-posed in graphs with negative cycles. As might be expected, our algorithm from Section 4.6.1 works only in the absence of such cycles. But where did this assumption appear in the derivation of the algorithm? Well, it slipped in when we asserted the existence of a shortest path from $s$ to $t$.

Fortunately, it is easy to automatically detect negative cycles and issue a warning. Such a cycle would allow us to endlessly apply rounds of update operations, reducing dist estimates every time. So instead of stopping after $|V| - 1$ iterations, perform one extra round. There is a negative cycle if and only if some dist value is reduced during this final round.

Figure 4.14 The Bellman-Ford algorithm illustrated on a sample graph.
[Figure: a graph on nodes $S$, $A$, $B$, $C$, $D$, $E$, $F$, $G$ with edge lengths.]

            Iteration
    Node    0    1    2    3    4    5    6    7
    S       0    0    0    0    0    0    0    0
    A       ∞    10   10   5    5    5    5    5
    B       ∞    ∞    ∞    10   6    5    5    5
    C       ∞    ∞    ∞    ∞    11   7    6    6
    D       ∞    ∞    ∞    ∞    ∞    14   10   9
    E       ∞    ∞    12   8    7    7    7    7
    F       ∞    ∞    9    9    9    9    9    9
    G       ∞    8    8    8    8    8    8    8

4.7 Shortest paths in dags

There are two subclasses of graphs that automatically exclude the possibility of negative cycles: graphs without negative edges, and graphs without cycles. We already know how to efficiently handle the former. We will now see how the single-source shortest-path problem can be solved in just linear time on directed acyclic graphs.

As before, we need to perform a sequence of updates that includes every shortest path as a subsequence.
The key source of efficiency is that

    In any path of a dag, the vertices appear in increasing linearized order.

Therefore, it is enough to linearize (that is, topologically sort) the dag by depth-first search, and then visit the vertices in sorted order, updating the edges out of each. The algorithm is given in Figure 4.15.

Notice that our scheme doesn't require edges to be positive. In particular, we can find longest paths in a dag by the same algorithm: just negate all edge lengths.

Figure 4.15 A single-source shortest-path algorithm for directed acyclic graphs.

    procedure dag-shortest-paths(G, l, s)
    Input:  Dag G = (V, E);
            edge lengths {l_e : e ∈ E}; vertex s ∈ V
    Output: For all vertices u reachable from s, dist(u) is set
            to the distance from s to u.

    for all u ∈ V:
        dist(u) = ∞
        prev(u) = nil

    dist(s) = 0
    Linearize G
    for each u ∈ V, in linearized order:
        for all edges (u, v) ∈ E:
            update(u, v)

Exercises

4.1. Suppose Dijkstra's algorithm is run on the following graph, starting at node $A$.
[Figure: a graph on nodes $A$ through $H$ with edge lengths.]
(a) Draw a table showing the intermediate distance values of all the nodes at each iteration of the algorithm.
(b) Show the final shortest-path tree.

4.2. Just like the previous problem, but this time with the Bellman-Ford algorithm.
[Figure: a graph on nodes $S$ and $A$ through $I$ with edge lengths, some of them negative.]

4.3. Squares. Design and analyze an algorithm that takes as input an undirected graph $G = (V, E)$ and determines whether $G$ contains a simple cycle (that is, a cycle which doesn't intersect itself) of length four. Its running time should be at most $O(|V|^3)$.
You may assume that the input graph is represented either as an adjacency matrix or with adjacency lists, whichever makes your algorithm simpler.

4.4. Here's a proposal for how to find the length of the shortest cycle in an undirected graph with unit edge lengths.
When a back edge, say $(v, w)$, is encountered during a depth-first search, it forms a cycle with the tree edges from $w$ to $v$. The length of the cycle is $\mathrm{level}[v] - \mathrm{level}[w] + 1$, where the level of a vertex is its distance in the DFS tree from the root vertex. This suggests the following algorithm:
    • Do a depth-first search, keeping track of the level of each vertex.
    • Each time a back edge is encountered, compute the cycle length and save it if it is smaller than the shortest one previously seen.
Show that this strategy does not always work by providing a counterexample as well as a brief (one or two sentence) explanation.

4.5. Often there are multiple shortest paths between two nodes of a graph. Give a linear-time algorithm for the following task.
    Input: Undirected graph $G = (V, E)$ with unit edge lengths; nodes $u, v \in V$.
    Output: The number of distinct shortest paths from $u$ to $v$.

4.6. Prove that for the array prev computed by Dijkstra's algorithm, the edges $\{u, \mathrm{prev}[u]\}$ (for all $u \in V$) form a tree.

4.7. You are given a directed graph $G = (V, E)$ with (possibly negative) weighted edges, along with a specific node $s \in V$ and a tree $T = (V, E')$, $E' \subseteq E$. Give an algorithm that checks whether $T$ is a shortest-path tree for $G$ with starting point $s$. Your algorithm should run in linear time.

4.8. Professor F. Lake suggests the following algorithm for finding the shortest path from node $s$ to node $t$ in a directed graph with some negative edges: add a large constant to each edge weight so that all the weights become positive, then run Dijkstra's algorithm starting at node $s$, and return the shortest path found to node $t$.
Is this a valid method? Either prove that it works correctly, or give a counterexample.

4.9. Consider a directed graph in which the only negative edges are those that leave $s$; all other edges are positive. Can Dijkstra's algorithm, started at $s$, fail on such a graph? Prove your answer.

4.10.
You are given a directed graph with (possibly negative) weighted edges, in which the shortest path between any two vertices is guaranteed to have at most $k$ edges. Give an algorithm that finds the shortest path between two vertices $u$ and $v$ in $O(k|E|)$ time.

4.11. Give an algorithm that takes as input a directed graph with positive edge lengths, and returns the length of the shortest cycle in the graph (if the graph is acyclic, it should say so). Your algorithm should take time at most $O(|V|^3)$.

4.12. Give an $O(|V|^2)$ algorithm for the following task.
    Input: An undirected graph $G = (V, E)$; edge lengths $l_e > 0$; an edge $e \in E$.
    Output: The length of the shortest cycle containing edge $e$.

4.13. You are given a set of cities, along with the pattern of highways between them, in the form of an undirected graph $G = (V, E)$. Each stretch of highway $e \in E$ connects two of the cities, and you know its length in miles, $l_e$. You want to get from city $s$ to city $t$. There's one problem: your car can only hold enough gas to cover $L$ miles. There are gas stations in each city, but not between cities. Therefore, you can only take a route if every one of its edges has length $l_e \le L$.
(a) Given the limitation on your car's fuel tank capacity, show how to determine in linear time whether there is a feasible route from $s$ to $t$.
(b) You are now planning to buy a new car, and you want to know the minimum fuel tank capacity that is needed to travel from $s$ to $t$. Give an $O((|V| + |E|) \log |V|)$ algorithm to determine this.

4.14. You are given a strongly connected directed graph $G = (V, E)$ with positive edge weights along with a particular node $v_0 \in V$. Give an efficient algorithm for finding shortest paths between all pairs of nodes, with the one restriction that these paths must all pass through $v_0$.

4.15. Shortest paths are not always unique: sometimes there are two or more different paths with the minimum possible length. Show how to solve the following problem in $O((|V| + |E|) \log |V|)$ time.
    Input: An undirected graph $G = (V, E)$; edge lengths $l_e > 0$; starting vertex $s \in V$.
    Output: A Boolean array usp[·]: for each node $u$, the entry usp[u] should be true if and only if there is a unique shortest path from $s$ to $u$. (Note: usp[s] = true.)

Figure 4.16 Operations on a binary heap.

    procedure insert(h, x)
    bubbleup(h, x, |h| + 1)

    procedure decreasekey(h, x)
    bubbleup(h, x, h⁻¹(x))

    function deletemin(h)
    if |h| = 0:
        return null
    else:
        x = h(1)
        siftdown(h, h(|h|), 1)
        return x

    function makeheap(S)
    h = empty array of size |S|
    for x ∈ S:
        h(|h| + 1) = x
    for i = |S| downto 1:
        siftdown(h, h(i), i)
    return h

    procedure bubbleup(h, x, i)
    (place element x in position i of h, and let it bubble up)
    p = ⌊i/2⌋
    while i ≠ 1 and key(h(p)) > key(x):
        h(i) = h(p); i = p; p = ⌊i/2⌋
    h(i) = x

    procedure siftdown(h, x, i)
    (place element x in position i of h, and let it sift down)
    c = minchild(h, i)
    while c ≠ 0 and key(h(c)) < key(x):
        h(i) = h(c); i = c; c = minchild(h, i)
    h(i) = x

    function minchild(h, i)
    (return the index of the smallest child of h(i))
    if 2i > |h|:
        return 0  (no children)
    else:
        return arg min{key(h(j)) : 2i ≤ j ≤ min{|h|, 2i + 1}}

4.16. Section 4.5.2 describes a way of storing a complete binary tree of $n$ nodes in an array indexed by $1, 2, \ldots, n$.
(a) Consider the node at position $j$ of the array. Show that its parent is at position $\lfloor j/2 \rfloor$ and its children are at $2j$ and $2j + 1$ (if these numbers are $\le n$).
(b) What are the corresponding indices when a complete d-ary tree is stored in an array?
Figure 4.16 shows pseudocode for a binary heap, modeled on an exposition by R.E. Tarjan.² The heap is stored as an array $h$, which is assumed to support two constant-time operations:
    • $|h|$, which returns the number of elements currently in the array;
    • $h^{-1}$, which returns the position of an element within the array.
The latter can always be achieved by maintaining the values of $h^{-1}$ as an auxiliary array.
(c) Show that the makeheap procedure takes $O(n)$ time when called on a set of $n$ elements. What is the worst-case input?
(Hint: Start by showing that the running time is at most $\sum_{i=1}^{n} \log(n/i)$.)
(d) What needs to be changed to adapt this pseudocode to $d$-ary heaps?

4.17. Suppose we want to run Dijkstra's algorithm on a graph whose edge weights are integers in the range $0, 1, \ldots, W$, where $W$ is a relatively small number.
(a) Show how Dijkstra's algorithm can be made to run in time $O(W|V| + |E|)$.
(b) Show an alternative implementation that takes time just $O((|V| + |E|) \log W)$.

4.18. In cases where there are several different shortest paths between two nodes (and edges have varying lengths), the most convenient of these paths is often the one with fewest edges. For instance, if nodes represent cities and edge lengths represent costs of flying between cities, there might be many ways to get from city $s$ to city $t$ which all have the same cost. The most convenient of these alternatives is the one which involves the fewest stopovers. Accordingly, for a specific starting node $s$, define

best[u] = minimum number of edges in a shortest path from $s$ to $u$.

In the example below, the best values for nodes $S, A, B, C, D, E, F$ are $0, 1, 1, 1, 2, 2, 3$, respectively.

[Figure: example graph on nodes S, A, B, C, D, E, F with edge lengths.]

Give an efficient algorithm for the following problem.
Input: Graph $G = (V, E)$; positive edge lengths $\ell_e$; starting node $s \in V$.
Output: The values of best[u] should be set for all nodes $u \in V$.

² See: R. E. Tarjan, Data Structures and Network Algorithms, Society for Industrial and Applied Mathematics, 1983.

4.19. Generalized shortest-paths problem. In Internet routing, there are delays on lines but also, more significantly, delays at routers. This motivates a generalized shortest-paths problem.
Suppose that in addition to having edge lengths $\{\ell_e : e \in E\}$, a graph also has vertex costs $\{c_v : v \in V\}$. Now define the cost of a path to be the sum of its edge lengths, plus the costs of all vertices on the path (including the endpoints). Give an efficient algorithm for the following problem.
Input: A directed graph $G = (V, E)$; positive edge lengths $\ell_e$ and positive vertex costs $c_v$; a starting vertex $s \in V$.
Output: An array cost[·] such that for every vertex $u$, cost[u] is the least cost of any path from $s$ to $u$ (i.e., the cost of the cheapest path), under the definition above.
Notice that cost[s] = $c_s$.

4.20. There is a network of roads $G = (V, E)$ connecting a set of cities $V$. Each road in $E$ has an associated length $\ell_e$. There is a proposal to add one new road to this network, and there is a list $E'$ of pairs of cities between which the new road can be built. Each such potential road $e' \in E'$ has an associated length. As a designer for the public works department you are asked to determine the road $e' \in E'$ whose addition to the existing network $G$ would result in the maximum decrease in the driving distance between two fixed cities $s$ and $t$ in the network. Give an efficient algorithm for solving this problem.

4.21. Shortest path algorithms can be applied in currency trading. Let $c_1, c_2, \ldots, c_n$ be various currencies; for instance, $c_1$ might be dollars, $c_2$ pounds, and $c_3$ lire. For any two currencies $c_i$ and $c_j$, there is an exchange rate $r_{i,j}$; this means that you can purchase $r_{i,j}$ units of currency $c_j$ in exchange for one unit of $c_i$. These exchange rates satisfy the condition that $r_{i,j} \cdot r_{j,i} < 1$, so that if you start with a unit of currency $c_i$, change it into currency $c_j$ and then convert back to currency $c_i$, you end up with less than one unit of currency $c_i$ (the difference is the cost of the transaction).
(a) Give an efficient algorithm for the following problem: Given a set of exchange rates $r_{i,j}$, and two currencies $s$ and $t$, find the most advantageous sequence of currency exchanges for converting currency $s$ into currency $t$. Toward this goal, you should represent the currencies and rates by a graph whose edge lengths are real numbers.
The exchange rates are updated frequently, reflecting the demand and supply of the various currencies.
Occasionally the exchange rates satisfy the following property: there is a sequence of currencies $c_{i_1}, c_{i_2}, \ldots, c_{i_k}$ such that $r_{i_1,i_2} \cdot r_{i_2,i_3} \cdots r_{i_{k-1},i_k} \cdot r_{i_k,i_1} > 1$. This means that by starting with a unit of currency $c_{i_1}$ and then successively converting it to currencies $c_{i_2}, c_{i_3}, \ldots, c_{i_k}$, and finally back to $c_{i_1}$, you would end up with more than one unit of currency $c_{i_1}$. Such anomalies last only a fraction of a minute on the currency exchange, but they provide an opportunity for risk-free profits.
(b) Give an efficient algorithm for detecting the presence of such an anomaly. Use the graph representation you found above.

4.22. The tramp steamer problem. You are the owner of a steamship that can ply between a group of port cities $V$. You make money at each port: a visit to city $i$ earns you a profit of $p_i$ dollars. Meanwhile, the transportation cost from port $i$ to port $j$ is $c_{ij} > 0$. You want to find a cyclic route in which the ratio of profit to cost is maximized.
To this end, consider a directed graph $G = (V, E)$ whose nodes are ports, and which has edges between each pair of ports. For any cycle $C$ in this graph, the profit-to-cost ratio is

$$r(C) = \frac{\sum_{(i,j) \in C} p_j}{\sum_{(i,j) \in C} c_{ij}}.$$

Let $r^*$ be the maximum ratio achievable by a simple cycle. One way to determine $r^*$ is by binary search: by first guessing some ratio $r$, and then testing whether it is too large or too small.
Consider any positive $r > 0$. Give each edge $(i, j)$ a weight of $w_{ij} = r c_{ij} - p_j$.
(a) Show that if there is a cycle of negative weight, then $r < r^*$.
(b) Show that if all cycles in the graph have strictly positive weight, then $r > r^*$.
(c) Give an efficient algorithm that takes as input a desired accuracy $\epsilon > 0$ and returns a simple cycle $C$ for which $r(C) \ge r^* - \epsilon$. Justify the correctness of your algorithm and analyze its running time in terms of $|V|$, $\epsilon$, and $R = \max_{(i,j) \in E} (p_j / c_{ij})$.

Chapter 5
Greedy algorithms

A game like chess can be won only by thinking ahead: a player who is focused entirely on immediate advantage is easy to defeat.
But in many other games, such as Scrabble, it is possible to do quite well by simply making whichever move seems best at the moment and not worrying too much about future consequences.
This sort of myopic behavior is easy and convenient, making it an attractive algorithmic strategy. Greedy algorithms build up a solution piece by piece, always choosing the next piece that offers the most obvious and immediate benefit. Although such an approach can be disastrous for some computational tasks, there are many for which it is optimal. Our first example is that of minimum spanning trees.

5.1 Minimum spanning trees

Suppose you are asked to network a collection of computers by linking selected pairs of them. This translates into a graph problem in which nodes are computers, undirected edges are potential links, and the goal is to pick enough of these edges that the nodes are connected. But this is not all; each link also has a maintenance cost, reflected in that edge's weight. What is the cheapest possible network?

[Figure: a weighted undirected graph on nodes A, B, C, D, E, F.]

One immediate observation is that the optimal set of edges cannot contain a cycle, because removing an edge from this cycle would reduce the cost without compromising connectivity:

Property 1 Removing a cycle edge cannot disconnect a graph.

So the solution must be connected and acyclic: undirected graphs of this kind are called trees. The particular tree we want is the one with minimum total weight, known as the minimum spanning tree. Here is its formal definition.

Input: An undirected graph $G = (V, E)$; edge weights $w_e$.
Output: A tree $T = (V, E')$, with $E' \subseteq E$, that minimizes

$$\mathrm{weight}(T) = \sum_{e \in E'} w_e.$$

In the preceding example, the minimum spanning tree has a cost of 16:

[Figure: a minimum spanning tree of the graph above, with edge weights 1, 4, 2, 5, 4.]

However, this is not the only optimal solution. Can you spot another?

5.1.1 A greedy approach

Kruskal's minimum spanning tree algorithm starts with the empty graph and then selects edges from $E$ according to the following rule.
Repeatedly add the next lightest edge that doesn't produce a cycle.

In other words, it constructs the tree edge by edge and, apart from taking care to avoid cycles, simply picks whichever edge is cheapest at the moment. This is a greedy algorithm: every decision it makes is the one with the most obvious immediate advantage.
Figure 5.1 shows an example. We start with an empty graph and then attempt to add edges in increasing order of weight (ties are broken arbitrarily):

B-C, C-D, B-D, C-F, D-F, E-F, A-D, A-B, C-E, A-C.

The first two succeed, but the third, B-D, would produce a cycle if added. So we ignore it and move along. The final result is a tree with cost 14, the minimum possible.
The correctness of Kruskal's method follows from a certain cut property, which is general enough to also justify a whole slew of other minimum spanning tree algorithms.

Figure 5.1 The minimum spanning tree found by Kruskal's algorithm.
[Figure: left, the weighted graph on nodes A, B, C, D, E, F; right, its minimum spanning tree.]

Trees

A tree is an undirected graph that is connected and acyclic. Much of what makes trees so useful is the simplicity of their structure. For instance,

Property 2 A tree on $n$ nodes has $n - 1$ edges.

This can be seen by building the tree one edge at a time, starting from an empty graph. Initially each of the $n$ nodes is disconnected from the others, in a connected component by itself. As edges are added, these components merge. Since each edge unites two different components, exactly $n - 1$ edges are added by the time the tree is fully formed.
In a little more detail: When a particular edge $\{u, v\}$ comes up, we can be sure that $u$ and $v$ lie in separate connected components, for otherwise there would already be a path between them and this edge would create a cycle. Adding the edge then merges these two components, thereby reducing the total number of connected components by one.
Over the course of this incremental process, the number of components decreases from $n$ to one, meaning that $n - 1$ edges must have been added along the way.
The converse is also true.

Property 3 Any connected, undirected graph $G = (V, E)$ with $|E| = |V| - 1$ is a tree.

We just need to show that $G$ is acyclic. One way to do this is to run the following iterative procedure on it: while the graph contains a cycle, remove one edge from this cycle. The process terminates with some graph $G' = (V, E')$, $E' \subseteq E$, which is acyclic and, by Property 1 (from page 139), is also connected. Therefore $G'$ is a tree, whereupon $|E'| = |V| - 1$ by Property 2. So $E' = E$, no edges were removed, and $G$ was acyclic to start with.
In other words, we can tell whether a connected graph is a tree just by counting how many edges it has. Here's another characterization.

Property 4 An undirected graph is a tree if and only if there is a unique path between any pair of nodes.

In a tree, any two nodes can only have one path between them; for if there were two paths, the union of these paths would contain a cycle.
On the other hand, if a graph has a path between any two nodes, then it is connected. If these paths are unique, then the graph is also acyclic (since a cycle has two paths between any pair of nodes).

Figure 5.2 $T \cup \{e\}$. The addition of $e$ (dotted) to $T$ (solid lines) produces a cycle. This cycle must contain at least one other edge, shown here as $e'$, across the cut $(S, V - S)$.
[Figure: tree $T$ with the added edge $e$ and the edge $e'$, both crossing between $S$ and $V - S$.]

5.1.2 The cut property

Say that in the process of building a minimum spanning tree (MST), we have already chosen some edges and are so far on the right track. Which edge should we add next? The following lemma gives us a lot of flexibility in our choice.

Cut property Suppose edges $X$ are part of a minimum spanning tree of $G = (V, E)$.
Pick any subset of nodes $S$ for which $X$ does not cross between $S$ and $V - S$, and let $e$ be the lightest edge across this partition. Then $X \cup \{e\}$ is part of some MST.

A cut is any partition of the vertices into two groups, $S$ and $V - S$. What this property says is that it is always safe to add the lightest edge across any cut (that is, between a vertex in $S$ and one in $V - S$), provided $X$ has no edges across the cut.
Let's see why this holds. Edges $X$ are part of some MST $T$; if the new edge $e$ also happens to be part of $T$, then there is nothing to prove. So assume $e$ is not in $T$. We will construct a different MST $T'$ containing $X \cup \{e\}$ by altering $T$ slightly, changing just one of its edges.
Add edge $e$ to $T$. Since $T$ is connected, it already has a path between the endpoints of $e$, so adding $e$ creates a cycle. This cycle must also have some other edge $e'$ across the cut $(S, V - S)$ (Figure 5.2). If we now remove this edge, we are left with $T' = T \cup \{e\} - \{e'\}$, which we will show to be a tree. $T'$ is connected by Property 1, since $e'$ is a cycle edge. And it has the same number of edges as $T$; so by Properties 2 and 3, it is also a tree.
Moreover, $T'$ is a minimum spanning tree. Compare its weight to that of $T$:

$$\mathrm{weight}(T') = \mathrm{weight}(T) + w(e) - w(e').$$

Both $e$ and $e'$ cross between $S$ and $V - S$, and $e$ is specifically the lightest edge of this type. Therefore $w(e) \le w(e')$, and $\mathrm{weight}(T') \le \mathrm{weight}(T)$. Since $T$ is an MST, it must be the case that $\mathrm{weight}(T') = \mathrm{weight}(T)$ and that $T'$ is also an MST.
Figure 5.3 shows an example of the cut property. Which edge is $e'$?

Figure 5.3 The cut property at work. (a) An undirected graph. (b) Set $X$ has three edges, and is part of the MST $T$ on the right. (c) If $S = \{A, B, C, D\}$, then one of the minimum-weight edges across the cut $(S, V - S)$ is $e = \{D, E\}$. $X \cup \{e\}$ is part of MST $T'$, shown on the right.
[Figure: (a) the weighted graph on nodes A, B, C, D, E, F; (b) edges $X$ and MST $T$; (c) the cut $(S, V - S)$ with edge $e$, and MST $T'$.]

5.1.3 Kruskal's algorithm

We are ready to justify Kruskal's algorithm. At any given moment, the edges it has already chosen form a partial solution, a collection of connected components each of which has a tree structure. The next edge $e$ to be added connects two of these components; call them $T_1$ and $T_2$. Since $e$ is the lightest edge that doesn't produce a cycle, it is certain to be the lightest edge between $T_1$ and $V - T_1$ and therefore satisfies the cut property.
Now we fill in some implementation details. At each stage, the algorithm chooses an edge to add to its current partial solution. To do so, it needs to test each candidate edge $u - v$ to see whether the endpoints $u$ and $v$ lie in different components; otherwise the edge produces a cycle. And once an edge is chosen, the corresponding components need to be merged. What kind of data structure supports such operations?
We will model the algorithm's state as a collection of disjoint sets, each of which contains the nodes of a particular component. Initially each node is in a component by itself:

makeset(x): create a singleton set containing just x.

We repeatedly test pairs of nodes to see if they belong to the same set.

find(x): to which set does x belong?

And whenever we add an edge, we are merging two components.

union(x, y): merge the sets containing x and y.

Figure 5.4 Kruskal's minimum spanning tree algorithm.

    procedure kruskal(G, w)
    Input: A connected undirected graph G = (V, E) with edge weights w_e
    Output: A minimum spanning tree defined by the edges X

    for all u ∈ V:
        makeset(u)

    X = {}
    Sort the edges E by weight
    for all edges {u, v} ∈ E, in increasing order of weight:
        if find(u) ≠ find(v):
            add edge {u, v} to X
            union(u, v)

The final algorithm is shown in Figure 5.4. It uses $|V|$ makeset, $2|E|$ find, and $|V| - 1$ union operations.
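The procedure of Figure 5.4 can be sketched in Python as follows. This is only an illustration, not a definitive implementation: the names `kruskal`, `parent`, `nodes`, and `edges` are choices made here, the disjoint sets are stored in the simplest possible way (a parent pointer per node, with no attempt at efficiency), and the edge weights are an assumption read off Figure 5.1, chosen to be consistent with the edge order and the total cost of 14 quoted earlier.

```python
def kruskal(nodes, edges):
    """Return (tree_edges, total_weight); edges are (weight, u, v) triples."""
    # makeset: every node starts out as the representative of its own set.
    parent = {u: u for u in nodes}

    def find(x):
        # Follow parent pointers up to the representative of x's set.
        while parent[x] != x:
            x = parent[x]
        return x

    def union(x, y):
        # Merge two sets by pointing one representative at the other.
        parent[find(x)] = find(y)

    tree, total = [], 0
    # sorted() is stable, so edges of equal weight keep their input order.
    for w, u, v in sorted(edges, key=lambda e: e[0]):
        if find(u) != find(v):  # endpoints lie in different components
            tree.append((u, v))
            total += w
            union(u, v)
    return tree, total


# Hypothetical weights for the graph of Figure 5.1 (an assumption).
nodes = "ABCDEF"
edges = [(1, "B", "C"), (2, "C", "D"), (2, "B", "D"), (3, "C", "F"),
         (4, "D", "F"), (4, "E", "F"), (4, "A", "D"), (5, "A", "B"),
         (5, "C", "E"), (6, "A", "C")]
tree, total = kruskal(nodes, edges)
print(tree, total)
```

With these weights the edges B-D, D-F, A-B, C-E, and A-C are rejected because their endpoints already share a component, which matches the walk-through of Figure 5.1.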
5.1.4 A data structure for disjoint sets

Union by rank

One way to store a set is as a directed tree (Figure 5.5). Nodes of the tree are elements of the set, arranged in no particular order, and each has parent pointers that eventually lead up to the root of the tree. This root element is a convenient representative, or name, for the set. It is distinguished from the other elements by the fact that its parent pointer is a self-loop.

Figure 5.5 A directed-tree representation of two sets $\{B, E\}$ and $\{A, C, D, F, G, H\}$.
[Figure: two directed trees over the elements A through H.]

In addition to a parent pointer $\pi$, each node also has a rank that, for the time being, should be interpreted as the height of the subtree hanging from that node.

    procedure makeset(x)
    π(x) = x
    rank(x) = 0

    function find(x)
    while x ≠ π(x): x = π(x)
    return x

As can be expected, makeset is a constant-time operation. On the other hand, find follows parent pointers to the root of the tree and therefore takes time proportional to the height of the tree. The tree actually gets built via the third operation, union, and so we must make sure that this procedure keeps trees shallow.
Merging two sets is easy: make the root of one point to the root of the other. But we have a choice here. If the representatives (roots) of the sets are $r_x$ and $r_y$, do we make $r_x$ point to $r_y$ or the other way around? Since tree height is the main impediment to computational efficiency, a good strategy is to make the root of the shorter tree point to the root of the taller tree. This way, the overall height increases only if the two trees being merged are equally tall. Instead of explicitly computing heights of trees, we will use the rank numbers of their root nodes, which is why this scheme is called union by rank.

    procedure union(x, y)
    r_x = find(x)
    r_y = find(y)
    if r_x = r_y: return
    if rank(r_x) > rank(r_y):
        π(r_y) = r_x
    else:
        π(r_x) = r_y
        if rank(r_x) = rank(r_y): rank(r_y) = rank(r_y) + 1

See Figure 5.6 for an example.
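The three procedures above translate almost line for line into Python. The sketch below is illustrative only: the dictionaries `parent` and `rank` stand in for $\pi$ and rank, and the helper `height` is an addition made here purely to check that rank equals subtree height. The sequence of operations is the one named in the caption of Figure 5.6.

```python
parent = {}  # parent pointer pi(x)
rank = {}    # rank(x)


def makeset(x):
    parent[x] = x
    rank[x] = 0


def find(x):
    # Follow parent pointers until reaching the self-loop at the root.
    while x != parent[x]:
        x = parent[x]
    return x


def union(x, y):
    rx, ry = find(x), find(y)
    if rx == ry:
        return
    if rank[rx] > rank[ry]:
        parent[ry] = rx          # shorter tree hangs under the taller one
    else:
        parent[rx] = ry
        if rank[rx] == rank[ry]:
            rank[ry] += 1        # equal heights: the merged tree grows by one


def height(x):
    # Height of the subtree hanging from x (helper for checking only).
    kids = [c for c in parent if parent[c] == x and c != x]
    return 1 + max(height(c) for c in kids) if kids else 0


for x in "ABCDEFG":
    makeset(x)
for x, y in [("A", "D"), ("B", "E"), ("C", "F"),
             ("C", "G"), ("E", "A"), ("B", "G")]:
    union(x, y)

print({x: find(x) for x in "ABCDEFG"})
print(rank)
```

After the six unions every element has the same representative, and each node's rank agrees with the height of its subtree, as the text goes on to observe.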
By design, the rank of a node is exactly the height of the subtree rooted at that node. This means, for instance, that as you move up a path toward a root node, the rank values along the way are strictly increasing.

Property 1 For any nonroot $x$, $\mathrm{rank}(x) < \mathrm{rank}(\pi(x))$.

A root node with rank $k$ is created by the merger of two trees with roots of rank $k - 1$. It follows by induction (try it!) that

Property 2 Any root node of rank $k$ has at least $2^k$ nodes in its tree.

This extends to internal (nonroot) nodes as well: a node of rank $k$ has at least $2^k$ descendants. After all, any internal node was once a root, and neither its rank nor its set of descendants has changed since then. Moreover, different rank-$k$ nodes cannot have common descendants, since by Property 1 any element has at most one ancestor of rank $k$. Which means

Property 3 If there are $n$ elements overall, there can be at most $n/2^k$ nodes of rank $k$.

This last observation implies, crucially, that the maximum rank is $\log n$. Therefore, all the trees have height $\le \log n$, and this is an upper bound on the running time of find and union.

Figure 5.6 A sequence of disjoint-set operations. Superscripts denote rank.
[Figure: the forest after makeset(A), makeset(B), ..., makeset(G); after union(A, D), union(B, E), union(C, F); after union(C, G), union(E, A); and after union(B, G).]

Path compression

With the data structure as presented so far, the total time for Kruskal's algorithm becomes $O(|E| \log |V|)$ for sorting the edges (remember, $\log |E| \approx \log |V|$) plus another $O(|E| \log |V|)$ for the union and find operations that dominate the rest of the algorithm. So there seems to be little incentive to make our data structure any more efficient.
But what if the edges are given to us sorted? Or if the weights are small (say, $O(|E|)$) so that sorting can be done in linear time?
Then the data structure part becomes the bottleneck, and it is useful to think about improving its performance beyond $\log n$ per operation. As it turns out, the improved data structure is useful in many other applications.
But how can we perform union and find operations faster than $\log n$? The answer is, by being a little more careful to maintain our data structure in good shape. As any housekeeper knows, a little extra effort put into routine maintenance can pay off handsomely in the long run, by forestalling major calamities. We have in mind a particular maintenance operation for our union-find data structure, intended to keep the trees short: during each find, when a series of parent pointers is followed up to the root of a tree, we will change all these pointers so that they point directly to the root (Figure 5.7). This path compression heuristic only slightly increases the time needed for a find and is easy to code.

    function find(x)
    if x ≠ π(x): π(x) = find(π(x))
    return π(x)

The benefit of this simple alteration is long-term rather than instantaneous and thus necessitates a particular kind of analysis: we need to look at sequences of find and union operations, starting from an empty data structure, and determine the average time per operation. This amortized cost turns out to be just barely more than $O(1)$, down from the earlier $O(\log n)$.
Think of the data structure as having a "top level" consisting of the root nodes, and below it, the insides of the trees. There is a division of labor: find operations (with or without path compression) only touch the insides of trees, whereas union operations only look at the top level. Thus path compression has no effect on union operations and leaves the top level unchanged.
We now know that the ranks of root nodes are unaltered, but what about nonroot nodes? The key point here is that once a node ceases to be a root, it never resurfaces, and its rank is forever fixed.
Therefore the ranks of all nodes are unchanged by path compression, even though these numbers can no longer be interpreted as tree heights. In particular, Properties 1-3 (from page 145) still hold.
If there are $n$ elements, their rank values can range from 0 to $\log n$ by Property 3. Let's divide the nonzero part of this range into certain carefully chosen intervals, for reasons that will soon become clear:

$$\{1\}, \{2\}, \{3, 4\}, \{5, 6, \ldots, 16\}, \{17, 18, \ldots, 2^{16} = 65536\}, \{65537, 65538, \ldots, 2^{65536}\}, \ldots$$

Each group is of the form $\{k + 1, k + 2, \ldots, 2^k\}$, where $k$ is a power of 2. The number of groups is $\log^* n$, which is defined to be the number of successive log operations that need to be applied to $n$ to bring it down to 1 (or below 1). For instance, $\log^* 1000 = 4$ since $\log \log \log \log 1000 \le 1$. In practice there will just be the first five of the intervals shown; more are needed only if $n \ge 2^{65536}$, in other words never.

Figure 5.7 The effect of path compression: find(I) followed by find(K).
[Figure: a tree rooted at A, shown before the two operations, after find(I), and after find(K).]

In a sequence of find operations, some may take longer than others. We'll bound the overall running time using some creative accounting. Specifically, we will give each node a certain amount of pocket money, such that the total money doled out is at most $n \log^* n$ dollars. We will then show that each find takes $O(\log^* n)$ steps, plus some additional amount of time that can be "paid for" using the pocket money of the nodes involved (one dollar per unit of time). Thus the overall time for $m$ find operations is $O(m \log^* n)$ plus at most $O(n \log^* n)$.
In more detail, a node receives its allowance as soon as it ceases to be a root, at which point its rank is fixed. If this rank lies in the interval $\{k + 1, \ldots, 2^k\}$, the node receives $2^k$ dollars.
By Property 3, the number of nodes with rank $> k$ is bounded by

$$\frac{n}{2^{k+1}} + \frac{n}{2^{k+2}} + \cdots \le \frac{n}{2^k}.$$

Therefore the total money given to nodes in this particular interval is at most $n$ dollars, and since there are $\log^* n$ intervals, the total money disbursed to all nodes is $\le n \log^* n$.
Now, the time taken by a specific find is simply the number of pointers followed. Consider the ascending rank values along this chain of nodes up to the root. Nodes $x$ on the chain fall into two categories: either the rank of $\pi(x)$ is in a higher interval than the rank of $x$, or else it lies in the same interval. There are at most $\log^* n$ nodes of the first type (do you see why?), so the work done on them takes $O(\log^* n)$ time. The remaining nodes, whose parents' ranks are in the same interval as theirs, have to pay a dollar out of their pocket money for their processing time.
This only works if the initial allowance of each node $x$ is enough to cover all of its payments in the sequence of find operations. Here's the crucial observation: each time $x$ pays a dollar, its parent changes to one of higher rank. Therefore, if $x$'s rank lies in the interval $\{k + 1, \ldots, 2^k\}$, it has to pay at most $2^k$ dollars before its parent's rank is in a higher interval; whereupon it never has to pay again.

A randomized algorithm for minimum cut

We have already seen that spanning trees and cuts are intimately related. Here is another connection. Let's remove the last edge that Kruskal's algorithm adds to the spanning tree; this breaks the tree into two components, thus defining a cut $(S, \bar{S})$ in the graph. What can we say about this cut? Suppose the graph we were working with was unweighted, and that its edges were ordered uniformly at random for Kruskal's algorithm to process them. Here is a remarkable fact: with probability at least $1/n^2$, $(S, \bar{S})$ is the minimum cut in the graph, where the size of a cut $(S, \bar{S})$ is the number of edges crossing between $S$ and $\bar{S}$.
This means that repeating the process $O(n^2)$ times and outputting the smallest cut found yields the minimum cut in $G$ with high probability: an $O(mn^2 \log n)$ algorithm for unweighted minimum cuts. Some further tuning gives the $O(n^2 \log n)$ minimum cut algorithm, invented by David Karger, which is the fastest known algorithm for this important problem.
So let us see why the cut found in each iteration is the minimum cut with probability at least $1/n^2$. At any stage of Kruskal's algorithm, the vertex set $V$ is partitioned into connected components. The only edges eligible to be added to the tree have their two endpoints in distinct components. The number of edges incident to each component must be at least $C$, the size of the minimum cut in $G$ (since we could consider a cut that separated this component from the rest of the graph). So if there are $k$ components in the graph, the number of eligible edges is at least $kC/2$ (each of the $k$ components has at least $C$ edges leading out of it, and we need to compensate for the double-counting of each edge). Since the edges were randomly ordered, the chance that the next eligible edge in the list is from the minimum cut is at most $C/(kC/2) = 2/k$. Thus, with probability at least $1 - 2/k = (k - 2)/k$, the choice leaves the minimum cut intact. But now the chance that Kruskal's algorithm leaves the minimum cut intact all the way up to the choice of the last spanning tree edge is at least

$$\frac{n-2}{n} \cdot \frac{n-3}{n-1} \cdot \frac{n-4}{n-2} \cdots \frac{2}{4} \cdot \frac{1}{3} = \frac{2}{n(n-1)},$$

which is at least $1/n^2$.

5.1.5 Prim's algorithm

Let's return to our discussion of minimum spanning tree algorithms. What the cut property tells us in most general terms is that any algorithm conforming to the following greedy schema is guaranteed to work.
    X = { } (edges picked so far)
    repeat until |X| = |V| − 1:
        pick a set S ⊂ V for which X has no edges between S and V − S
        let e ∈ E be the minimum-weight edge between S and V − S
        X = X ∪ {e}

A popular alternative to Kruskal's algorithm is Prim's, in which the intermediate set of edges $X$ always forms a subtree, and $S$ is chosen to be the set of this tree's vertices.
On each iteration, the subtree defined by $X$ grows by one edge, namely, the lightest edge between a vertex in $S$ and a vertex outside $S$ (Figure 5.8). We can equivalently think of $S$ as growing to include the vertex $v \notin S$ of smallest cost:

$$\mathrm{cost}(v) = \min_{u \in S} w(u, v).$$

Figure 5.8 Prim's algorithm: the edges $X$ form a tree, and $S$ consists of its vertices.
[Figure: the tree $X$ on vertex set $S$, with the edge $e$ crossing from $S$ to $V - S$.]

This is strongly reminiscent of Dijkstra's algorithm, and in fact the pseudocode (Figure 5.9) is almost identical. The only difference is in the key values by which the priority queue is ordered. In Prim's algorithm, the value of a node is the weight of the lightest incoming edge from set $S$, whereas in Dijkstra's it is the length of an entire path to that node from the starting point. Nonetheless, the two algorithms are similar enough that they have the same running time, which depends on the particular priority queue implementation.
Figure 5.9 shows Prim's algorithm at work, on a small six-node graph. Notice how the final MST is completely specified by the prev array.

Figure 5.9 Top: Prim's minimum spanning tree algorithm. Below: An illustration of Prim's algorithm, starting at node A. Also shown are a table of cost/prev values, and the final MST.
    procedure prim(G, w)
    Input: A connected undirected graph G = (V, E) with edge weights w_e
    Output: A minimum spanning tree defined by the array prev

    for all u ∈ V:
        cost(u) = ∞
        prev(u) = nil
    Pick any initial node u_0
    cost(u_0) = 0

    H = makequeue(V) (priority queue, using cost-values as keys)
    while H is not empty:
        v = deletemin(H)
        for each {v, z} ∈ E:
            if z is still in H and cost(z) > w(v, z):
                cost(z) = w(v, z)
                prev(z) = v
                decreasekey(H, z)

[Figure: the six-node weighted graph on A, B, C, D, E, F, and the final MST.]

    Set S        A      B      C      D      E      F
    {}           0/nil  ∞/nil  ∞/nil  ∞/nil  ∞/nil  ∞/nil
    A                   5/A    6/A    4/A    ∞/nil  ∞/nil
    A,D                 2/D    2/D           ∞/nil  4/D
    A,D,B                      1/B           ∞/nil  4/D
    A,D,B,C                                  5/C    3/C
    A,D,B,C,F                                4/F

5.2 Huffman encoding

In the MP3 audio compression scheme, a sound signal is encoded in three steps.

1. It is digitized by sampling at regular intervals, yielding a sequence of real numbers $s_1, s_2, \ldots, s_T$. For instance, at a rate of 44,100 samples per second, a 50-minute symphony would correspond to $T = 50 \times 60 \times 44{,}100 \approx 130$ million measurements.¹
2. Each real-valued sample $s_t$ is quantized: approximated by a nearby number from a finite set $\Gamma$. This set is carefully chosen to exploit human perceptual limitations, with the intention that the approximating sequence is indistinguishable from $s_1, s_2, \ldots, s_T$ by the human ear.
3. The resulting string of length $T$ over alphabet $\Gamma$ is encoded in binary.

It is in the last step that Huffman encoding is used. To understand its role, let's look at a toy example in which $T$ is 130 million and the alphabet $\Gamma$ consists of just four values, denoted by the symbols A, B, C, D. What is the most economical way to write this long string in binary? The obvious choice is to use 2 bits per symbol (say codeword 00 for A, 01 for B, 10 for C, and 11 for D), in which case 260 megabits are needed in total. Can there possibly be a better encoding than this?
In search of inspiration, we take a closer look at our particular sequence and find that the four symbols are not equally abundant.
    Symbol   Frequency
    A        70 million
    B         3 million
    C        20 million
    D        37 million

Is there some sort of variable-length encoding, in which just one bit is used for the frequently occurring symbol A, possibly at the expense of needing three or more bits for less common symbols?
A danger with having codewords of different lengths is that the resulting encoding may not be uniquely decipherable. For instance, if the codewords are $\{0, 01, 11, 001\}$, the decoding of strings like 001 is ambiguous. We will avoid this problem by insisting on the prefix-free property: no codeword can be a prefix of another codeword.
Any prefix-free encoding can be represented by a full binary tree (that is, a binary tree in which every node has either zero or two children) where the symbols are at the leaves, and where each codeword is generated by a path from root to leaf, interpreting left as 0 and right as 1 (Exercise 5.28). Figure 5.10 shows an example of such an encoding for the four symbols A, B, C, D. Decoding is unique: a string of bits is decoded by starting at the root, reading the string from left to right to move downward, and, whenever a leaf is reached, outputting the corresponding symbol and returning to the root. It is a simple scheme and pays off nicely for our toy example, where (under the codes of Figure 5.10) the total size of the binary string drops to 213 megabits, an 18% improvement.

¹ For stereo sound, two channels would be needed, doubling the number of samples.

Figure 5.10 A prefix-free encoding. Frequencies are shown in square brackets.

    Symbol   Codeword
    A        0
    B        100
    C        101
    D        11

[Figure: the corresponding tree. The root has left child A [70] and a right child of frequency [60]; that node has a left child of frequency [23], whose children are B [3] and C [20], and a right child D [37].]

In general, how do we find the optimal coding tree, given the frequencies $f_1, f_2, \ldots, f_n$ of $n$ symbols? To make the problem precise, we want a tree whose leaves each correspond to a symbol and which minimizes the overall length of the encoding,

$$\text{cost of tree} = \sum_{i=1}^{n} f_i \cdot (\text{depth of the } i\text{th symbol in the tree})$$

(the number of bits required for a symbol is exactly its depth in the tree).
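The prefix-free property, the decoding procedure, and the cost formula can all be checked on the code of Figure 5.10 with a few lines of Python. This is a minimal sketch: the names `CODE`, `FREQ`, `is_prefix_free`, `decode`, and `encoding_cost` are illustrative choices, and decoding uses a table lookup on the bits read so far rather than an explicit tree, which gives the same result for a prefix-free code.

```python
CODE = {"A": "0", "B": "100", "C": "101", "D": "11"}  # Figure 5.10
FREQ = {"A": 70, "B": 3, "C": 20, "D": 37}            # in millions


def is_prefix_free(codewords):
    # No codeword may be a prefix of a different codeword.
    words = list(codewords)
    return not any(a != b and b.startswith(a) for a in words for b in words)


def decode(bits, code):
    # Read bits left to right; emit a symbol whenever a codeword is completed.
    inverse = {word: symbol for symbol, word in code.items()}
    out, current = [], ""
    for bit in bits:
        current += bit
        if current in inverse:
            out.append(inverse[current])
            current = ""          # back to the root of the tree
    if current:
        raise ValueError("input ends in the middle of a codeword")
    return "".join(out)


def encoding_cost(freq, code):
    # Sum over symbols of frequency times depth (= codeword length).
    return sum(freq[s] * len(code[s]) for s in freq)


print(is_prefix_free(CODE.values()))
print(is_prefix_free(["0", "01", "11", "001"]))
print(decode("0100101110", CODE))
print(encoding_cost(FREQ, CODE))
```

The last line evaluates $70 \cdot 1 + 3 \cdot 3 + 20 \cdot 3 + 37 \cdot 2$, the 213 megabits mentioned in the text, against 260 for the fixed-length code.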
There is another way to write this cost function that is very helpful. Although we are only given frequencies for the leaves, we can define the frequency of any internal node to be the sum of the frequencies of its descendant leaves; this is, after all, the number of times the internal node is visited during encoding or decoding. During the encoding process, each time we move down the tree, one bit gets output for every nonroot node through which we pass. So the total cost (the total number of bits which are output) can also be expressed thus:

The cost of a tree is the sum of the frequencies of all leaves and internal nodes, except the root.

The first formulation of the cost function tells us that the two symbols with the smallest frequencies must be at the bottom of the optimal tree, as children of the lowest internal node (this internal node has two children since the tree is full). Otherwise, swapping these two symbols with whatever is lowest in the tree would improve the encoding.

This suggests that we start constructing the tree greedily: find the two symbols with the smallest frequencies, say $i$ and $j$, and make them children of a new node, which then has frequency $f_i + f_j$. To keep the notation simple, let's just assume these are $f_1$ and $f_2$. By the second formulation of the cost function, any tree in which $f_1$ and $f_2$ are sibling-leaves has cost $f_1 + f_2$ plus the cost for a tree with $n - 1$ leaves of frequencies $(f_1 + f_2), f_3, f_4, \ldots, f_n$:

[Figure: a coding tree with leaves $f_3$, $f_4$, $f_5$ and a node of frequency $f_1 + f_2$, below which hang the sibling leaves $f_1$ and $f_2$.]

The latter problem is just a smaller version of the one we started with. So we pull $f_1$ and $f_2$ off the list of frequencies, insert $(f_1 + f_2)$, and loop. The resulting algorithm can be described in terms of priority queue operations (as defined on page 120) and takes $O(n \log n)$ time if a binary heap (Section 4.5.2) is used.
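The greedy construction just described can be sketched in Python, with the standard `heapq` module playing the role of the binary heap. This is a minimal sketch under stated assumptions rather than a definitive implementation: the function name `huffman`, the tuple-based tree representation, and the tie-breaking counter are all choices made here for illustration, and the input is assumed to contain at least two symbols.

```python
import heapq
from itertools import count


def huffman(freq):
    """Return a dict {symbol: bit string} built by the greedy merging rule.

    freq maps each symbol to its frequency; at least two symbols are assumed.
    """
    if len(freq) < 2:
        raise ValueError("need at least two symbols")
    tiebreak = count()  # unique counter so heap entries never compare trees
    heap = [(f, next(tiebreak), ("leaf", s)) for s, f in freq.items()]
    heapq.heapify(heap)

    # Repeatedly pull off the two smallest frequencies and insert their sum.
    while len(heap) > 1:
        f1, _, t1 = heapq.heappop(heap)
        f2, _, t2 = heapq.heappop(heap)
        heapq.heappush(heap, (f1 + f2, next(tiebreak), ("node", t1, t2)))

    # Read codewords off the tree: left branch is 0, right branch is 1.
    _, _, root = heap[0]
    code = {}
    stack = [(root, "")]
    while stack:
        node, prefix = stack.pop()
        if node[0] == "leaf":
            code[node[1]] = prefix
        else:
            stack.append((node[1], prefix + "0"))
            stack.append((node[2], prefix + "1"))
    return code


# Hypothetical frequencies, chosen only to exercise the function.
example = {"a": 45, "b": 13, "c": 12, "d": 16, "e": 9, "f": 5}
code = huffman(example)
total_bits = sum(example[s] * len(code[s]) for s in example)
print(code, total_bits)
```

Each of the $n - 1$ merges costs a constant number of heap operations, which is where the $O(n \log n)$ bound comes from. The particular bit strings depend on how ties are broken, but the codeword lengths (and hence the cost) are what matter.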
procedure Huffman(f)
Input: An array f[1..n] of frequencies
Output: An encoding tree with n leaves

let H be a priority queue of integers, ordered by f
for i = 1 to n: insert(H, i)
for k = n + 1 to 2n - 1:
    i = deletemin(H), j = deletemin(H)
    create a node numbered k with children i, j
    f[k] = f[i] + f[j]
    insert(H, k)

Returning to our toy example: can you tell if the tree of Figure 5.10 is optimal?

Entropy

The annual county horse race is bringing in three thoroughbreds who have never competed against one another. Excited, you study their past 200 races and summarize these as probability distributions over four outcomes: first ("first place"), second, third, and other.

Outcome   Aurora   Whirlwind   Phantasm
first     0.15     0.30        0.20
second    0.10     0.05        0.30
third     0.70     0.25        0.30
other     0.05     0.40        0.20

Which horse is the most predictable? One quantitative approach to this question is to look at compressibility. Write down the history of each horse as a string of 200 values (first, second, third, other). The total number of bits needed to encode these track-record strings can then be computed using Huffman's algorithm. This works out to 290 bits for Aurora, 380 for Whirlwind, and 400 for Phantasm (check it!). Aurora has the shortest encoding and is therefore in a strong sense the most predictable.

The inherent unpredictability, or randomness, of a probability distribution can be measured by the extent to which it is possible to compress data drawn from that distribution.

more compressible $\equiv$ less random $\equiv$ more predictable

Suppose there are $n$ possible outcomes, with probabilities $p_1, p_2, \ldots, p_n$. If a sequence of $m$ values is drawn from the distribution, then the $i$th outcome will pop up roughly $m p_i$ times (if $m$ is large). For simplicity, assume these are exactly the observed frequencies, and moreover that the $p_i$'s are all powers of 2 (that is, of the form $1/2^k$). It can be seen by induction (Exercise 5.19) that the number of bits needed to encode the sequence is $\sum_{i=1}^{n} m p_i \log(1/p_i)$.
Thus the average number of bits needed to encode a single draw from the distribution is

$\sum_{i=1}^{n} p_i \log \frac{1}{p_i}$.

This is the entropy of the distribution, a measure of how much randomness it contains.

For example, a fair coin has two outcomes, each with probability $1/2$. So its entropy is

$\frac{1}{2} \log 2 + \frac{1}{2} \log 2 = 1$.

This is natural enough: the coin flip contains one bit of randomness. But what if the coin is not fair, if it has a $3/4$ chance of turning up heads? Then the entropy is

$\frac{3}{4} \log \frac{4}{3} + \frac{1}{4} \log 4 = 0.81$.

A biased coin is more predictable than a fair coin, and thus has lower entropy. As the bias becomes more pronounced, the entropy drops toward zero.

We explore these notions further in Exercises 5.18 and 5.19.

5.3 Horn formulas

In order to display human-level intelligence, a computer must be able to perform at least some modicum of logical reasoning. Horn formulas are a particular framework for doing this, for expressing logical facts and deriving conclusions.

The most primitive object in a Horn formula is a Boolean variable, taking value either true or false. For instance, variables $x$, $y$, and $z$ might denote the following possibilities.

x    the murder took place in the kitchen
y    the butler is innocent
z    the colonel was asleep at 8 pm

A literal is either a variable $x$ or its negation $\overline{x}$ ("NOT $x$"). In Horn formulas, knowledge about variables is represented by two kinds of clauses:

1. Implications, whose left-hand side is an AND of any number of positive literals and whose right-hand side is a single positive literal. These express statements of the form "if the conditions on the left hold, then the one on the right must also be true." For instance, $(z \wedge w) \Rightarrow u$ might mean "if the colonel was asleep at 8 pm and the murder took place at 8 pm then the colonel is innocent." A degenerate type of implication is the singleton "$\Rightarrow x$," meaning simply that $x$ is true: "the murder definitely occurred in the kitchen."

2.
Pure negative clauses, consisting of an OR of any number of negative literals, as in $(\overline{u} \vee \overline{v} \vee \overline{y})$ ("they can't all be innocent").

Given a set of clauses of these two types, the goal is to determine whether there is a consistent explanation: an assignment of true/false values to the variables that satisfies all the clauses. This is also called a satisfying assignment.

The two kinds of clauses pull us in different directions. The implications tell us to set some of the variables to true, while the negative clauses encourage us to make them false. Our strategy for solving a Horn formula is this: We start with all variables false. We then proceed to set some of them to true, one by one, but very reluctantly, and only if we absolutely have to because an implication would otherwise be violated. Once we are done with this phase and all implications are satisfied, only then do we turn to the negative clauses and make sure they are all satisfied.

In other words, our algorithm for Horn clauses is the following greedy scheme (stingy is perhaps more descriptive):

Input: a Horn formula
Output: a satisfying assignment, if one exists

set all variables to false
while there is an implication that is not satisfied:
    set the right-hand variable of the implication to true
if all pure negative clauses are satisfied: return the assignment
else: return "formula is not satisfiable"

For instance, suppose the formula is

$(w \wedge y \wedge z) \Rightarrow x,\ (x \wedge z) \Rightarrow w,\ x \Rightarrow y,\ \Rightarrow x,\ (x \wedge y) \Rightarrow w,\ (\overline{w} \vee \overline{x} \vee \overline{y}),\ (\overline{z})$.

We start with everything false and then notice that $x$ must be true on account of the singleton implication $\Rightarrow x$. Then we see that $y$ must also be true, because of $x \Rightarrow y$. And so on.

To see why the algorithm is correct, notice that if it returns an assignment, this assignment satisfies both the implications and the negative clauses, and so it is indeed a satisfying truth assignment of the input Horn formula.
So we only have to convince ourselves that if the algorithm finds no satisfying assignment, then there really is none. This is so because our "stingy" rule maintains the following invariant:

If a certain set of variables is set to true, then they must be true in any satisfying assignment.

Hence, if the truth assignment found after the while loop does not satisfy the negative clauses, there can be no satisfying truth assignment.

Horn formulas lie at the heart of Prolog ("programming by logic"), a language in which you program by specifying desired properties of the output, using simple logical expressions. The workhorse of Prolog interpreters is our greedy satisfiability algorithm. Conveniently, it can be implemented in time linear in the length of the formula; do you see how (Exercise 5.32)?

5.4 Set cover

The dots in Figure 5.11 represent a collection of towns. This county is in its early stages of planning and is deciding where to put schools. There are only two constraints: each school should be in a town, and no one should have to travel more than 30 miles to reach one of them. What is the minimum number of schools needed?

Figure 5.11 (a) Eleven towns. (b) Towns that are within 30 miles of each other.
[Figure: the eleven towns a through k, shown (a) as dots and (b) with edges joining towns that are within 30 miles of each other.]

This is a typical set cover problem. For each town $x$, let $S_x$ be the set of towns within 30 miles of it. A school at $x$ will essentially "cover" these other towns. The question is then, how many sets $S_x$ must be picked in order to cover all the towns in the county?

SET COVER
Input: A set of elements $B$; sets $S_1, \ldots, S_m \subseteq B$
Output: A selection of the $S_i$ whose union is $B$.
Cost: Number of sets picked.

(In our example, the elements of $B$ are the towns.) This problem lends itself immediately to a greedy solution:

Repeat until all elements of B are covered:
    Pick the set $S_i$ with the largest number of uncovered elements.

This is extremely natural and intuitive.
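The greedy rule above can be sketched in a few lines of Python. This is a minimal sketch, not a tuned implementation: the function name `greedy_set_cover`, the dictionary-of-sets input format, and the small instance used to exercise it are all hypothetical and are not the towns of Figure 5.11.

```python
def greedy_set_cover(universe, sets):
    """Return the names of the sets picked by the greedy rule, in order.

    universe: iterable of elements to cover (the set B).
    sets: dict mapping a set's name to a Python set of elements.
    """
    uncovered = set(universe)
    chosen = []
    while uncovered:
        # Pick the set containing the largest number of uncovered elements.
        best = max(sets, key=lambda name: len(sets[name] & uncovered))
        if not sets[best] & uncovered:
            raise ValueError("the given sets do not cover the universe")
        chosen.append(best)
        uncovered -= sets[best]
    return chosen


# A hypothetical instance with six elements and three candidate sets.
universe = {1, 2, 3, 4, 5, 6}
sets = {
    "top": {1, 2, 3},
    "bottom": {4, 5, 6},
    "big": {1, 2, 4, 5},
}
print(greedy_set_cover(universe, sets))  # ['big', 'top', 'bottom']
```

Ties are broken here by dictionary order, which is an arbitrary choice; the rule itself does not say which of several equally good sets to take.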
Let's see what it would do on our earlier example: It would first place a school at town a, since this covers the largest number of other towns. Thereafter, it would choose three more schools (c, j, and either f or g) for a total of four. However, there exists a solution with just three schools, at b, e, and i. The greedy scheme is not optimal!

But luckily, it isn't too far from optimal.

Claim Suppose $B$ contains $n$ elements and that the optimal cover consists of $k$ sets. Then the greedy algorithm will use at most $k \ln n$ sets. (Footnote 2: "ln" means "natural logarithm," that is, to the base $e$.)

Let $n_t$ be the number of elements still not covered after $t$ iterations of the greedy algorithm (so $n_0 = n$). Since these remaining elements are covered by the optimal $k$ sets, there must be some set with at least $n_t/k$ of them. Therefore, the greedy strategy will ensure that

$n_{t+1} \le n_t - \frac{n_t}{k} = n_t \left(1 - \frac{1}{k}\right)$,

which by repeated application implies $n_t \le n_0 (1 - 1/k)^t$. A more convenient bound can be obtained from the useful inequality

$1 - x \le e^{-x}$ for all $x$, with equality if and only if $x = 0$,

which is most easily proved by a picture:

[Figure: plot of $1 - x$ and $e^{-x}$ against $x$; the two curves touch only at $x = 0$, where both equal 1.]

Thus

$n_t \le n_0 \left(1 - \frac{1}{k}\right)^t < n_0 \left(e^{-1/k}\right)^t = n e^{-t/k}$.

At $t = k \ln n$, therefore, $n_t$ is strictly less than $n e^{-\ln n} = 1$, which means no elements remain to be covered.

The ratio between the greedy algorithm's solution and the optimal solution varies from input to input but is always less than $\ln n$. And there are certain inputs for which the ratio is very close to $\ln n$ (Exercise 5.33). We call this maximum ratio the approximation factor of the greedy algorithm. There seems to be a lot of room for improvement, but in fact such hopes are unjustified: it turns out that under certain widely-held complexity assumptions (which will be clearer when we reach Chapter 8), there is provably no polynomial-time algorithm with a smaller approximation factor.

Exercises

5.1. Consider the following graph.
[Figure: a weighted undirected graph on nodes A through H.]

(a) What is the cost of its minimum spanning tree?
(b) How many minimum spanning trees does it have?
(c) Suppose Kruskal's algorithm is run on this graph. In what order are the edges added to the MST? For each edge in this sequence, give a cut that justifies its addition.

5.2. Suppose we want to find the minimum spanning tree of the following graph.

[Figure: a weighted undirected graph on nodes A through H.]

(a) Run Prim's algorithm; whenever there is a choice of nodes, always use alphabetic ordering (e.g., start from node A). Draw a table showing the intermediate values of the cost array.
(b) Run Kruskal's algorithm on the same graph. Show how the disjoint-sets data structure looks at every intermediate stage (including the structure of the directed trees), assuming path compression is used.

5.3. Design a linear-time algorithm for the following task.
Input: A connected, undirected graph $G$.
Question: Is there an edge you can remove from $G$ while still leaving $G$ connected?
Can you reduce the running time of your algorithm to $O(|V|)$?

5.4. Show that if an undirected graph with $n$ vertices has $k$ connected components, then it has at least $n - k$ edges.

5.5. Consider an undirected graph $G = (V, E)$ with nonnegative edge weights $w_e \ge 0$. Suppose that you have computed a minimum spanning tree of $G$, and that you have also computed shortest paths to all nodes from a particular node $s \in V$. Now suppose each edge weight is increased by 1: the new weights are $w'_e = w_e + 1$.
(a) Does the minimum spanning tree change? Give an example where it changes or prove it cannot change.
(b) Do the shortest paths change? Give an example where they change or prove they cannot change.

5.6. Let $G = (V, E)$ be an undirected graph. Prove that if all its edge weights are distinct, then it has a unique minimum spanning tree.

5.7. Show how to find the maximum spanning tree of a graph, that is, the spanning tree of largest total weight.

5.8.
Suppose you are given a weighted graph $G = (V, E)$ with a distinguished vertex $s$ and where all edge weights are positive and distinct. Is it possible for a tree of shortest paths from $s$ and a minimum spanning tree in $G$ to not share any edges? If so, give an example. If not, give a reason.

5.9. The following statements may or may not be correct. In each case, either prove it (if it is correct) or give a counterexample (if it isn't correct). Always assume that the graph $G = (V, E)$ is undirected. Do not assume that edge weights are distinct unless this is specifically stated.
(a) If graph $G$ has more than $|V| - 1$ edges, and there is a unique heaviest edge, then this edge cannot be part of a minimum spanning tree.
(b) If $G$ has a cycle with a unique heaviest edge $e$, then $e$ cannot be part of any MST.
(c) Let $e$ be any edge of minimum weight in $G$. Then $e$ must be part of some MST.
(d) If the lightest edge in a graph is unique, then it must be part of every MST.
(e) If $e$ is part of some MST of $G$, then it must be a lightest edge across some cut of $G$.
(f) If $G$ has a cycle with a unique lightest edge $e$, then $e$ must be part of every MST.
(g) The shortest-path tree computed by Dijkstra's algorithm is necessarily an MST.
(h) The shortest path between two nodes is necessarily part of some MST.
(i) Prim's algorithm works correctly when there are negative edges.
(j) (For any $r > 0$, define an $r$-path to be a path whose edges all have weight w(e). (b) $e \notin E'$ and $\hat{w}(e)$ w(e).

5.23. Sometimes we want light spanning trees with certain special properties. Here's an example.
Input: Undirected graph $G = (V, E)$; edge weights $w_e$; subset of vertices $U \subset V$
Output: The lightest spanning tree in which the nodes of $U$ are leaves (there might be other leaves in this tree as well).
(The answer isn't necessarily a minimum spanning tree.)
Give an algorithm for this problem which runs in $O(|E| \log |V|)$ time. (Hint: When you remove nodes $U$ from the optimal solution, what is left?)

5.24.
A binary counter of unspecified length supports two operations: increment (which increases its value by one) and reset (which sets its value back to zero). Show that, starting from an initially zero counter, any sequence of $n$ increment and reset operations takes time $O(n)$; that is, the amortized time per operation is $O(1)$.

5.25. Here's a problem that occurs in automatic program analysis. For a set of variables $x_1, \ldots, x_n$, you are given some equality constraints, of the form "$x_i = x_j$" and some disequality constraints, of the form "$x_i \ne x_j$." Is it possible to satisfy all of them?
For instance, the constraints
$x_1 = x_2,\ x_2 = x_3,\ x_3 = x_4,\ x_1 \ne x_4$
cannot be satisfied. Give an efficient algorithm that takes as input $m$ constraints over $n$ variables and decides whether the constraints can be satisfied.

5.26. Graphs with prescribed degree sequences. Given a list of $n$ positive integers $d_1, d_2, \ldots, d_n$, we want to efficiently determine whether there exists an undirected graph $G = (V, E)$ whose nodes have degrees precisely $d_1, d_2, \ldots, d_n$. That is, if $V = \{v_1, \ldots, v_n\}$, then the degree of $v_i$ should be exactly $d_i$. We call $(d_1, \ldots, d_n)$ the degree sequence of $G$. This graph $G$ should not contain self-loops (edges with both endpoints equal to the same node) or multiple edges between the same pair of nodes.
(a) Give an example of $d_1, d_2, d_3, d_4$ where all the $d_i \le 3$ and $d_1 + d_2 + d_3 + d_4$ is even, but for which no graph with degree sequence $(d_1, d_2, d_3, d_4)$ exists.
(b) Suppose that $d_1 \ge d_2 \ge \cdots \ge d_n$ and that there exists a graph $G = (V, E)$ with degree sequence $(d_1, \ldots, d_n)$. We want to show that there must exist a graph that has this degree sequence and where in addition the neighbors of $v_1$ are $v_2, v_3, \ldots, v_{d_1+1}$. The idea is to gradually transform $G$ into a graph with the desired additional property.
i. Suppose the neighbors of $v_1$ in $G$ are not $v_2, v_3, \ldots, v_{d_1+1}$. Show that there exists $i < j \le n$ and $u \in V$ such that $\{v_1, v_i\}, \{u, v_j\} \notin E$ and $\{v_1, v_j\}, \{u, v_i\} \in E$.
ii.
Specify the changes you would make to $G$ to obtain a new graph $G' = (V, E')$ with the same degree sequence as $G$ and where $(v_1, v_i) \in E'$.
iii. Now show that there must be a graph with the given degree sequence but in which $v_1$ has neighbors $v_2, v_3, \ldots, v_{d_1+1}$.
(c) Using the result from part (b), describe an algorithm that on input $d_1, \ldots, d_n$ (not necessarily sorted) decides whether there exists a graph with this degree sequence. Your algorithm should run in time polynomial in $n$ and in $m = \sum_{i=1}^{n} d_i$.

5.27. Alice wants to throw a party and is deciding whom to call. She has $n$ people to choose from, and she has made up a list of which pairs of these people know each other. She wants to pick as many people as possible, subject to two constraints: at the party, each person should have at least five other people whom they know and five other people whom they don't know.
Give an efficient algorithm that takes as input the list of $n$ people and the list of pairs who know each other and outputs the best choice of party invitees. Give the running time in terms of $n$.

5.28. A prefix-free encoding of a finite alphabet $\Gamma$ assigns each symbol in $\Gamma$ a binary codeword, such that no codeword is a prefix of another codeword.
Show that such an encoding can be represented by a full binary tree in which each leaf corresponds to a unique element of $\Gamma$, whose codeword is generated by the path from the root to that leaf (interpreting a left branch as 0 and a right branch as 1).

5.29. Ternary Huffman. Trimedia Disks Inc. has developed "ternary" hard disks. Each cell on a disk can now store values 0, 1, or 2 (instead of just 0 or 1). To take advantage of this new technology, provide a modified Huffman algorithm for compressing sequences of characters from an alphabet of size $n$, where the characters occur with known frequencies $f_1, f_2, \ldots, f_n$.
Your algorithm should encode each character with a variable-length codeword over the values 0, 1, 2 such that no codeword is a prefix of another codeword and so as to obtain the maximum possible compression. Prove that your algorithm is correct.

5.30. The basic intuition behind Huffman's algorithm, that frequent blocks should have short encodings and infrequent blocks should have long encodings, is also at work in English, where typical words like "I", "you", "is", "and", "to", "from", and so on are short, and rarely used words like "velociraptor" are longer.
However, words like "fire!", "help!", and "run!" are short not because they are frequent, but perhaps because time is precious in situations where they are used.
To make things theoretical, suppose we have a file composed of $m$ different words, with frequencies $f_1, \ldots, f_m$. Suppose also that for the $i$th word, the cost per bit of encoding is $c_i$. Thus, if we find a prefix-free code where the $i$th word has a codeword of length $l_i$, then the total cost of the encoding will be $\sum_i f_i \cdot c_i \cdot l_i$.
Show how to modify Huffman's algorithm to find the prefix-free encoding of minimum total cost.

5.31. A server has $n$ customers waiting to be served. The service time required by each customer is known in advance: it is $t_i$ minutes for customer $i$. So if, for example, the customers are served in order of increasing $i$, then the $i$th customer has to wait $\sum_{j=1}^{i} t_j$ minutes.
We wish to minimize the total waiting time
$T = \sum_{i=1}^{n} (\text{time spent waiting by customer } i)$.
Give an efficient algorithm for computing the optimal order in which to process the customers.

5.32. Show how to implement the stingy algorithm for Horn formula satisfiability (Section 5.3) in time that is linear in the length of the formula (the number of occurrences of literals in it). (Hint: Use a directed graph, with one node per variable, to represent the implications.)

5.33.
Show that for any integer $n$ that is a power of 2, there is an instance of the set cover problem (Section 5.4) with the following properties:
i. There are $n$ elements in the base set.
ii. The optimal cover uses just two sets.
iii. The greedy algorithm picks at least $\log n$ sets.
Thus the approximation ratio we derived in the chapter is tight.

Chapter 6
Dynamic programming

In the preceding chapters we have seen some elegant design principles, such as divide-and-conquer, graph exploration, and greedy choice, that yield definitive algorithms for a variety of important computational tasks. The drawback of these tools is that they can only be used on very specific types of problems. We now turn to the two sledgehammers of the algorithms craft, dynamic programming and linear programming, techniques of very broad applicability that can be invoked when more specialized methods fail. Predictably, this generality often comes with a cost in efficiency.

6.1 Shortest paths in dags, revisited

At the conclusion of our study of shortest paths (Chapter 4), we observed that the problem is especially easy in directed acyclic graphs (dags). Let's recapitulate this case, because it lies at the heart of dynamic programming.

The special distinguishing feature of a dag is that its nodes can be linearized; that is, they can be arranged on a line so that all edges go from left to right (Figure 6.1). To see why this helps with shortest paths, suppose we want to figure out distances from node S to the other nodes. For concreteness, let's focus on node D. The only way to get to it is through its predecessors, B or C; so to find the shortest path to D, we need only compare these two routes:

dist(D) = min{dist(B) + 1, dist(C) + 3}.

Figure 6.1 A dag and its linearization (topological ordering).
[Figure: a dag on nodes S, A, B, C, D, E with edge lengths, and below it the same dag with its nodes laid out in the linearized order S, C, A, B, D, E.]

A similar relation can be written for every node.
If we compute these dist values in the left-to-right order of Figure 6.1, we can always be sure that by the time we get to a node $v$, we already have all the information we need to compute dist($v$). We are therefore able to compute all distances in a single pass:

initialize all dist($\cdot$) values to $\infty$
dist(s) = 0
for each $v \in V \setminus \{s\}$, in linearized order:
    dist(v) = $\min_{(u,v) \in E} \{\text{dist}(u) + l(u, v)\}$

Notice that this algorithm is solving a collection of subproblems, $\{\text{dist}(u) : u \in V\}$. We start with the smallest of them, dist(s), since we immediately know its answer to be 0. We then proceed with progressively "larger" subproblems (distances to vertices that are further and further along in the linearization), where we are thinking of a subproblem as large if we need to have solved a lot of other subproblems before we can get to it.

This is a very general technique. At each node, we compute some function of the values of the node's predecessors. It so happens that our particular function is a minimum of sums, but we could just as well make it a maximum, in which case we would get longest paths in the dag. Or we could use a product instead of a sum inside the brackets, in which case we would end up computing the path with the smallest product of edge lengths.

Dynamic programming is a very powerful algorithmic paradigm in which a problem is solved by identifying a collection of subproblems and tackling them one by one, smallest first, using the answers to small problems to help figure out larger ones, until the whole lot of them is solved. In dynamic programming we are not given a dag; the dag is implicit. Its nodes are the subproblems we define, and its edges are the dependencies between the subproblems: if to solve subproblem B we need the answer to subproblem A, then there is a (conceptual) edge from A to B. In this case, A is thought of as a smaller subproblem than B, and it will always be smaller, in an obvious sense.

But it's time we saw an example.
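The single-pass computation of this section can be sketched in Python as follows. This is a minimal sketch under stated assumptions: the function name `dag_shortest_paths` and the input format are illustrative, and the caller is assumed to supply the nodes already in linearized order. The edge lengths in the usage example are hypothetical; they were chosen only to agree with the relation dist(D) = min{dist(B) + 1, dist(C) + 3} and with the order S, C, A, B, D, E, and should not be read as a transcription of Figure 6.1.

```python
import math


def dag_shortest_paths(order, edges, source):
    """Compute shortest-path distances from source in a dag.

    order: list of nodes in linearized (topological) order.
    edges: dict mapping (u, v) to the length l(u, v).
    """
    incoming = {v: [] for v in order}
    for (u, v), length in edges.items():
        incoming[v].append((u, length))

    dist = {v: math.inf for v in order}
    dist[source] = 0
    for v in order:
        if v == source:
            continue
        # Every predecessor u of v comes earlier in the order, so dist[u]
        # is already final when we reach v.
        dist[v] = min(
            (dist[u] + length for u, length in incoming[v]),
            default=math.inf,
        )
    return dist


# Hypothetical edge lengths on the nodes S, A, B, C, D, E.
order = ["S", "C", "A", "B", "D", "E"]
edges = {
    ("S", "A"): 1, ("S", "C"): 2, ("C", "A"): 4, ("A", "B"): 6,
    ("B", "D"): 1, ("C", "D"): 3, ("B", "E"): 2, ("D", "E"): 1,
}
print(dag_shortest_paths(order, edges, "S"))
# {'S': 0, 'C': 2, 'A': 1, 'B': 7, 'D': 5, 'E': 6}
```

Replacing `min` by `max` (and the initial value by negative infinity) would give longest paths instead, as noted in the text.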
6.2 Longest increasing subsequences

In the longest increasing subsequence problem, the input is a sequence of numbers $a_1, \ldots, a_n$. A subsequence is any subset of these numbers taken in order, of the form $a_{i_1}, a_{i_2}, \ldots, a_{i_k}$ where $1 \le i_1 < i_2 < \cdots < i_k \le n$, and an increasing subsequence is one in which the numbers are getting strictly larger. The task is to find the increasing subsequence of greatest length. For instance, the longest increasing subsequence of 5, 2, 8, 6, 3, 6, 9, 7 is 2, 3, 6, 9:

[Figure: the sequence 5 2 8 6 3 6 9 7, with arrows linking the elements 2, 3, 6, 9.]

In this example, the arrows denote transitions between consecutive elements of the optimal solution. More generally, to better understand the solution space, let's create a graph of all permissible transitions: establish a node $i$ for each element $a_i$, and add directed edges $(i, j)$ whenever it is possible for $a_i$ and $a_j$ to be consecutive elements in an increasing subsequence, that is, whenever iw: K(w, j) = K(w, j - 1)
else: K(w, j) = max{K(w, j - 1), K(w - $w_j$, j - 1) + $v_j$}
return K(W, n)

Figure 6.2 The dag of increasing subsequences.
[Figure: nodes 5, 2, 8, 6, 3, 6, 9, 7 joined by directed edges.]

Memoization

In dynamic programming, we write out a recursive formula that expresses large problems in terms of smaller ones and then use it to fill out a table of solution values in a bottom-up manner, from smallest subproblem to largest.

The formula also suggests a recursive algorithm, but we saw earlier that naive recursion can be terribly inefficient, because it solves the same subproblems over and over again. What about a more intelligent recursive implementation, one that remembers its previous invocations and thereby avoids repeating them?

On the knapsack problem (with repetitions), such an algorithm would use a hash table (recall Section 1.5) to store the values of K($\cdot$) that had already been computed.
At each recursive call requesting some K(w), the algorithm would first check if the answer was already in the table and then would proceed to its calculation only if it wasn't. This trick is called memoization:

A hash table, initially empty, holds values of K(w) indexed by w

function knapsack(w)
if w is in hash table: return K(w)
K(w) = max{knapsack(w - $w_i$) + $v_i$ : $w_i \le w$}
insert K(w) into hash table, with key w
return K(w)

Since this algorithm never repeats a subproblem, its running time is $O(nW)$, just like the dynamic program. However, the constant factor in this big-O notation is substantially larger because of the overhead of recursion.

In some cases, though, memoization pays off. Here's why: dynamic programming automatically solves every subproblem that could conceivably be needed, while memoization only ends up solving the ones that are actually used. For instance, suppose that $W$ and all the weights $w_i$ are multiples of 100. Then a subproblem K(w) is useless if 100 does not divide $w$. The memoized recursive algorithm will never look at these extraneous table entries.

Figure 6.6 $A \times B \times C \times D = (A \times (B \times C)) \times D$.
[Figure: (a) the four matrices A (50 × 20), B (20 × 1), C (1 × 10), D (10 × 100); (b) A (50 × 20), B × C (20 × 10), D (10 × 100); (c) A × (B × C) (50 × 10), D (10 × 100); (d) (A × (B × C)) × D (50 × 100).]

6.5 Chain matrix multiplication

Suppose that we want to multiply four matrices, $A \times B \times C \times D$, of dimensions $50 \times 20$, $20 \times 1$, $1 \times 10$, and $10 \times 100$, respectively (Figure 6.6). This will involve iteratively multiplying two matrices at a time. Matrix multiplication is not commutative (in general, $A \times B \ne B \times A$), but it is associative, which means for instance that $A \times (B \times C) = (A \times B) \times C$. Thus we can compute our product of four matrices in many different ways, depending on how we parenthesize it. Are some of these better than others?

Multiplying an $m \times n$ matrix by an $n \times p$ matrix takes $mnp$ multiplications, to a good enough approximation.
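The $mnp$ rule can be applied mechanically to any parenthesization. The sketch below does so for the four matrices introduced above; the function name `product_cost` and the nested-pair encoding of a parenthesization are assumptions made for illustration only.

```python
def product_cost(expr, dims):
    """Return (rows, cols, cost) for a fully parenthesized matrix product.

    expr is either a matrix name or a pair (left, right) of sub-expressions.
    dims maps each matrix name to its (rows, cols).
    cost counts scalar multiplications, charging m*n*p for multiplying an
    m-by-n matrix by an n-by-p matrix.
    """
    if isinstance(expr, str):
        rows, cols = dims[expr]
        return rows, cols, 0
    left, right = expr
    m, n, left_cost = product_cost(left, dims)
    n2, p, right_cost = product_cost(right, dims)
    if n != n2:
        raise ValueError("inner dimensions do not match")
    return m, p, left_cost + right_cost + m * n * p


# Dimensions of A, B, C, D as given in the text.
dims = {"A": (50, 20), "B": (20, 1), "C": (1, 10), "D": (10, 100)}

# ((A x B) x C) x D: 50*20*1 + 50*1*10 + 50*10*100 multiplications.
print(product_cost(((("A", "B"), "C"), "D"), dims))  # (50, 100, 51500)
```

Other parenthesizations are evaluated by changing only the nesting of the pairs, which makes it easy to compare orders by hand before looking for the optimal one.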
Using this formula, let's compare several different ways of evaluating $A \times B \times C \times D$:

Parenthesization                     Cost computation                                               Cost
$A \times ((B \times C) \times D)$   $20 \cdot 1 \cdot 10 + 20 \cdot 10 \cdot 100 + 50 \cdot 20 \cdot 100$   120,200
$(A \times (B \times C)) \times D$   $20 \cdot 1 \cdot 10 + 50 \cdot 20 \cdot 10 + 50 \cdot 10 \cdot 100$    60,200
$(A \times B) \times (C \times D)$   $50 \cdot 20 \cdot 1 + 1 \cdot 10 \cdot 100 + 50 \cdot 1 \cdot 100$     7,000

As you can see, the order of multiplications makes a big difference in the final running time! Moreover, the natural greedy approach, to always perform the cheapest matrix multiplication available, leads to the second parenthesization shown here and is therefore a failure.

How do we determine the optimal order, if we want to compute $A_1 \times A_2 \times \cdots \times A_n$, where the $A_i$'s are matrices with dimensions $m_0 \times m_1, m_1 \times m_2, \ldots, m_{n-1} \times m_n$, respectively? The first thing to notice is that a particular parenthesization can be represented very naturally by a binary tree in which the individual matrices correspond to the leaves, the root is the final product, and interior nodes are intermediate products (Figure 6.7). The possible orders in which to do the multiplication correspond to the various full binary trees with $n$ leaves, whose number is exponential in $n$ (Exercise 2.13). We certainly cannot try each tree, and with brute force thus ruled out, we turn to dynamic programming.

Figure 6.7 (a) $((A \times B) \times C) \times D$; (b) $A \times ((B \times C) \times D)$; (c) $(A \times (B \times C)) \times D$.
[Figure: three binary trees, labeled (a), (b), (c), whose leaves are the matrices A, B, C, D.]

The binary trees of Figure 6.7 are suggestive: for a tree to be optimal, its subtrees must also be optimal. What are the subproblems corresponding to the subtrees? They are products of the form $A_i \times A_{i+1} \times \cdots \times A_j$. Let's see if this works: for $1 \le i \le j \le n$, define

$C(i, j)$ = minimum cost of multiplying $A_i \times A_{i+1} \times \cdots \times A_j$.

The size of this subproblem is the number of matrix multiplications, $|j - i|$. The smallest subproblem is when $i = j$, in which case there's nothing to multiply, so $C(i, i) = 0$. For $j > i$, consider the optimal subtree for $C(i, j)$.
The first branch in this subtree, the one at the top, will split the product in two pieces, of the form $A_i \times \cdots \times A_k$ and $A_{k+1} \times \cdots \times A_j$, for some $k$ between $i$ and $j$. The cost of the subtree is then the cost of these two partial products, plus the cost of combining them: $C(i, k) + C(k+1, j) + m_{i-1} \cdot m_k \cdot m_j$. And we just need to find the splitting point $k$ for which this is smallest:

C(i, j) = min i k 1, we define $C(S, 1) = \infty$ since the path cannot both start and end at 1.

Now, let's express $C(S, j)$ in terms of smaller subproblems. We need to start at 1 and end at $j$; what should we pick as the second-to-last city? It has to be some $i \in S$, so the overall path length is the distance from 1 to $i$, namely, $C(S - \{j\}, i)$, plus the length of the final edge, $d_{ij}$. We must pick the best such $i$:

$C(S, j) = \min_{i \in S : i \ne j} C(S - \{j\}, i) + d_{ij}$.

The subproblems are ordered by $|S|$. Here's the code.

$C(\{1\}, 1) = 0$
for s = 2 to n:
    for all subsets $S \subseteq \{1, 2, \ldots, n\}$ of size s and containing 1:
        $C(S, 1) = \infty$
        for all $j \in S$, $j \ne 1$:
            $C(S, j) = \min\{C(S - \{j\}, i) + d_{ij} : i \in S, i \ne j\}$
return $\min_j C(\{1, \ldots, n\}, j) + d_{j1}$

There are at most $2^n \cdot n$ subproblems, and each one takes linear time to solve. The total running time is therefore $O(n^2 2^n)$.

6.7 Independent sets in trees

A subset of nodes $S \subseteq V$ is an independent set of graph $G = (V, E)$ if there are no edges between them. For instance, in Figure 6.10 the nodes $\{1, 5\}$ form an independent set, but nodes $\{1, 4, 5\}$ do not, because of the edge between 4 and 5. The largest independent set is $\{2, 3, 6\}$.

Like several other problems we have seen in this chapter (knapsack, traveling salesman), finding the largest independent set in a graph is believed to be intractable. However, when the graph happens to be a tree, the problem can be solved in linear time, using dynamic programming. And what are the appropriate subproblems? Already in the chain matrix multiplication problem we noticed that the layered structure of a tree provides a natural definition of a subproblem, as long as one node of the tree has been identified as a root.
So here's the algorithm: Start by rooting the tree at any node $r$. Now, each node defines a subtree, the one hanging from it. This immediately suggests subproblems:

$I(u)$ = size of largest independent set of subtree hanging from $u$.

Our final goal is $I(r)$.

On time and memory

The amount of time it takes to run a dynamic programming algorithm is easy to discern from the dag of subproblems: in many cases it is just the total number of edges in the dag! All we are really doing is visiting the nodes in linearized order, examining each node's inedges, and, most often, doing a constant amount of work per edge. By the end, each edge of the dag has been examined once.

But how much computer memory is required? There is no simple parameter of the dag characterizing this. It is certainly possible to do the job with an amount of memory proportional to the number of vertices (subproblems), but we can usually get away with much less. The reason is that the value of a particular subproblem only needs to be remembered until the larger subproblems depending on it have been solved. Thereafter, the memory it takes up can be released for reuse.

For example, in the Floyd-Warshall algorithm the value of $\mathrm{dist}(i,j,k)$ is not needed once the $\mathrm{dist}(\cdot,\cdot,k+1)$ values have been computed. Therefore, we only need two $|V| \times |V|$ arrays to store the dist values, one for odd values of $k$ and one for even values: when computing $\mathrm{dist}(i,j,k)$, we overwrite $\mathrm{dist}(i,j,k-2)$.

(And let us not forget that, as always in dynamic programming, we also need one more array, $\mathrm{prev}(i,j)$, storing the next to last vertex in the current shortest path from $i$ to $j$, a value that must be updated with $\mathrm{dist}(i,j,k)$. We omit this mundane but crucial bookkeeping step from our dynamic programming algorithms.)

Can you see why the edit distance dag in Figure 6.5 only needs memory proportional to the length of the shorter string?
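The memory remark about Floyd-Warshall can be made concrete. What follows is a minimal sketch, not the book's own code: the function name, the 0-based vertex numbering, and the edge-dictionary input format are assumptions made for illustration, and the diagonal is initialized to zero as a convenience. It keeps only two $|V| \times |V|$ tables, selected by the parity of $k$, so that filling in level $k$ overwrites level $k-2$. As in the text, the `prev` bookkeeping is omitted.

```python
import math


def floyd_warshall_two_arrays(n, length):
    """All-pairs shortest-path lengths on vertices 0..n-1.

    length: dict mapping a directed edge (i, j) to its length
            (a hypothetical input format chosen for this sketch).
    """
    # Two n-by-n tables; level k of the recurrence lives in tables[k % 2].
    tables = [[[math.inf] * n for _ in range(n)] for _ in range(2)]

    # Level 0: no intermediate vertices allowed, only direct edges.
    level0 = tables[0]
    for i in range(n):
        level0[i][i] = 0
    for (i, j), l in length.items():
        level0[i][j] = min(level0[i][j], l)

    for k in range(1, n + 1):
        prev, cur = tables[(k - 1) % 2], tables[k % 2]
        mid = k - 1  # 0-based name of the k-th vertex, newly allowed as intermediate
        for i in range(n):
            for j in range(n):
                # cur held level k-2 until now; it is no longer needed.
                cur[i][j] = min(prev[i][j], prev[i][mid] + prev[mid][j])

    return tables[n % 2]


if __name__ == "__main__":
    edges = {(0, 1): 4, (1, 2): 1, (0, 2): 7, (2, 0): 2}
    for row in floyd_warshall_two_arrays(3, edges):
        print(row)
```

Each entry of level $k$ reads only from level $k-1$, which is why the table two levels back can be reused safely.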
Dynamic programming proceeds as always from smaller subproblems to larger ones, that is to say, bottom-up in the rooted tree. Suppose we know the largest independent sets for all subtrees below a certain node $u$; in other words, suppose we know $I(w)$ for all descendants $w$ of $u$. How can we compute $I(u)$? Let's split the computation into two cases: any independent set either includes $u$ or it doesn't (Figure 6.11).

$$I(u) = \max\left\{ 1 + \sum_{\text{grandchildren } w \text{ of } u} I(w),\ \sum_{\text{children } w \text{ of } u} I(w) \right\}.$$

If the independent set includes $u$, then we get one point for it, but we aren't allowed to include the children of $u$; therefore we move on to the grandchildren. This is the first case in the formula. On the other hand, if we don't include $u$, then we don't get a point for it, but we can move on to its children.

The number of subproblems is exactly the number of vertices. With a little care, the running time can be made linear, $O(|V| + |E|)$.

Figure 6.10 The largest independent set in this graph has size 3.

Figure 6.11 $I(u)$ is the size of the largest independent set of the subtree rooted at $u$. Two cases: either $u$ is in this independent set, or it isn't.

Exercises

6.1. A contiguous subsequence of a list $S$ is a subsequence made up of consecutive elements of $S$. For instance, if $S$ is
$5, 15, -30, 10, -5, 40, 10$,
then $15, -30, 10$ is a contiguous subsequence but $5, 15, 40$ is not. Give a linear-time algorithm for the following task:
Input: A list of numbers, $a_1, a_2, \ldots, a_n$.
Output: The contiguous subsequence of maximum sum (a subsequence of length zero has sum zero).
For the preceding example, the answer would be $10, -5, 40, 10$, with a sum of 55.
(Hint: For each $j \in \{1, 2, \ldots, n\}$, consider contiguous subsequences ending exactly at position $j$.)

6.2. You are going on a long trip. You start on the road at mile post 0. Along the way there are $n$ hotels, at mile posts $a_1$ [...] $0$ and $i = 1, 2, \ldots, n$.
Any two restaurants should be at least $k$ miles apart, where $k$ is a positive integer.
Give an efficient algorithm to compute the maximum expected total profit subject to the given constraints.

6.4. You are given a string of $n$ characters $s[1 \ldots n]$, which you believe to be a corrupted text document in which all punctuation has vanished (so that it looks something like "itwasthebestoftimes..."). You wish to reconstruct the document using a dictionary, which is available in the form of a Boolean function $\mathrm{dict}(\cdot)$: for any string $w$,

$$\mathrm{dict}(w) = \begin{cases} \text{true} & \text{if } w \text{ is a valid word} \\ \text{false} & \text{otherwise.} \end{cases}$$

(a) Give a dynamic programming algorithm that determines whether the string $s[\cdot]$ can be reconstituted as a sequence of valid words. The running time should be at most $O(n^2)$, assuming calls to dict take unit time.
(b) In the event that the string is valid, make your algorithm output the corresponding sequence of words.

6.5. Pebbling a checkerboard. We are given a checkerboard which has 4 rows and $n$ columns, and has an integer written in each square. We are also given a set of $2n$ pebbles, and we want to place some or all of these on the checkerboard (each pebble can be placed on exactly one square) so as to maximize the sum of the integers in the squares that are covered by pebbles. There is one constraint: for a placement of pebbles to be legal, no two of them can be on horizontally or vertically adjacent squares (diagonal adjacency is fine).
(a) Determine the number of legal patterns that can occur in any column (in isolation, ignoring the pebbles in adjacent columns) and describe these patterns.
Call two patterns compatible if they can be placed on adjacent columns to form a legal placement. Let us consider subproblems consisting of the first $k$ columns, $1 \le k \le n$. Each subproblem can be assigned a type, which is the pattern occurring in the last column.
(b) Using the notions of compatibility and type, give an $O(n)$-time dynamic programming algorithm for computing an optimal placement.

6.6.
Let us define a multiplication operation on three symbols $a, b, c$ according to the following table; thus $ab = b$, $ba = c$, and so on. Notice that the multiplication operation defined by the table is neither associative nor commutative.

        a   b   c
    a   b   b   a
    b   c   b   a
    c   a   c   c

Find an efficient algorithm that examines a string of these symbols, say $bbbbac$, and decides whether or not it is possible to parenthesize the string in such a way that the value of the resulting expression is $a$. For example, on input $bbbbac$ your algorithm should return yes because $((b(bb))(ba))c = a$.

6.7. A subsequence is palindromic if it is the same whether read left to right or right to left. For instance, the sequence
$A, C, G, T, G, T, C, A, A, A, A, T, C, G$
has many palindromic subsequences, including $A, C, G, C, A$ and $A, A, A, A$ (on the other hand, the subsequence $A, C, T$ is not palindromic). Devise an algorithm that takes a sequence $x[1 \ldots n]$ and returns the (length of the) longest palindromic subsequence. Its running time should be $O(n^2)$.

6.8. Given two strings $x = x_1 x_2 \cdots x_n$ and $y = y_1 y_2 \cdots y_m$, we wish to find the length of their longest common substring, that is, the largest $k$ for which there are indices $i$ and $j$ with $x_i x_{i+1} \cdots x_{i+k-1} = y_j y_{j+1} \cdots y_{j+k-1}$. Show how to do this in time $O(mn)$.

6.9. A certain string-processing language offers a primitive operation which splits a string into two pieces. Since this operation involves copying the original string, it takes $n$ units of time for a string of length $n$, regardless of the location of the cut. Suppose, now, that you want to break a string into many pieces. The order in which the breaks are made can affect the total running time. For example, if you want to cut a 20-character string at positions 3 and 10, then making the first cut at position 3 incurs a total cost of $20 + 17 = 37$, while doing position 10 first has a better cost of $20 + 10 = 30$.
Give a dynamic programming algorithm that, given the locations of $m$ cuts in a string of length $n$, finds the minimum cost of breaking the string into $m + 1$ pieces.

6.10. Counting heads. Given integers $n$ and $k$, along with $p_1, \ldots, p_n \in [0,1]$, you want to determine the probability of obtaining exactly $k$ heads when $n$ biased coins are tossed independently at random, where $p_i$ is the probability that the $i$th coin comes up heads. Give an $O(n^2)$ algorithm for this task.² Assume you can multiply and add two numbers in $[0,1]$ in $O(1)$ time.

6.11. Given two strings $x = x_1 x_2 \cdots x_n$ and $y = y_1 y_2 \cdots y_m$, we wish to find the length of their longest common subsequence, that is, the largest $k$ for which there are indices $i_1 < i_2 < \cdots < i_k$ and $j_1$ [...] $0$ on the edges. We would like to send as much oil as possible from $s$ to $t$ without exceeding the capacities of any of the edges.

A particular shipping scheme is called a flow and consists of a variable $f_e$ for each edge $e$ of the network, satisfying the following two properties:

1. It doesn't violate edge capacities: $0 \le f_e \le c_e$ for all $e \in E$.
2. For all nodes $u$ except $s$ and $t$, the amount of flow entering $u$ equals the amount leaving $u$:
$$\sum_{(w,u) \in E} f_{wu} = \sum_{(u,z) \in E} f_{uz}.$$
In other words, flow is conserved.

The size of a flow is the total quantity sent from $s$ to $t$ and, by the conservation principle, is equal to the quantity leaving $s$:

$$\mathrm{size}(f) = \sum_{(s,u) \in E} f_{su}.$$

In short, our goal is to assign values to $\{f_e : e \in E\}$ that will satisfy a set of linear constraints and maximize a linear objective function. But this is a linear program! The maximum-flow problem reduces to linear programming.

For example, for the network of Figure 7.4 the LP has 11 variables, one per edge. It seeks to maximize $f_{sa} + f_{sb} + f_{sc}$ subject to a total of 27 constraints: 11 for nonnegativity (such as $f_{sa} \ge 0$), 11 for capacity (such as $f_{sa} \le 3$), and 5 for flow conservation (one for each node of the graph other than $s$ and $t$, such as $f_{sc} + f_{dc} = f_{ce}$).
Simplex would take no time at all to correctly solve the problem and to confirm that, in our example, a flow of 7 is in fact optimal.

Figure 7.5 An illustration of the max-flow algorithm. (a) A toy network. (b) The first path chosen. (c) The second path chosen. (d) The final flow. (e) We could have chosen this path first. (f) In which case, we would have to allow this second path.

7.2.3 A closer look at the algorithm

All we know so far of the simplex algorithm is the vague geometric intuition that it keeps making local moves on the surface of a convex feasible region, successively improving the objective function until it finally reaches the optimal solution. Once we have studied it in more detail (Section 7.6), we will be in a position to understand exactly how it handles flow LPs, which is useful as a source of inspiration for designing direct max-flow algorithms.

It turns out that in fact the behavior of simplex has an elementary interpretation:

    Start with zero flow.
    Repeat: choose an appropriate path from s to t, and increase flow along the edges of this path as much as possible.

Figure 7.5(a)-(d) shows a small example in which simplex halts after two iterations. The final flow has size 2, which is easily seen to be optimal.

There is just one complication. What if we had initially chosen a different path, the one in Figure 7.5(e)? This gives only one unit of flow and yet seems to block all other paths. Simplex gets around this problem by also allowing paths to cancel existing flow. In this particular case, it would subsequently choose the path of Figure 7.5(f). Edge $(b,a)$ of this path isn't in the original network and has the effect of canceling flow previously assigned to edge $(a,b)$.

To summarize, in each iteration simplex looks for an $s$-$t$ path whose edges $(u,v)$ can be of two types:

1. $(u,v)$ is in the original network, and is not yet at full capacity.
2. The reverse edge $(v,u)$ is in the original network, and there is some flow along it.

If the current flow is $f$, then in the first case, edge $(u,v)$ can handle up to $c_{uv} - f_{uv}$ additional units of flow, and in the second case, up to $f_{vu}$ additional units (canceling all or part of the existing flow on $(v,u)$). These flow-increasing opportunities can be captured in a residual network $G^f = (V, E^f)$, which has exactly the two types of edges listed, with residual capacities $c^f$:

$$c^f_{uv} = \begin{cases} c_{uv} - f_{uv} & \text{if } (u,v) \in E \text{ and } f_{uv} < c_{uv} \\ f_{vu} & \text{if } (v,u) \in E \text{ and } f_{vu} > 0. \end{cases}$$

Thus we can equivalently think of simplex as choosing an $s$-$t$ path in the residual network.

By simulating the behavior of simplex, we get a direct algorithm for solving max-flow. It proceeds in iterations, each time explicitly constructing $G^f$, finding a suitable $s$-$t$ path in $G^f$ by using, say, a linear-time breadth-first search, and halting if there is no longer any such path along which flow can be increased.

Figure 7.6 illustrates the algorithm on our oil example.

7.2.4 A certificate of optimality

Now for a truly remarkable fact: not only does simplex correctly compute a maximum flow, but it also generates a short proof of the optimality of this flow!

Let's see an example of what this means. Partition the nodes of the oil network (Figure 7.4) into two groups, $L = \{s, a, b\}$ and $R = \{c, d, e, t\}$.

[Figure: the network of Figure 7.4 with the two groups $L$ and $R$ marked.]

Any oil transmitted must pass from $L$ to $R$. Therefore, no flow can possibly exceed the total capacity of the edges from $L$ to $R$, which is 7. But this means that the flow we found earlier, of size 7, must be optimal!

More generally, an $(s,t)$-cut partitions the vertices into two disjoint groups $L$ and $R$ such that $s$ is in $L$ and $t$ is in $R$. Its capacity is the total capacity of the edges from $L$ to $R$, and as argued previously, is an upper bound on any flow:

    Pick any flow $f$ and any $(s,t)$-cut $(L,R)$. Then $\mathrm{size}(f) \le \mathrm{capacity}(L,R)$.
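The iteration of Section 7.2.3 and the cut bound just stated can be sketched in a few lines. This is a minimal illustration, not the authors' code. The function names and the dictionary representation are assumptions, and so is the edge list `OIL`: it is a reconstruction of Figure 7.4 chosen to agree with the facts quoted in the text (11 edges, the conservation constraint $f_{sc} + f_{dc} = f_{ce}$, and a capacity of 7 for the cut $L = \{s,a,b\}$), not a copy of the figure. `TOY` is the five-edge network of Figure 7.5 with unit capacities.

```python
from collections import deque


def max_flow(capacity, s, t):
    """Size of a maximum flow from s to t (assumes s != t).

    capacity: dict mapping a directed edge (u, v) to its capacity.
    """
    # residual[u][v] is the residual capacity of (u, v). With zero flow it is
    # c_uv on original edges and 0 on their reverses. (Assumes the network has
    # no pair of antiparallel edges, as in both examples below.)
    residual = {s: {}, t: {}}
    for (u, v), c in capacity.items():
        residual.setdefault(u, {}).setdefault(v, 0)
        residual.setdefault(v, {}).setdefault(u, 0)
        residual[u][v] += c

    size = 0
    while True:
        # Breadth-first search for an s-t path in the residual network.
        parent = {s: None}
        queue = deque([s])
        while queue and t not in parent:
            u = queue.popleft()
            for v, r in residual[u].items():
                if r > 0 and v not in parent:
                    parent[v] = u
                    queue.append(v)
        if t not in parent:
            return size  # no path left along which flow can be increased

        # Recover the path and push as much flow as its tightest edge allows.
        path = []
        v = t
        while parent[v] is not None:
            path.append((parent[v], v))
            v = parent[v]
        push = min(residual[u][v] for u, v in path)
        for u, v in path:
            residual[u][v] -= push  # uses capacity, or cancels reverse flow
            residual[v][u] += push
        size += push


def cut_capacity(capacity, L):
    """Total capacity of the edges leaving the vertex set L."""
    return sum(c for (u, v), c in capacity.items() if u in L and v not in L)


TOY = {("s", "a"): 1, ("s", "b"): 1, ("a", "b"): 1, ("a", "t"): 1, ("b", "t"): 1}

# Assumed reconstruction of the oil network of Figure 7.4 (see lead-in).
OIL = {
    ("s", "a"): 3, ("s", "b"): 3, ("s", "c"): 4, ("b", "a"): 10,
    ("a", "d"): 2, ("b", "d"): 1, ("d", "c"): 1, ("c", "e"): 5,
    ("d", "e"): 1, ("d", "t"): 2, ("e", "t"): 5,
}

if __name__ == "__main__":
    print(max_flow(TOY, "s", "t"))
    print(max_flow(OIL, "s", "t"), cut_capacity(OIL, {"s", "a", "b"}))
```

Folding both edge types into a single `residual` table means that pushing flow along a reverse edge is the same bookkeeping as pushing along a forward one, which is the cancellation step described for Figure 7.5(f).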
Some cuts are large and give loose upper bounds: cut $(\{s,b,c\}, \{a,d,e,t\})$ has a capacity of 19. But there is also a cut of capacity 7, which is effectively a certificate of optimality of the maximum flow. This isn't just a lucky property of our oil network; such a cut always exists.

Max-flow min-cut theorem
The size of the maximum flow in a network equals the capacity of the smallest $(s,t)$-cut.

Moreover, our algorithm automatically finds this cut as a by-product!

Let's see why this is true. Suppose $f$ is the final flow when the algorithm terminates. We know that node $t$ is no longer reachable from $s$ in the residual network $G^f$. Let $L$ be the nodes that are reachable from $s$ in $G^f$, and let $R = V - L$ be the rest of the nodes. Then $(L,R)$ is a cut in the graph $G$.

[Figure: the cut $(L,R)$, with $s$ in $L$ and $t$ in $R$, an edge $e$ from $L$ to $R$, and an edge $e'$ from $R$ to $L$.]

We claim that

$$\mathrm{size}(f) = \mathrm{capacity}(L,R).$$

To see this, observe that by the way $L$ is defined, any edge going from $L$ to $R$ must be at full capacity (in the current flow $f$), and any edge from $R$ to $L$ must have zero flow. (So, in the figure, $f_e = c_e$ and $f_{e'} = 0$.) Therefore the net flow across $(L,R)$ is exactly the capacity of the cut.

7.2.5 Efficiency

Each iteration of our maximum-flow algorithm is efficient, requiring $O(|E|)$ time if a depth-first or breadth-first search is used to find an $s$-$t$ path. But how many iterations are there?

Suppose all edges in the original network have integer capacities $\le C$. Then an inductive argument shows that on each iteration of the algorithm, the flow is always an integer and increases by an integer amount. Therefore, since the maximum flow is at most $C|E|$ (why?), it follows that the number of iterations is at most this much. But this is hardly a reassuring bound: what if $C$ is in the millions?

We examine this issue further in Exercise 7.31. It turns out that it is indeed possible to construct bad examples in which the number of iterations is proportional to $C$, if $s$-$t$ paths are not carefully chosen.
However, if paths are chosen in a sensible manner (in particular, by using a breadth-first search, which finds the path with the fewest edges), then the number of iterations is at most $O(|V| \cdot |E|)$, no matter what the capacities are. This latter bound gives an overall running time of $O(|V| \cdot |E|^2)$ for maximum flow.

Figure 7.6 The max-flow algorithm applied to the network of Figure 7.4. At each iteration, the current flow is shown on the left and the residual network on the right. The paths chosen are shown in bold. [Panels (a) through (f), each showing the current flow and the residual graph.]

Figure 7.7 An edge between two people means they like each other. Is it possible to pair everyone up happily? [Boys: Al, Bob, Chet, Dan. Girls: Alice, Beatrice, Carol, Danielle.]

7.3 Bipartite matching

Figure 7.7 shows a graph with four nodes on the left representing boys and four nodes on the right representing girls.¹ There is an edge between a boy and girl if they like each other (for instance, Al likes all the girls). Is it possible to choose couples so that everyone has exactly one partner, and it is someone they like? In graph-theoretic jargon, is there a perfect matching?

This matchmaking game can be reduced to the maximum-flow problem, and thereby to linear programming!
Create a new source node, $s$, with outgoing edges to all the boys; a new sink node, $t$, with incoming edges from all the girls; and direct all the edges in the original bipartite graph from boy to girl (Figure 7.8). Finally, give every edge a capacity of 1. Then there is a perfect matching if and only if this network has a flow whose size equals the number of couples. Can you find such a flow in the example?

Actually, the situation is slightly more complicated than just stated: what is easy to see is that the optimum integer-valued flow corresponds to the optimum matching. We would be at a bit of a loss interpreting a flow that ships 0.7 units along the edge Al-Carol, for instance!

¹ This kind of graph, in which the nodes can be partitioned into two groups such that all edges are between the groups, is called bipartite.

Figure 7.8 A matchmaking network. Each edge has a capacity of one.

Fortunately, the maximum-flow problem has the following property: if all edge capacities are integers, then the optimal flow found by our algorithm is integral. We can see this directly from the algorithm, which in such cases would increment the flow by an integer amount on each iteration.

Hence integrality comes for free in the maximum-flow problem. Unfortunately, this is the exception rather than the rule: as we will see in Chapter 8, it is a very difficult problem to find the optimum solution (or for that matter, any solution) of a general linear program, if we also demand that the variables be integers.

7.4 Duality

We have seen that in networks, flows are smaller than cuts, but the maximum flow and minimum cut exactly coincide and each is therefore a certificate of the other's optimality. Remarkable as this phenomenon is, we now generalize it from maximum flow to any problem that can be solved by linear programming!
It turns out that every linear maximization problem has a dual minimization problem, and they relate to each other in much the same way as flows and cuts.

To understand what duality is about, recall our introductory LP with the two types of chocolate:

$$\begin{aligned} \max\ & x_1 + 6x_2 \\ & x_1 \le 200 \\ & x_2 \le 300 \\ & x_1 + x_2 \le 400 \\ & x_1, x_2 \ge 0 \end{aligned}$$

Simplex declares the optimum solution to be $(x_1, x_2) = (100, 300)$, with objective value 1900. Can this answer be checked somehow? Let's see: suppose we take the first inequality and add it to six times the second inequality. We get

$$x_1 + 6x_2 \le 2000.$$

This is interesting, because it tells us that it is impossible to achieve a profit of more than 2000. Can we add together some other combination of the LP constraints and bring this upper bound even closer to 1900? After a little experimentation, we find that multiplying the three inequalities by 0, 5, and 1, respectively, and adding them up yields

$$x_1 + 6x_2 \le 1900.$$

So 1900 must indeed be the best possible value! The multipliers $(0, 5, 1)$ magically constitute a certificate of optimality! It is remarkable that such a certificate exists for this LP. And even if we knew there were one, how would we systematically go about finding it?

Let's investigate the issue by describing what we expect of these three multipliers, call them $y_1, y_2, y_3$.

    Multiplier    Inequality
    $y_1$         $x_1 \le 200$
    $y_2$         $x_2 \le 300$
    $y_3$         $x_1 + x_2 \le 400$

To start with, these $y_i$'s must be nonnegative, for otherwise they are unqualified to multiply inequalities (multiplying an inequality by a negative number would flip the $\le$ to $\ge$). After the multiplication and addition steps, we get the bound:

$$(y_1 + y_3)x_1 + (y_2 + y_3)x_2 \le 200y_1 + 300y_2 + 400y_3.$$

We want the left-hand side to look like our objective function $x_1 + 6x_2$ so that the right-hand side is an upper bound on the optimum solution. For this we need $y_1 + y_3$ to be 1 and $y_2 + y_3$ to be 6. Come to think of it, it would be fine if $y_1 + y_3$ were larger than 1; the resulting certificate would be all the more convincing.
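The arithmetic behind these certificates is easy to check mechanically. The sketch below is only an illustration (the function names are made up for this example): it combines the three chocolate-LP inequalities with a given vector of multipliers, confirms that the combined left-hand side dominates the objective $x_1 + 6x_2$ coefficient by coefficient, and reports the resulting upper bound.

```python
# The chocolate LP: maximize c.x subject to A x <= b, x >= 0.
A = [[1, 0],   # x1       <= 200
     [0, 1],   # x2       <= 300
     [1, 1]]   # x1 + x2  <= 400
b = [200, 300, 400]
c = [1, 6]


def certified_bound(y):
    """Upper bound on the LP optimum certified by multipliers y, or None.

    Returns None when y is not a valid certificate: a negative multiplier,
    or a combined coefficient that falls below the objective's coefficient.
    """
    if any(yi < 0 for yi in y):
        return None
    # Coefficient of each x_j after multiplying row i by y_i and adding up.
    combined = [sum(y[i] * A[i][j] for i in range(len(A))) for j in range(len(c))]
    if any(combined[j] < c[j] for j in range(len(c))):
        return None
    return sum(yi * bi for yi, bi in zip(y, b))


def primal_value(x):
    """Objective value of x, or None if x violates a constraint."""
    if any(xj < 0 for xj in x):
        return None
    for row, bi in zip(A, b):
        if sum(a * xj for a, xj in zip(row, x)) > bi:
            return None
    return sum(cj * xj for cj, xj in zip(c, x))


if __name__ == "__main__":
    print(certified_bound((1, 6, 0)))   # first inequality plus six times the second
    print(certified_bound((0, 5, 1)))   # the multipliers found by experimentation
    print(primal_value((100, 300)))     # the solution reported by simplex
```

When a certified bound equals the value of some feasible point, as happens for $(0, 5, 1)$ and $(100, 300)$, neither can be improved.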
Thus, we get an upper bound

$$x_1 + 6x_2 \le 200y_1 + 300y_2 + 400y_3 \quad \text{if} \quad \left\{ \begin{array}{l} y_1, y_2, y_3 \ge 0 \\ y_1 + y_3 \ge 1 \\ y_2 + y_3 \ge 6 \end{array} \right\}.$$

We can easily find $y$'s that satisfy the inequalities on the right by simply making them large enough, for example $(y_1, y_2, y_3) = (5, 3, 6)$. But these particular multipliers would tell us that the optimum solution of the LP is at most $200 \cdot 5 + 300 \cdot 3 + 400 \cdot 6 = 4300$, a bound that is far too loose to be of interest. What we want is a bound that is as tight as possible, so we should minimize $200y_1 + 300y_2 + 400y_3$ subject to the preceding inequalities. And this is a new linear program!

Therefore, finding the set of multipliers that gives the best upper bound on our original LP is tantamount to solving a new LP:

$$\begin{aligned} \min\ & 200y_1 + 300y_2 + 400y_3 \\ & y_1 + y_3 \ge 1 \\ & y_2 + y_3 \ge 6 \\ & y_1, y_2, y_3 \ge 0 \end{aligned}$$

By design, any feasible value of this dual LP is an upper bound on the original primal LP. So if we somehow find a pair of primal and dual feasible values that are equal, then they must both be optimal. Here is just such a pair:

Primal: $(x_1, x_2) = (100, 300)$; Dual: $(y_1, y_2, y_3) = (0, 5, 1)$.

They both have value 1900, and therefore they certify each other's optimality (Figure 7.9).

Figure 7.9 By design, dual feasible values $\ge$ primal feasible values. The duality theorem tells us that moreover their optima coincide. [Diagram: along the objective-value axis, the primal feasible values lie below the primal optimum and the dual feasible values lie above the dual optimum; the duality gap between the two optima is zero.]

Amazingly, this is not just a lucky example, but a general phenomenon. To start with, the preceding construction (creating a multiplier for each primal constraint; writing a constraint in the dual for every variable of the primal, in which the sum is required to be above the objective coefficient of the corresponding primal variable; and optimizing the sum of the multipliers weighted by the primal right-hand sides) can be carried out for any LP, as shown in Figure 7.10, and in even greater generality in Figure 7.11. The second figure has one noteworthy addition: if the primal has an equality constraint, then the corresponding multiplier (or dual variable) need not be nonnegative, because the validity of equations is preserved when multiplied by negative numbers. So, the multipliers of equations are unrestricted variables. Notice also the simple symmetry between the two LPs, in that the matrix $A = (a_{ij})$ defines one primal constraint with each of its rows, and one dual constraint with each of its columns.

Figure 7.10 A generic primal LP in matrix-vector form, and its dual.

$$\text{Primal LP:} \quad \begin{aligned} \max\ & c^T x \\ & Ax \le b \\ & x \ge 0 \end{aligned} \qquad\qquad \text{Dual LP:} \quad \begin{aligned} \min\ & y^T b \\ & y^T A \ge c^T \\ & y \ge 0 \end{aligned}$$

By construction, any feasible solution of the dual is an upper bound on any feasible solution of the primal. But moreover, their optima coincide!

Duality theorem
If a linear program has a bounded optimum, then so does its dual, and the two optimum values coincide.

When the primal is the LP that expresses the max-flow problem, it is possible to assign interpretations to the dual variables that show the dual to be none other than the minimum-cut problem (Exercise 7.25). The relation between flows and cuts is therefore just a specific instance of the duality theorem. And in fact, the proof of this theorem falls out of the simplex algorithm, in much the same way as the max-flow min-cut theorem fell out of the analysis of the max-flow algorithm.

Figure 7.11 In the most general case of linear programming, we have a set $I$ of inequalities and a set $E$ of equalities (a total of $m = |I| + |E|$ constraints) over $n$ variables, of which a subset $N$ are constrained to be nonnegative.
The dual has $m = |I| + |E|$ variables, of which only those corresponding to $I$ have nonnegativity constraints.

$$\text{Primal LP:} \quad \begin{aligned} \max\ & c_1 x_1 + \cdots + c_n x_n \\ & a_{i1} x_1 + \cdots + a_{in} x_n \le b_i \quad \text{for } i \in I \\ & a_{i1} x_1 + \cdots + a_{in} x_n = b_i \quad \text{for } i \in E \\ & x_j \ge 0 \quad \text{for } j \in N \end{aligned}$$

$$\text{Dual LP:} \quad \begin{aligned} \min\ & b_1 y_1 + \cdots + b_m y_m \\ & a_{1j} y_1 + \cdots + a_{mj} y_m \ge c_j \quad \text{for } j \in N \\ & a_{1j} y_1 + \cdots + a_{mj} y_m = c_j \quad \text{for } j \notin N \\ & y_i \ge 0 \quad \text{for } i \in I \end{aligned}$$

Visualizing duality

One can solve the shortest-path problem by the following "analog" device: Given a weighted undirected graph, build a physical model of it in which each edge is a string of length equal to the edge's weight, and each node is a knot at which the appropriate endpoints of strings are tied together. Then to find the shortest path from $s$ to $t$, just pull $s$ away from $t$ until the gadget is taut. It is intuitively clear that this finds the shortest path from $s$ to $t$.

[Figure: a string model with knots S, A, B, C, D, T.]

There is nothing remarkable or surprising about all this until we notice the following: the shortest-path problem is a minimization problem, right? Then why are we pulling $s$ away from $t$, an act whose purpose is, obviously, maximization? Answer: By pulling $s$ away from $t$ we solve the dual of the shortest-path problem! This dual has a very simple form (Exercise 7.28), with one variable $x_u$ for each node $u$:

$$\begin{aligned} \max\ & x_S - x_T \\ & |x_u - x_v| \le w_{uv} \quad \text{for all edges } \{u,v\} \end{aligned}$$

In words, the dual problem is to stretch $s$ and $t$ as far apart as possible, subject to the constraint that the endpoints of any edge $\{u,v\}$ are separated by a distance of at most $w_{uv}$.

7.5 Zero-sum games

We can represent various conflict situations in life by matrix games. For example, the schoolyard rock-paper-scissors game is specified by the payoff matrix illustrated here. There are two players, called Row and Column, and they each pick a move from $\{r, p, s\}$. They then look up the matrix entry corresponding to their moves, and Column pays Row this amount. It is Row's gain and Column's loss.

                      Column
                   r     p     s
             r     0    -1     1
    G = Row  p     1     0    -1
             s    -1     1     0

Now suppose the two of them play this game repeatedly.
If Row always makes the same move, Column will quickly catch on and will always play the countermove, winning every time. Therefore Row should mix things up: we can model this by allowing Row to have a mixed strategy, in which on each turn she plays $r$ with probability $x_1$, $p$ with probability $x_2$, and $s$ with probability $x_3$. This strategy is specified by the vector $x = (x_1, x_2, x_3)$, nonnegative numbers that add up to 1. Similarly, Column's mixed strategy is some $y = (y_1, y_2, y_3)$.²

² Also of interest are scenarios in which players alter their strategies from round to round, but these can get very complicated and are a vast subject unto themselves.

On any given round of the game, there is an $x_i y_j$ chance that Row and Column will play the $i$th and $j$th moves, respectively. Therefore the expected (average) payoff is

$$\sum_{i,j} G_{ij} \cdot \mathrm{Prob}[\text{Row plays } i, \text{ Column plays } j] = \sum_{i,j} G_{ij} x_i y_j.$$

Row wants to maximize this, while Column wants to minimize it. What payoffs can they hope to achieve in rock-paper-scissors? Well, suppose for instance that Row plays the "completely random" strategy $x = (1/3, 1/3, 1/3)$. If Column plays $r$, then the average payoff (reading the first column of the game matrix) will be

$$\frac{1}{3} \cdot 0 + \frac{1}{3} \cdot 1 + \frac{1}{3} \cdot (-1) = 0.$$

This is also true if Column plays $p$, or $s$. And since the payoff of any mixed strategy $(y_1, y_2, y_3)$ is just a weighted average of the individual payoffs for playing $r$, $p$, and $s$, it must also be zero. This can be seen directly from the preceding formula,

$$\sum_{i,j} G_{ij} x_i y_j = \sum_{i,j} G_{ij} \cdot \frac{1}{3} y_j = \sum_j y_j \left( \sum_i \frac{1}{3} G_{ij} \right) = \sum_j y_j \cdot 0 = 0,$$

where the second-to-last equality is the observation that every column of $G$ adds up to zero. Thus by playing the "completely random" strategy, Row forces an expected payoff of zero, no matter what Column does. This means that Column cannot hope for a negative (expected) payoff (remember that he wants the payoff to be as small as possible).
But symmetrically, if Column plays the completely random strategy, he also forces an expected payoff of zero, and thus Row cannot hope for a positive (expected) payoff. In short, the best each player can do is to play completely randomly, with an expected payoff of zero. We have mathematically confirmed what you knew all along about rock-paper-scissors!

Let's think about this in a slightly different way, by considering two scenarios:

1. First Row announces her strategy, and then Column picks his.
2. First Column announces his strategy, and then Row chooses hers.

We've seen that the average payoff is the same (zero) in either case if both parties play optimally. But this might well be due to the high level of symmetry in rock-paper-scissors. In general games, we'd expect the first option to favor Column, since he knows Row's strategy and can fully exploit it while choosing his own. Likewise, we'd expect the second option to favor Row. Amazingly, this is not the case: if both play optimally, then it doesn't hurt a player to announce his or her strategy in advance! What's more, this remarkable property is a consequence of, and in fact equivalent to, linear programming duality.

Let's investigate this with a nonsymmetric game. Imagine a presidential election scenario in which there are two candidates for office, and the moves they make correspond to campaign issues on which they can focus (the initials stand for economy, society, morality, and tax cut). The payoff entries are millions of votes lost by Column.

              m     t
    G =  e    3    -1
         s   -2     1

Suppose Row announces that she will play the mixed strategy $x = (1/2, 1/2)$. What should Column do? Move $m$ will incur an expected loss of $1/2$, while $t$ will incur an expected loss of 0. The best response of Column is therefore the pure strategy $y = (0, 1)$.
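Column's reasoning here is a small computation, sketched below for illustration (the function names are not from the text). Given Row's announced mixed strategy, it evaluates the expected payoff of each of Column's moves and picks the smallest, using exact fractions so that the numbers match the ones quoted above.

```python
from fractions import Fraction

# Election game: rows are Row's moves (e, s), columns are Column's moves (m, t).
# Entries are millions of votes lost by Column.
G = [[3, -1],
     [-2, 1]]


def column_payoffs(G, x):
    """Expected payoff of each pure move of Column against Row's mixed strategy x."""
    return [sum(x[i] * G[i][j] for i in range(len(G))) for j in range(len(G[0]))]


def best_response(G, x):
    """A pure strategy for Column minimizing the expected payoff, and that payoff."""
    payoffs = column_payoffs(G, x)
    j = min(range(len(payoffs)), key=payoffs.__getitem__)
    y = [0] * len(payoffs)
    y[j] = 1
    return y, payoffs[j]


if __name__ == "__main__":
    x = [Fraction(1, 2), Fraction(1, 2)]
    print(column_payoffs(G, x))   # expected loss for m, then for t
    print(best_response(G, x))    # Column's best pure reply and its payoff
```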
More generally, once Row's strategy $x = (x_1, x_2)$ is fixed, there is always a pure strategy that is optimal for Column: either move $m$, with payoff $3x_1 - 2x_2$, or $t$, with payoff $-x_1 + x_2$, whichever is smaller. After all, any mixed strategy $y$ is a weighted average of these two pure strategies and thus cannot beat the better of the two.

Therefore, if Row is forced to announce $x$ before Column plays, she knows that his best response will achieve an expected payoff of $\min\{3x_1 - 2x_2, -x_1 + x_2\}$. She should choose $x$ defensively to maximize her payoff against this best response:

$$\text{Pick } (x_1, x_2) \text{ that maximizes } \underbrace{\min\{3x_1 - 2x_2,\ -x_1 + x_2\}}_{\text{payoff from Column's best response to } x}$$

This choice of $x_i$'s gives Row the best possible guarantee about her expected payoff. And we will now see that it can be found by an LP! The main trick is to notice that for fixed $x_1$ and $x_2$ the following are equivalent:

$$z = \min\{3x_1 - 2x_2,\ -x_1 + x_2\} \qquad\qquad \begin{aligned} \max\ & z \\ & z \le 3x_1 - 2x_2 \\ & z \le -x_1 + x_2 \end{aligned}$$

And Row needs to choose $x_1$ and $x_2$ to maximize this $z$.

$$\begin{aligned} \max\ & z \\ & -3x_1 + 2x_2 + z \le 0 \\ & x_1 - x_2 + z \le 0 \\ & x_1 + x_2 = 1 \\ & x_1, x_2 \ge 0 \end{aligned}$$

Symmetrically, if Column has to announce his strategy first, his best bet is to choose the mixed strategy $y$ that minimizes his loss under Row's best response, in other words,

$$\text{Pick } (y_1, y_2) \text{ that minimizes } \underbrace{\max\{3y_1 - y_2,\ -2y_1 + y_2\}}_{\text{outcome of Row's best response to } y}$$

In LP form, this is

$$\begin{aligned} \min\ & w \\ & -3y_1 + y_2 + w \ge 0 \\ & 2y_1 - y_2 + w \ge 0 \\ & y_1 + y_2 = 1 \\ & y_1, y_2 \ge 0 \end{aligned}$$

The crucial observation now is that these two LPs are dual to each other (see Figure 7.11)! Hence, they have the same optimum, call it $V$.

Let us summarize. By solving an LP, Row (the maximizer) can determine a strategy for herself that guarantees an expected outcome of at least $V$ no matter what Column does. And by solving the dual LP, Column (the minimizer) can guarantee an expected outcome of at most $V$, no matter what Row does. It follows that this is the uniquely defined optimal play: a priori it wasn't even certain that such a play existed. $V$ is known as the value of the game.
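For a game with only two moves per player, the two optimization problems above can be solved without an LP solver, which makes the claim that they share an optimum easy to check. The sketch below is a special-purpose illustration, not a general method and not the simplex algorithm: with $x_2 = 1 - x_1$, Row's guarantee is the minimum of two linear functions of $x_1$, so its maximum over $[0,1]$ lies at an endpoint or where the two lines cross. Column's problem is handled by applying the same routine to the negated transpose of $G$. All function names are assumptions of this example.

```python
from fractions import Fraction

G = [[3, -1],
     [-2, 1]]   # election game: rows e, s; columns m, t


def row_guarantee(G, x1):
    """Row's payoff against Column's best response when Row plays (x1, 1 - x1)."""
    x = (x1, 1 - x1)
    return min(x[0] * G[0][j] + x[1] * G[1][j] for j in range(2))


def solve_row(G):
    """Best x1 for Row in a 2x2 game, and the payoff it guarantees."""
    # Against column j, Row's payoff is slope_j * x1 + G[1][j].
    slope0, icpt0 = G[0][0] - G[1][0], G[1][0]
    slope1, icpt1 = G[0][1] - G[1][1], G[1][1]
    candidates = [Fraction(0), Fraction(1)]
    if slope0 != slope1:
        crossing = Fraction(icpt1 - icpt0, slope0 - slope1)
        if 0 <= crossing <= 1:
            candidates.append(crossing)
    best = max(candidates, key=lambda x1: row_guarantee(G, x1))
    return best, row_guarantee(G, best)


def solve_column(G):
    """Best y1 for Column in a 2x2 game, and the payoff he can hold Row to."""
    # Column minimizing a payoff is Row maximizing in the negated, transposed game.
    H = [[-G[i][j] for i in range(2)] for j in range(2)]
    y1, negated = solve_row(H)
    return y1, -negated


if __name__ == "__main__":
    x1, v_row = solve_row(G)
    y1, v_col = solve_column(G)
    print("Row plays", (x1, 1 - x1), "and guarantees at least", v_row)
    print("Column plays", (y1, 1 - y1), "and guarantees at most", v_col)
```

The two printed guarantees agree, as the duality of the two LPs predicts.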
In our example, the value of the game is \(1/7\), and it is realized when Row plays her optimum mixed strategy \((3/7, 4/7)\) and Column plays his optimum mixed strategy \((2/7, 5/7)\).

This example is easily generalized to arbitrary games and shows the existence of mixed strategies that are optimal for both players and achieve the same value, a fundamental result of game theory called the min-max theorem. It can be written in equation form as follows:

\[
\max_x \min_y \sum_{i,j} G_{ij} x_i y_j \;=\; \min_y \max_x \sum_{i,j} G_{ij} x_i y_j .
\]

This is surprising, because the left-hand side, in which Row has to announce her strategy first, should presumably be better for Column than the right-hand side, in which he has to go first. Duality equalizes the two, as it did with maximum flows and minimum cuts.

Figure 7.12 A polyhedron defined by seven inequalities. [Drawing of the polyhedron in \((x_1, x_2, x_3)\)-space, with faces labeled ① to ⑦ and vertices \(A\), \(B\), \(C\) marked.]

\[
\begin{array}{rll}
\max & x_1 + 6x_2 + 13x_3 & \\
 & x_1 \le 200 & \text{①} \\
 & x_2 \le 300 & \text{②} \\
 & x_1 + x_2 + x_3 \le 400 & \text{③} \\
 & x_2 + 3x_3 \le 600 & \text{④} \\
 & x_1 \ge 0 & \text{⑤} \\
 & x_2 \ge 0 & \text{⑥} \\
 & x_3 \ge 0 & \text{⑦}
\end{array}
\]

7.6 The simplex algorithm

The extraordinary power and expressiveness of linear programs would be little consolation if we did not have a way to solve them efficiently. This is the role of the simplex algorithm.

At a high level, the simplex algorithm takes a set of linear inequalities and a linear objective function and finds the optimal feasible point by the following strategy:

    let v be any vertex of the feasible region
    while there is a neighbor v' of v with better objective value:
        set v = v'

In our 2D and 3D examples (Figure 7.1 and Figure 7.2), this was simple to visualize and made intuitive sense. But what if there are \(n\) variables, \(x_1, \ldots, x_n\)?

Any setting of the \(x_i\)’s can be represented by an \(n\)-tuple of real numbers and plotted in \(n\)-dimensional space. A linear equation involving the \(x_i\)’s defines a hyperplane in this same space \(\mathbb{R}^n\), and the corresponding linear inequality defines a half-space, all points that are either precisely on the hyperplane or lie on one particular side of it.
Finally, the feasible region of the linear program is specified by a set of inequalities and is therefore the intersection of the corresponding half-spaces, a convex polyhedron.

But what do the concepts of vertex and neighbor mean in this general context?

7.6.1 Vertices and neighbors in n-dimensional space

Figure 7.12 recalls an earlier example. Looking at it closely, we see that each vertex is the unique point at which some subset of hyperplanes meet. Vertex \(A\), for instance, is the sole point at which constraints ②, ③, and ⑦ are satisfied with equality. On the other hand, the hyperplanes corresponding to inequalities ④ and ⑥ do not define a vertex, because their intersection is not just a single point but an entire line.

Let’s make this definition precise.

Pick a subset of the inequalities. If there is a unique point that satisfies them with equality, and this point happens to be feasible, then it is a vertex.

How many equations are needed to uniquely identify a point? When there are \(n\) variables, we need at least \(n\) linear equations if we want a unique solution. On the other hand, having more than \(n\) equations is redundant: at least one of them can be rewritten as a linear combination of the others and can therefore be disregarded. In short,

Each vertex is specified by a set of \(n\) inequalities.³

A notion of neighbor now follows naturally.

Two vertices are neighbors if they have \(n - 1\) defining inequalities in common.

In Figure 7.12, for instance, vertices \(A\) and \(C\) share the two defining inequalities \(\{\text{③}, \text{⑦}\}\) and are thus neighbors.

7.6.2 The algorithm

On each iteration, simplex has two tasks:

1. Check whether the current vertex is optimal (and if so, halt).
2. Determine where to move next.

As we will see, both tasks are easy if the vertex happens to be at the origin. And if the vertex is elsewhere, we will transform the coordinate system to move it to the origin!

First let’s see why the origin is so convenient.
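Before turning to the origin, note that the vertex definition of Section 7.6.1 can be checked mechanically on the polyhedron of Figure 7.12: pick every subset of \(n = 3\) inequalities, solve them as equations, and keep the feasible solutions. The sketch below does exactly that by brute force; it is meant as an illustration of the definition, not as part of simplex, and the names `det3`, `solve3`, `vertices`, and `CONSTRAINTS` are introduced here for the purpose.

```python
from fractions import Fraction
from itertools import combinations

# The seven inequalities of Figure 7.12, each written as a . x <= b.
CONSTRAINTS = {
    1: ((1, 0, 0), 200),
    2: ((0, 1, 0), 300),
    3: ((1, 1, 1), 400),
    4: ((0, 1, 3), 600),
    5: ((-1, 0, 0), 0),    # x1 >= 0
    6: ((0, -1, 0), 0),    # x2 >= 0
    7: ((0, 0, -1), 0),    # x3 >= 0
}

def det3(m):
    """Determinant of a 3x3 matrix given as a list of rows."""
    return (m[0][0] * (m[1][1] * m[2][2] - m[1][2] * m[2][1])
            - m[0][1] * (m[1][0] * m[2][2] - m[1][2] * m[2][0])
            + m[0][2] * (m[1][0] * m[2][1] - m[1][1] * m[2][0]))

def solve3(rows, rhs):
    """Solve a 3x3 system by Cramer's rule; None if there is no unique solution."""
    d = det3(rows)
    if d == 0:
        return None
    point = []
    for k in range(3):
        mk = [[rhs[i] if j == k else rows[i][j] for j in range(3)]
              for i in range(3)]
        point.append(Fraction(det3(mk), d))
    return tuple(point)

def vertices(constraints):
    """Map each vertex to the list of inequality triples that define it."""
    found = {}
    for triple in combinations(sorted(constraints), 3):
        rows = [constraints[k][0] for k in triple]
        rhs = [constraints[k][1] for k in triple]
        p = solve3(rows, rhs)
        if p is None:
            continue                      # the three hyperplanes share no single point
        feasible = all(sum(a * v for a, v in zip(row, p)) <= b
                       for row, b in constraints.values())
        if feasible:
            found.setdefault(p, []).append(triple)
    return found

V = vertices(CONSTRAINTS)
```

In the resulting dictionary, vertex \(A = (100, 300, 0)\) appears with the single defining triple (2, 3, 7), while \(B = (0, 300, 100)\) appears with several triples, which is the degenerate situation mentioned in the footnote.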
Suppose we have some generic LP

\[
\begin{array}{rl}
\max & c^T x \\
 & Ax \le b \\
 & x \ge 0
\end{array}
\]

where \(x\) is the vector of variables, \(x = (x_1, \ldots, x_n)\). Suppose the origin is feasible. Then it is certainly a vertex, since it is the unique point at which the \(n\) inequalities \(\{x_1 \ge 0, \ldots, x_n \ge 0\}\) are tight. Now let’s solve our two tasks. Task 1:

The origin is optimal if and only if all \(c_i \le 0\).

If all \(c_i \le 0\), then considering the constraints \(x \ge 0\), we can’t hope for a better objective value. Conversely, if some \(c_i > 0\), then the origin is not optimal, since we can increase the objective function by raising \(x_i\).

Thus, for task 2, we can move by increasing some \(x_i\) for which \(c_i > 0\). How much can we increase it? Until we hit some other constraint. That is, we release the tight constraint \(x_i \ge 0\) and increase \(x_i\) until some other inequality, previously loose, now becomes tight. At that point, we again have exactly \(n\) tight inequalities, so we are at a new vertex.

³ There is one tricky issue here. It is possible that the same vertex might be generated by different subsets of inequalities. In Figure 7.12, vertex \(B\) is generated by \(\{\text{②}, \text{③}, \text{④}\}\), but also by \(\{\text{②}, \text{④}, \text{⑤}\}\). Such vertices are called degenerate and require special consideration. Let’s assume for the time being that they don’t exist, and we’ll return to them later.

For instance, suppose we’re dealing with the following linear program.

\[
\begin{array}{rll}
\max & 2x_1 + 5x_2 & \\
 & 2x_1 - x_2 \le 4 & \text{①} \\
 & x_1 + 2x_2 \le 9 & \text{②} \\
 & -x_1 + x_2 \le 3 & \text{③} \\
 & x_1 \ge 0 & \text{④} \\
 & x_2 \ge 0 & \text{⑤}
\end{array}
\]

Simplex can be started at the origin, which is specified by constraints ④ and ⑤. To move, we release the tight constraint \(x_2 \ge 0\). As \(x_2\) is gradually increased, the first constraint it runs into is \(-x_1 + x_2 \le 3\), and thus it has to stop at \(x_2 = 3\), at which point this new inequality is tight. The new vertex is thus given by ③ and ④.

So we know what to do if we are at the origin. But what if our current vertex \(u\) is elsewhere? The trick is to transform \(u\) into the origin, by shifting the coordinate system from the usual \((x_1, \ldots, x_n)\) to the local view from \(u\). These local coordinates consist of (appropriately scaled) distances \(y_1, \ldots, y_n\) to the \(n\) hyperplanes (inequalities) that define and enclose \(u\):

[Figure: a point \(x\) near vertex \(u\), with its local coordinates \(y_1\) and \(y_2\) measured from the two walls that enclose \(u\).]

Specifically, if one of these enclosing inequalities is \(a_i \cdot x \le b_i\), then the distance from a point \(x\) to that particular wall is

\[
y_i = b_i - a_i \cdot x .
\]

The \(n\) equations of this type, one per wall, define the \(y_i\)’s as linear functions of the \(x_i\)’s, and this relationship can be inverted to express the \(x_i\)’s as a linear function of the \(y_i\)’s. Thus we can rewrite the entire LP in terms of the \(y\)’s. This doesn’t fundamentally change it (for instance, the optimal value stays the same), but expresses it in a different coordinate frame. The revised local LP has the following three properties:

1. It includes the inequalities \(y \ge 0\), which are simply the transformed versions of the inequalities defining \(u\).
2. \(u\) itself is the origin in \(y\)-space.
3. The cost function becomes \(\max\ c_u + \tilde{c}^T y\), where \(c_u\) is the value of the objective function at \(u\) and \(\tilde{c}\) is a transformed cost vector.

In short, we are back to the situation we know how to handle! Figure 7.13 shows this algorithm in action, continuing with our earlier example.

The simplex algorithm is now fully defined. It moves from vertex to neighboring vertex, stopping when the objective function is locally optimal, that is, when the coordinates of the local cost vector are all zero or negative. As we’ve just seen, a vertex with this property must also be globally optimal. On the other hand, if the current vertex is not locally optimal, then its local coordinate system includes some dimension along which the objective function can be improved, so we move along this direction (along this edge of the polyhedron) until we reach a neighboring vertex.
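The first move of the running example, and the change of coordinates that follows it, can be checked numerically. The sketch below assumes the origin is feasible (all right-hand sides nonnegative) and handles the constraints \(x \ge 0\) implicitly; `step_from_origin` is a hypothetical helper, and the matrix `M` simply encodes the substitution \(x_1 = y_1\), \(x_2 = 3 + y_1 - y_2\) obtained by inverting \(y_1 = x_1\), \(y_2 = 3 + x_1 - x_2\).

```python
from fractions import Fraction

# Example LP from the text: maximize c . x subject to A x <= b, x >= 0.
c = [2, 5]
A = [[2, -1],    # constraint 1
     [1, 2],     # constraint 2
     [-1, 1]]    # constraint 3
b = [4, 9, 3]

def step_from_origin(c, A, b, i):
    """Increase x_i from the origin until some constraint a . x <= b is tight.

    Returns (step length, row index of the blocking constraint),
    or (None, None) if x_i can grow without bound.
    """
    best = (None, None)
    for k, (row, rhs) in enumerate(zip(A, b)):
        if row[i] > 0:                      # only such rows limit the increase
            t = Fraction(rhs, row[i])
            if best[0] is None or t < best[0]:
                best = (t, k)
    return best

# Release x2 >= 0 (index 1), since c[1] = 5 > 0.
step, blocking = step_from_origin(c, A, b, 1)

# Local coordinates at the new vertex u: x = x0 + M y.
x0 = [0, 3]
M = [[1, 0],
     [1, -1]]
c_u = sum(ci * xi for ci, xi in zip(c, x0))                          # objective at u
c_tilde = [sum(c[r] * M[r][k] for r in range(2)) for k in range(2)]  # local cost vector
```

The blocking row is index 2, that is, constraint ③, reached at \(x_2 = 3\); the local objective \(c_u + \tilde{c}^T y\) agrees with the rewritten LP \(15 + 7y_1 - 5y_2\) in Figure 7.13. Since \(\tilde{c}_1 = 7 > 0\), the new vertex is not yet optimal.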
By the nondegeneracy assumption (see footnote 3 in Section 7.6.1), this edge has nonzero length, and so we strictly improve the objective value. Thus the process must eventually halt.

Figure 7.13 Simplex in action.

Initial LP:

\[
\begin{array}{rll}
\max & 2x_1 + 5x_2 & \\
 & 2x_1 - x_2 \le 4 & \text{①} \\
 & x_1 + 2x_2 \le 9 & \text{②} \\
 & -x_1 + x_2 \le 3 & \text{③} \\
 & x_1 \ge 0 & \text{④} \\
 & x_2 \ge 0 & \text{⑤}
\end{array}
\]

Current vertex: \(\{\text{④}, \text{⑤}\}\) (origin). Objective value: 0.
Move: increase \(x_2\). ⑤ is released, ③ becomes tight. Stop at \(x_2 = 3\).
New vertex \(\{\text{④}, \text{③}\}\) has local coordinates \((y_1, y_2)\): \(y_1 = x_1\), \(y_2 = 3 + x_1 - x_2\).

Rewritten LP:

\[
\begin{array}{rll}
\max & 15 + 7y_1 - 5y_2 & \\
 & y_1 + y_2 \le 7 & \text{①} \\
 & 3y_1 - 2y_2 \le 3 & \text{②} \\
 & y_2 \ge 0 & \text{③} \\
 & y_1 \ge 0 & \text{④} \\
 & -y_1 + y_2 \le 3 & \text{⑤}
\end{array}
\]

Current vertex: \(\{\text{④}, \text{③}\}\). Objective value: 15.
Move: increase \(y_1\). ④ is released, ② becomes tight. Stop at \(y_1 = 1\).
New vertex \(\{\text{②}, \text{③}\}\) has local coordinates \((z_1, z_2)\): \(z_1 = 3 - 3y_1 + 2y_2\), \(z_2 = y_2\).

Rewritten LP:

\[
\begin{array}{rll}
\max & 22 - \tfrac{7}{3} z_1 - \tfrac{1}{3} z_2 & \\
 & -\tfrac{1}{3} z_1 + \tfrac{5}{3} z_2 \le 6 & \text{①} \\
 & z_1 \ge 0 & \text{②} \\
 & z_2 \ge 0 & \text{③} \\
 & \tfrac{1}{3} z_1 - \tfrac{2}{3} z_2 \le 1 & \text{④} \\
 & \tfrac{1}{3} z_1 + \tfrac{1}{3} z_2 \le 4 & \text{⑤}
\end{array}
\]

Current vertex: \(\{\text{②}, \text{③}\}\). Objective value: 22.
Optimal: all \(c_i < 0\).
Solve ②, ③ (in original LP) to get optimal solution \((x_1, x_2) = (1, 4)\).

[Diagram: the feasible region with its vertices labeled \(\{\text{④}, \text{⑤}\}\), \(\{\text{③}, \text{④}\}\), \(\{\text{②}, \text{③}\}\), \(\{\text{①}, \text{②}\}\), and \(\{\text{①}, \text{⑤}\}\); arrows mark the two moves, "Increase \(x_2\)" and "Increase \(y_1\)".]

7.6.3 Loose ends

There are several important issues in the simplex algorithm that we haven’t yet mentioned.

The starting vertex. How do we find a vertex at which to start simplex? In our 2D and 3D examples we always started at the origin, which worked because the linear programs happened to have inequalities with positive right-hand sides. In a general LP we won’t always be so fortunate. However, it turns out that finding a starting vertex can be reduced to an LP and solved by simplex!

To see how this is done, start with any linear program in standard form (recall Section 7.1.4), since we know LPs can always be rewritten this way.

\[
\min\ c^T x \ \text{ such that } Ax = b \text{ and } x \ge 0 .
\]

We first make sure that the right-hand sides of the equations are all nonnegative: if \(b_i < 0\), just multiply both sides of the \(i\)th equation by \(-1\).
Then we create a new LP as follows:

• Create \(m\) new artificial variables \(z_1, \ldots, z_m \ge 0\), where \(m\) is the number of equations.
• Add \(z_i\) to the left-hand side of the \(i\)th equation.
• Let the objective, to be minimized, be \(z_1 + z_2 + \cdots + z_m\).

For this new LP, it’s easy to come up with a starting vertex, namely, the one with \(z_i = b_i\) for all \(i\) and all other variables zero. Therefore we can solve it by simplex, to obtain the optimum solution.

There are two cases. If the optimum value of \(z_1 + \cdots + z_m\) is zero, then all \(z_i\)’s obtained by simplex are zero, and hence from the optimum vertex of the new LP we get a starting feasible vertex of the original LP, just by ignoring the \(z_i\)’s. We can at last start simplex!

But what if the optimum objective turns out to be positive? Let us think. We tried to minimize the sum of the \(z_i\)’s, but simplex decided that it cannot be zero. But this means that the original linear program is infeasible: it needs some nonzero \(z_i\)’s to become feasible. This is how simplex discovers and reports that an LP is infeasible.

Degeneracy. In the polyhedron of Figure 7.12 vertex \(B\) is degenerate. Geometrically, this means that it is the intersection of more than \(n = 3\) faces of the polyhedron (in this case, ②, ③, ④, ⑤). Algebraically, it means that if we choose any one of four sets of three inequalities (\(\{\text{②}, \text{③}, \text{④}\}\), \(\{\text{②}, \text{③}, \text{⑤}\}\), \(\{\text{②}, \text{④}, \text{⑤}\}\), and \(\{\text{③}, \text{④}, \text{⑤}\}\)) and solve the corresponding system of three linear equations in three unknowns, we’ll get the same solution in all four cases: \((0, 300, 100)\). This is a serious problem: simplex may return a suboptimal degenerate vertex simply because all its neighbors are identical to it and thus have no better objective. And if we modify simplex so that it detects degeneracy and continues to hop from vertex to vertex despite lack of any improvement in the cost, it may end up looping forever.

One way to fix this is by a perturbation: change each \(b_i\) by a tiny random amount to \(b_i \pm \epsilon_i\).
This doesn’t change the essence of the LP since the \(\epsilon_i\)’s are tiny, but it has the effect of differentiating between the solutions of the linear systems. To see why geometrically, imagine that the four planes ②, ③, ④, ⑤ were jolted a little. Wouldn’t vertex \(B\) split into two vertices, very close to one another?

Unboundedness. In some cases an LP is unbounded, in that its objective function can be made arbitrarily large (or small, if it’s a minimization problem). If this is the case, simplex will discover it: in exploring the neighborhood of a vertex, it will notice that taking out an inequality and adding another leads to an underdetermined system of equations that has an infinity of solutions. And in fact (this is an easy test) the space of solutions contains a whole line across which the objective can become larger and larger, all the way to \(\infty\). In this case simplex halts and complains.

7.6.4 The running time of simplex

What is the running time of simplex, for a generic linear program

\[
\max\ c^T x \ \text{ such that } Ax \le b \text{ and } x \ge 0,
\]

where there are \(n\) variables and \(A\) contains \(m\) inequality constraints? Since it is an iterative algorithm that proceeds from vertex to vertex, let’s start by computing the time taken for a single iteration. Suppose the current vertex is \(u\). By definition, it is the unique point at which \(n\) inequality constraints are satisfied with equality. Each of its neighbors shares \(n - 1\) of these inequalities, so \(u\) can have at most \(n \cdot m\) neighbors: choose which inequality to drop and which new one to add.

A naive way to perform an iteration would be to check each potential neighbor to see whether it really is a vertex of the polyhedron and to determine its cost. Finding the cost is quick, just a dot product, but checking whether it is a true vertex involves solving a system of \(n\) equations in \(n\) unknowns (that is, satisfying the \(n\) chosen inequalities exactly) and checking whether the result is feasible.
By Gaussian elimination (see the following box) this takes \(O(n^3)\) time, giving an unappetizing running time of \(O(mn^4)\) per iteration.

Fortunately, there is a much better way, and this \(mn^4\) factor can be improved to \(mn\), making simplex a practical algorithm. Recall our earlier discussion (Section 7.6.2) about the local view from vertex \(u\). It turns out that the per-iteration overhead of rewriting the LP in terms of the current local coordinates is just \(O((m + n)n)\); this exploits the fact that the local view changes only slightly between iterations, in just one of its defining inequalities.

Next, to select the best neighbor, we recall that the (local view of the) objective function is of the form "\(\max\ c_u + \tilde{c} \cdot y\)", where \(c_u\) is the value of the objective function at \(u\). This immediately identifies a promising direction to move: we pick any \(\tilde{c}_i > 0\) (if there is none, then the current vertex is optimal and simplex halts). Since the rest of the LP has now been rewritten in terms of the \(y\)-coordinates, it is easy to determine how much \(y_i\) can be increased before some other inequality is violated. (And if we can increase \(y_i\) indefinitely, we know the LP is unbounded.)

It follows that the running time per iteration of simplex is just \(O(mn)\). But how many iterations could there be? Naturally, there can’t be more than \(\binom{m+n}{n}\), which is an upper bound on the number of vertices. But this upper bound is exponential in \(n\). And in fact, there are examples of LPs for which simplex does indeed take an exponential number of iterations. In other words, simplex is an exponential-time algorithm. However, such exponential examples do not occur in practice, and it is this fact that makes simplex so valuable and so widely used.

Gaussian elimination

Under our algebraic definition, merely writing down the coordinates of a vertex involves solving a system of linear equations. How is this done?
We are given a system of \(n\) linear equations in \(n\) unknowns, say \(n = 4\) and

\[
\begin{array}{rcl}
x_1 - 2x_3 &=& 2 \\
x_2 + x_3 &=& 3 \\
x_1 + x_2 - x_4 &=& 4 \\
x_2 + 3x_3 + x_4 &=& 5
\end{array}
\]

The high school method for solving such systems is to repeatedly apply the following rule: if we add a multiple of one equation to another equation, the overall system of equations remains equivalent. For example, adding \(-1\) times the first equation to the third one, we get the equivalent system

\[
\begin{array}{rcl}
x_1 - 2x_3 &=& 2 \\
x_2 + x_3 &=& 3 \\
x_2 + 2x_3 - x_4 &=& 2 \\
x_2 + 3x_3 + x_4 &=& 5
\end{array}
\]

This transformation is clever in the following sense: it eliminates the variable \(x_1\) from the third equation, leaving just one equation with \(x_1\). In other words, ignoring the first equation, we have a system of three equations in three unknowns: we decreased \(n\) by 1! We can solve this smaller system to get \(x_2, x_3, x_4\), and then plug these into the first equation to get \(x_1\).

This suggests an algorithm, once more due to Gauss.

    procedure gauss(E, X)
    Input:  A system E = {e_1, ..., e_n} of equations in n unknowns X = {x_1, ..., x_n}:
            e_1: a_11 x_1 + a_12 x_2 + ... + a_1n x_n = b_1;  ...;  e_n: a_n1 x_1 + a_n2 x_2 + ... + a_nn x_n = b_n
    Output: A solution of the system, if one exists

    if all coefficients a_i1 are zero:
        halt with message "either infeasible or not linearly independent"
    if n = 1: return b_1 / a_11
    choose the coefficient a_p1 of largest magnitude, and swap equations e_1, e_p
    for i = 2 to n:
        e_i = e_i - (a_i1 / a_11) * e_1
    (x_2, ..., x_n) = gauss(E - {e_1}, X - {x_1})
    x_1 = (b_1 - sum_{j>1} a_1j x_j) / a_11
    return (x_1, ..., x_n)

(When choosing the equation to swap into first place, we pick the one with largest \(|a_{p1}|\) for reasons of numerical accuracy; after all, we will be dividing by \(a_{p1}\).)

Gaussian elimination uses \(O(n^2)\) arithmetic operations to reduce the problem size from \(n\) to \(n - 1\), and thus uses \(O(n^3)\) operations overall.
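The procedure above translates almost line for line into Python. The following is a minimal sketch rather than a faithful transcription: it is iterative instead of recursive, it works on an augmented matrix, and it uses exact rational arithmetic (an assumption made here so that results can be compared exactly; with `Fraction` the largest-magnitude pivot rule is no longer needed for accuracy, but it is kept to mirror the pseudocode).

```python
from fractions import Fraction

def gauss(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting.

    A is a list of n rows of n numbers, b a list of n numbers.
    Returns the solution as a list of Fractions, or None if the system
    is singular (either infeasible or not linearly independent).
    """
    n = len(A)
    # Augmented matrix [A | b] in exact rational arithmetic.
    M = [[Fraction(v) for v in row] + [Fraction(rhs)]
         for row, rhs in zip(A, b)]
    for col in range(n):
        # Pick the remaining equation with the largest coefficient in this column.
        p = max(range(col, n), key=lambda r: abs(M[r][col]))
        if M[p][col] == 0:
            return None
        M[col], M[p] = M[p], M[col]
        # Eliminate the current variable from every later equation.
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for k in range(col, n + 1):
                M[r][k] -= f * M[col][k]
    # Back-substitution, from the last unknown to the first.
    x = [Fraction(0)] * n
    for i in reversed(range(n)):
        s = sum(M[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (M[i][n] - s) / M[i][i]
    return x

# The 4 x 4 example system from the text.
A = [[1, 0, -2, 0],
     [0, 1, 1, 0],
     [1, 1, 0, -1],
     [0, 1, 3, 1]]
b = [2, 3, 4, 5]
solution = gauss(A, b)
```

The three nested loops over `col`, `r`, and `k` make the \(O(n^3)\) operation count visible directly in the code.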
To show that this is also a good estimate of the total running time, we need to argue that the numbers involved remain polynomially bounded: for instance, that the solution \((x_1, \ldots, x_n)\) does not require too much more precision to write down than the original coefficients \(a_{ij}\) and \(b_i\). Do you see why this is true?

Linear programming in polynomial time

Simplex is not a polynomial time algorithm. Certain rare kinds of linear programs cause it to go from one corner of the feasible region to a better corner and then to a still better one, and so on for an exponential number of steps. For a long time, linear programming was considered a paradox, a problem that can be solved in practice, but not in theory!

Then, in 1979, a young Soviet mathematician named Leonid Khachiyan came up with the ellipsoid algorithm, one that is very different from simplex, extremely simple in its conception (but sophisticated in its proof) and yet one that solves any linear program in polynomial time. Instead of chasing the solution from one corner of the polyhedron to the next, Khachiyan’s algorithm confines it to smaller and smaller ellipsoids (skewed high-dimensional balls). When this algorithm was announced, it became a kind of "mathematical Sputnik," a splashy achievement that had the U.S. establishment worried, in the height of the Cold War, about the possible scientific superiority of the Soviet Union. The ellipsoid algorithm turned out to be an important theoretical advance, but did not compete well with simplex in practice. The paradox of linear programming deepened: a problem with two algorithms, one that is efficient in theory, and one that is efficient in practice!

A few years later Narendra Karmarkar, a graduate student at UC Berkeley, came up with a completely different idea, which led to another provably polynomial algorithm for linear programming.
Karmarkar’s algorithm is known as the interior point method, because it does just that: it dashes to the optimum corner not by hopping from corner to corner on the surface of the polyhedron like simplex does, but by cutting a clever path in the interior of the polyhedron. And it does perform well in practice.

But perhaps the greatest advance in linear programming algorithms was not Khachiyan’s theoretical breakthrough or Karmarkar’s novel approach, but an unexpected consequence of the latter: the fierce competition between the two approaches, simplex and interior point, resulted in the development of very fast code for linear programming.

7.7 Postscript: circuit evaluation

The importance of linear programming stems from the astounding variety of problems that reduce to it and thereby bear witness to its expressive power. In a sense, this next one is the ultimate application.

We are given a Boolean circuit, that is, a dag of gates of the following types.

• Input gates have indegree zero, with value true or false.
• AND gates and OR gates have indegree 2.
• NOT gates have indegree 1.

In addition, one of the gates is designated as the output. Here’s an example.

[Figure: an example circuit with input gates true, false, true, internal AND, OR, and NOT gates, and one gate designated as the output.]

The CIRCUIT VALUE problem is the following: when the laws of Boolean logic are applied to the gates in topological order, does the output evaluate to true?

There is a simple, automatic way of translating this problem into a linear program. Create a variable \(x_g\) for each gate \(g\), with constraints \(0 \le x_g \le 1\). Add additional constraints for each type of gate:

• Input gate \(g\) with value true: \(x_g = 1\).
• Input gate \(g\) with value false: \(x_g = 0\).
• OR gate \(g\) with inputs \(h\) and \(h'\): \(x_g \ge x_h\), \(x_g \ge x_{h'}\), \(x_g \le x_h + x_{h'}\).
• AND gate \(g\) with inputs \(h\) and \(h'\): \(x_g \le x_h\), \(x_g \le x_{h'}\), \(x_g \ge x_h + x_{h'} - 1\).
• NOT gate \(g\) with input \(h\): \(x_g = 1 - x_h\).

These constraints force all the gates to take on exactly the right values: 0 for false, and 1 for true. We don’t need to maximize or minimize anything, and we can read the answer off from the variable \(x_o\) corresponding to the output gate.
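To see the gate constraints at work, one can evaluate a circuit directly in topological order and then confirm that the resulting 0/1 values satisfy every linear constraint listed above. The sketch below does this for a small circuit made up for illustration (it is not the circuit of the figure); the tuple encoding of gates and the function names `evaluate` and `satisfies_lp` are assumptions of this example.

```python
# Each gate is (kind, inputs); gates are listed in topological order,
# so every input index refers to an earlier gate. Hypothetical example circuit.
circuit = [
    ("TRUE", []),       # gate 0
    ("FALSE", []),      # gate 1
    ("AND", [0, 1]),    # gate 2
    ("NOT", [2]),       # gate 3
    ("OR", [1, 3]),     # gate 4, designated as the output
]

def evaluate(circuit):
    """Apply Boolean logic gate by gate; returns the list of gate values (0 or 1)."""
    x = []
    for kind, ins in circuit:
        if kind == "TRUE":
            x.append(1)
        elif kind == "FALSE":
            x.append(0)
        elif kind == "NOT":
            x.append(1 - x[ins[0]])
        elif kind == "AND":
            x.append(min(x[ins[0]], x[ins[1]]))
        elif kind == "OR":
            x.append(max(x[ins[0]], x[ins[1]]))
        else:
            raise ValueError("unknown gate type: " + kind)
    return x

def satisfies_lp(circuit, x):
    """Check the linear constraints of the reduction for an assignment x."""
    for g, (kind, ins) in enumerate(circuit):
        if not 0 <= x[g] <= 1:
            return False
        if kind == "TRUE":
            ok = x[g] == 1
        elif kind == "FALSE":
            ok = x[g] == 0
        elif kind == "NOT":
            ok = x[g] == 1 - x[ins[0]]
        elif kind == "AND":
            h, h2 = x[ins[0]], x[ins[1]]
            ok = x[g] <= h and x[g] <= h2 and x[g] >= h + h2 - 1
        else:  # OR
            h, h2 = x[ins[0]], x[ins[1]]
            ok = x[g] >= h and x[g] >= h2 and x[g] <= h + h2
        if not ok:
            return False
    return True

values = evaluate(circuit)
```

The evaluated assignment passes `satisfies_lp`, whereas flipping the value of the output gate alone violates the OR constraint \(x_g \ge x_{h'}\), which illustrates why the feasible point is unique.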
This is a straightforward reduction to linear programming, from a problem that may not seem very interesting at first. However, the CIRCUIT VALUE problem is in a sense the most general problem solvable in polynomial time! After all, any algorithm will eventually run on a computer, and the computer is ultimately a Boolean combinational circuit implemented on a chip. If the algorithm runs in polynomial time, it can be rendered as a Boolean circuit consisting of polynomially many copies of the computer’s circuit, one per unit of time, with the values of the gates in one layer used to compute the values for the next. Hence, the fact that CIRCUIT VALUE reduces to linear programming means that all problems that can be solved in polynomial time do!

In our next topic, NP-completeness, we shall see that many hard problems reduce, much the same way, to integer programming, linear programming’s difficult twin.

Another parting thought: by what other means can the circuit evaluation problem be solved? Let’s think: a circuit is a dag. And what algorithmic technique is most appropriate for solving problems on dags? That’s right: dynamic programming! Together with linear programming, it is one of the world’s two most general algorithmic techniques.

Exercises

7.1. Consider the following linear program.

\[
\begin{array}{rl}
\text{maximize} & 5x + 3y \\
 & 5x - 2y \ge 0 \\
 & x + y \le 7 \\
 & x \le 5 \\
 & x \ge 0 \\
 & y \ge 0
\end{array}
\]

Plot the feasible region and identify the optimal solution.

7.2. Duckwheat is produced in Kansas and Mexico and consumed in New York and California. Kansas produces 15 shnupells of duckwheat and Mexico 8. Meanwhile, New York consumes 10 shnupells and California 13. The transportation costs per shnupell are $4 from Mexico to New York, $1 from Mexico to California, $2 from Kansas to New York, and $3 from Kansas to California.
Write a linear program that decides the amounts of duckwheat (in shnupells and fractions of a shnupell) to be transported from each producer to each consumer, so as to minimize the overall transportation cost.

7.3. A cargo plane can carry a maximum weight of 100 tons and a maximum volume of 60 cubic meters. There are three materials to be transported, and the cargo company may choose to carry any amount of each, up to the maximum available limits given below.

• Material 1 has density 2 tons/cubic meter, maximum available amount 40 cubic meters, and revenue $1,000 per cubic meter.
• Material 2 has density 1 ton/cubic meter, maximum available amount 30 cubic meters, and revenue $1,200 per cubic meter.
• Material 3 has density 3 tons/cubic meter, maximum available amount 20 cubic meters, and revenue $12,000 per cubic meter.

Write a linear program that optimizes revenue within the constraints.

7.4. Moe is deciding how much Regular Duff beer and how much Duff Strong beer to order each week. Regular Duff costs Moe $1 per pint and he sells it at $2 per pint; Duff Strong costs Moe $1.50 per pint and he sells it at $3 per pint. However, as part of a complicated marketing scam, the Duff company will only sell a pint of Duff Strong for each two pints or more of Regular Duff that Moe buys. Furthermore, due to past events that are better left untold, Duff will not sell Moe more than 3,000 pints per week. Moe knows that he can sell however much beer he has. Formulate a linear program for deciding how much Regular Duff and how much Duff Strong to buy, so as to maximize Moe’s profit. Solve the program geometrically.

7.5. The Canine Products company offers two dog foods, Frisky Pup and Husky Hound, that are made from a blend of cereal and meat. A package of Frisky Pup requires 1 pound of cereal and 1.5 pounds of meat, and sells for $7. A package of Husky Hound uses 2 pounds of cereal and 1 pound of meat, and sells for $6. Raw cereal costs $1 per pound and raw meat costs $2 per pound.
It also costs $1.40 to package the Frisky Pup and $0.60 to package the Husky Hound. A total of 240,000 pounds of cereal and 180,000 pounds of meat are available each month. The only production bottleneck is that the factory can only package 110,000 bags of Frisky Pup per month. Needless to say, management would like to maximize profit.

(a) Formulate the problem as a linear program in two variables.
(b) Graph the feasible region, give the coordinates of every vertex, and circle the vertex maximizing profit. What is the maximum profit possible?

7.6. Give an example of a linear program in two variables whose feasible region is infinite, but such that there is an optimum solution of bounded cost.

7.7. Find necessary and sufficient conditions on the reals \(a\) and \(b\) under which the linear program

\[
\begin{array}{rl}
\max & x + y \\
 & ax + by \le 1 \\
 & x, y \ge 0
\end{array}
\]

(a) Is infeasible.
(b) Is unbounded.
(c) Has a unique optimal solution.

7.8. You are given the following points in the plane:

\[
(1, 3),\ (2, 5),\ (3, 7),\ (5, 11),\ (7, 14),\ (8, 15),\ (10, 19).
\]

You want to find a line \(ax + by = c\) that approximately passes through these points (no line is a perfect fit). Write a linear program (you don’t need to solve it) to find the line that minimizes the maximum absolute error,

\[
\max_{1 \le i \le 7} |a x_i + b y_i - c| .
\]

7.9. A quadratic programming problem seeks to maximize a quadratic objective function (with terms like \(3x_1^2\) or \(5x_1x_2\)) subject to a set of linear constraints. Give an example of a quadratic program in two variables \(x_1, x_2\) such that the feasible region is nonempty and bounded, and yet none of the vertices of this region optimize the (quadratic) objective.

7.10. For the following network, with edge capacities as shown, find the maximum flow from \(S\) to \(T\), along with a matching cut.

[Figure: a flow network on the nodes \(S, A, B, C, D, E, F, G, T\) with edge capacities.]

7.11. Write the dual to the following linear program.

\[
\begin{array}{rl}
\max & x + y \\
 & 2x + y \le 3 \\
 & x + 3y \le 5 \\
 & x, y \ge 0
\end{array}
\]

Find the optimal solutions to both primal and dual LPs.

7.12.
For the linear program

\[
\begin{array}{rl}
\max & x_1 - 2x_3 \\
 & x_1 - x_2 \le 1 \\
 & 2x_2 - x_3 \le 1 \\
 & x_1, x_2, x_3 \ge 0
\end{array}
\]

prove that the solution \((x_1, x_2, x_3) = (3/2, 1/2, 0)\) is optimal.

7.13. Matching pennies. In this simple two-player game, the players (call them \(R\) and \(C\)) each choose an outcome, heads or tails. If both outcomes are equal, \(C\) gives a dollar to \(R\); if the outcomes are different, \(R\) gives a dollar to \(C\).

(a) Represent the payoffs by a \(2 \times 2\) matrix.
(b) What is the value of this game, and what are the optimal strategies for the two players?

7.14. The pizza business in Little Town is split between two rivals, Tony and Joey. They are each investigating strategies to steal business away from the other. Joey is considering either lowering prices or cutting bigger slices. Tony is looking into starting up a line of gourmet pizzas, or offering outdoor seating, or giving free sodas at lunchtime. The effects of these various strategies are summarized in the following payoff matrix (entries are dozens of pizzas, Joey’s gain and Tony’s loss).

                              TONY
                        Gourmet   Seating   Free soda
    JOEY  Lower price      +2         0        -3
          Bigger slices    -1        -2        +1

For instance, if Joey reduces prices and Tony goes with the gourmet option, then Tony will lose 2 dozen pizzas worth of business to Joey.

What is the value of this game, and what are the optimal strategies for Tony and Joey?

7.15. Find the value of the game specified by the following payoff matrix.

\[
\begin{array}{rrrr}
0 & 0 & -1 & -1 \\
0 & 1 & -2 & -1 \\
-1 & -1 & 1 & 1 \\
-1 & 0 & 0 & 1 \\
1 & -2 & 0 & -3 \\
1 & -1 & -1 & -3 \\
0 & -3 & 2 & -1 \\
0 & -2 & 1 & -1
\end{array}
\]

(Hint: Consider the mixed strategies \((1/3, 0, 0, 1/2, 1/6, 0, 0, 0)\) and \((2/3, 0, 0, 1/3)\).)

7.16. A salad is any combination of the following ingredients: (1) tomato, (2) lettuce, (3) spinach, (4) carrot, and (5) oil. Each salad must contain: (A) at least 15 grams of protein, (B) at least 2 and at most 6 grams of fat, (C) at least 4 grams of carbohydrates, (D) at most 100 milligrams of sodium. Furthermore, (E) you do not want your salad to be more than 50% greens by mass.
The nutritional contents of these ingredients (per 100 grams) are

    ingredient   energy   protein   fat       carbohydrate   sodium
                 (kcal)   (grams)   (grams)   (grams)        (milligrams)
    tomato         21       0.85      0.33       4.64           9.00
    lettuce        16       1.62      0.20       2.37           8.00
    spinach       371      12.78      1.58      74.69           7.00
    carrot        346       8.39      1.39      80.70         508.20
    oil           884       0.00    100.00       0.00           0.00

Find a linear programming applet on the Web and use it to make the salad with the fewest calories under the nutritional constraints. Describe your linear programming formulation and the optimal solution (the quantity of each ingredient and the value). Cite the Web resources that you used.

7.17. Consider the following network (the numbers are edge capacities).

[Figure: a flow network on the nodes \(S, A, B, C, D, T\) with edge capacities.]

(a) Find the maximum flow \(f\) and a minimum cut.
(b) Draw the residual graph \(G_f\) (along with its edge capacities). In this residual network, mark the vertices reachable from \(S\) and the vertices from which \(T\) is reachable.
(c) An edge of a network is called a bottleneck edge if increasing its capacity results in an increase in the maximum flow. List all bottleneck edges in the above network.
(d) Give a very simple example (containing at most four nodes) of a network which has no bottleneck edges.
(e) Give an efficient algorithm to identify all bottleneck edges in a network. (Hint: Start by running the usual network flow algorithm, and then examine the residual graph.)

7.18. There are many common variations of the maximum flow problem. Here are four of them.

(a) There are many sources and many sinks, and we wish to maximize the total flow from all sources to all sinks.
(b) Each vertex also has a capacity on the maximum flow that can enter it.
(c) Each edge has not only a capacity, but also a lower bound on the flow it must carry.
(d) The outgoing flow from each node \(u\) is not the same as the incoming flow, but is smaller by a factor of \((1 - \epsilon_u)\), where \(\epsilon_u\) is a loss coefficient associated with node \(u\).

Each of these can be solved efficiently.
Show this by reducing (a) and (b) to the original max-flow problem, and reducing (c) and (d) to linear programming.

7.19. Suppose someone presents you with a solution to a max-flow problem on some network. Give a linear time algorithm to determine whether the solution does indeed give a maximum flow.

7.20. Consider the following generalization of the maximum flow problem. You are given a directed network \(G = (V, E)\) with edge capacities \(\{c_e\}\). Instead of a single \((s, t)\) pair, you are given multiple pairs \((s_1, t_1), (s_2, t_2), \ldots, (s_k, t_k)\), where the \(s_i\) are sources of \(G\) and the \(t_i\) are sinks of \(G\). You are also given \(k\) demands \(d_1, \ldots, d_k\). The goal is to find \(k\) flows \(f^{(1)}, \ldots, f^{(k)}\) with the following properties:

• \(f^{(i)}\) is a valid flow from \(s_i\) to \(t_i\).
• For each edge \(e\), the total flow \(f_e^{(1)} + f_e^{(2)} + \cdots + f_e^{(k)}\) does not exceed the capacity \(c_e\).
• The size of each flow \(f^{(i)}\) is at least the demand \(d_i\).
• The size of the total flow (the sum of the flows) is as large as possible.

How would you solve this problem?

7.21. An edge of a flow network is called critical if decreasing the capacity of this edge results in a decrease in the maximum flow. Give an efficient algorithm that finds a critical edge in a network.

7.22. In a particular network \(G = (V, E)\) whose edges have integer capacities \(c_e\), we have already found the maximum flow \(f\) from node \(s\) to node \(t\). However, we now find out that one of the capacity values we used was wrong: for edge \((u, v)\) we used \(c_{uv}\) whereas it should have been \(c_{uv} - 1\). This is unfortunate because the flow \(f\) uses that particular edge at full capacity: \(f_{uv} = c_{uv}\).

We could redo the flow computation from scratch, but there’s a faster way. Show how a new optimal flow can be computed in \(O(|V| + |E|)\) time.

7.23. A vertex cover of an undirected graph \(G = (V, E)\) is a subset of the vertices which touches every edge, that is, a subset \(S \subseteq V\) such that for each edge \(\{u, v\} \in E\), one or both of \(u, v\) are in \(S\).
Show that the problem of finding the minimum vertex cover in a bipartite graph reduces to maximum flow. (Hint: Can you relate this problem to the minimum cut in an appropriate network?)

7.24. Direct bipartite matching. We’ve seen how to find a maximum matching in a bipartite graph via reduction to the maximum flow problem. We now develop a direct algorithm.
Let $G = (V_1 \cup V_2, E)$ be a bipartite graph (so each edge has one endpoint in $V_1$ and one endpoint in $V_2$), and let $M \subseteq E$ be a matching in the graph (that is, a set of edges that don’t touch). A vertex is said to be covered by M if it is the endpoint of one of the edges in M. An alternating path is a path of odd length that starts and ends with a non-covered vertex, and whose edges alternate between M and $E - M$.
(a) In the bipartite graph below, a matching M is shown in bold. Find an alternating path.

[Figure: a bipartite graph on vertices A through I, with the matching M shown in bold.]

(b) Prove that a matching M is maximal if and only if there does not exist an alternating path with respect to it.
(c) Design an algorithm that finds an alternating path in $O(|V| + |E|)$ time using a variant of breadth-first search.
(d) Give a direct $O(|V| \cdot |E|)$ algorithm for finding a maximal matching in a bipartite graph.

7.25. The dual of maximum flow. Consider the following network with edge capacities.

[Figure: a network with vertices S, A, B, T and edge capacities 1, 3, 2, 1, 1.]

(a) Write the problem of finding the maximum flow from S to T as a linear program.
(b) Write down the dual of this linear program. There should be a dual variable for each edge of the network and for each vertex other than S, T.
Now we’ll solve the same problem in full generality. Recall the linear program for a general maximum flow problem (Section 7.2).
(c) Write down the dual of this general flow LP, using a variable $y_e$ for each edge and $x_u$ for each vertex $u \neq s, t$.
(d) Show that any solution to the general dual LP must satisfy the following property: for any directed path from s to t in the network, the sum of the $y_e$ values along the path must be at least 1.
(e) What are the intuitive meanings of the dual variables? Show that any s-t cut in the network can be translated into a dual feasible solution whose cost is exactly the capacity of that cut.

7.26. In a satisfiable system of linear inequalities

$$a_{11}x_1 + \cdots + a_{1n}x_n \le b_1$$
$$\vdots$$
$$a_{m1}x_1 + \cdots + a_{mn}x_n \le b_m$$

we describe the jth inequality as forced-equal if it is satisfied with equality by every solution $x = (x_1, \ldots, x_n)$ of the system. Equivalently, $\sum_i a_{ji}x_i \le b_j$ is not forced-equal if there exists an x that satisfies the whole system and such that $\sum_i a_{ji}x_i$ [...] 0.

(a) Show that this is equivalent to finding an s-t flow f that minimizes $\sum_e l_e f_e$ subject to size(f) = 1. There are no capacity constraints.
(b) Write the shortest path problem as a linear program.
(c) Show that the dual LP can be written as

$$\max\; x_s - x_t$$
$$x_u - x_v \le l_{uv} \quad \text{for all } (u, v) \in E$$

(d) An interpretation for the dual is given in the box on page 223. Why isn’t our dual LP identical to the one on that page?

7.29. Hollywood. A film producer is seeking actors and investors for his new movie. There are n available actors; actor i charges $s_i$ dollars. For funding, there are m available investors. Investor j will provide $p_j$ dollars, but only on the condition that certain actors $L_j \subseteq \{1, 2, \ldots, n\}$ are included in the cast (all of these actors $L_j$ must be chosen in order to receive funding from investor j).
The producer’s profit is the sum of the payments from investors minus the payments to actors. The goal is to maximize this profit.
(a) Express this problem as an integer linear program in which the variables take on values $\{0, 1\}$.
(b) Now relax this to a linear program, and show that there must in fact be an integral optimal solution (as is the case, for example, with maximum flow and bipartite matching).

7.30. Hall’s theorem. Returning to the matchmaking scenario of Section 7.3, suppose we have a bipartite graph with boys on the left and an equal number of girls on the right.
Hall’s theorem says that there is a perfect matching if and only if the following condition holds: any subset S of boys is connected to at least $|S|$ girls.
Prove this theorem. (Hint: The max-flow min-cut theorem should be helpful.)

7.31. Consider the following simple network with edge capacities as shown.

[Figure: a network with vertices S, A, B, T; four edges have capacity 1000 and one edge has capacity 1.]

(a) Show that, if the Ford-Fulkerson algorithm is run on this graph, a careless choice of updates might cause it to take 1000 iterations. Imagine if the capacities were a million instead of 1000!
We will now find a strategy for choosing paths under which the algorithm is guaranteed to terminate in a reasonable number of iterations.
Consider an arbitrary directed network $(G = (V, E), s, t, \{c_e\})$ in which we want to find the maximum flow. Assume for simplicity that all edge capacities are at least 1, and define the capacity of an s-t path to be the smallest capacity of its constituent edges. The fattest path from s to t is the path with the most capacity.
(b) Show that the fattest s-t path in a graph can be computed by a variant of Dijkstra’s algorithm.
(c) Show that the maximum flow in G is the sum of individual flows along at most $|E|$ paths from s to t.
(d) Now show that if we always increase flow along the fattest path in the residual graph, then the Ford-Fulkerson algorithm will terminate in at most $O(|E| \log F)$ iterations, where F is the size of the maximum flow. (Hint: It might help to recall the proof for the greedy set cover algorithm in Section 5.4.)
In fact, an even simpler rule (finding a path in the residual graph using breadth-first search) guarantees that at most $O(|V| \cdot |E|)$ iterations will be needed.

Chapter 8
NP-complete problems

8.1 Search problems

Over the past seven chapters we have developed algorithms for finding shortest paths and minimum spanning trees in graphs, matchings in bipartite graphs, maximum increasing subsequences, maximum flows in networks, and so on.
All these algorithms are efficient, because in each case their time requirement grows as a polynomial function (such as $n$, $n^2$, or $n^3$) of the size of the input.

To better appreciate such efficient algorithms, consider the alternative: In all these problems we are searching for a solution (path, tree, matching, etc.) from among an exponential population of possibilities. Indeed, n boys can be matched with n girls in $n!$ different ways, a graph with n vertices has $n^{n-2}$ spanning trees, and a typical graph has an exponential number of paths from s to t. All these problems could in principle be solved in exponential time by checking through all candidate solutions, one by one. But an algorithm whose running time is $2^n$, or worse, is all but useless in practice (see the next box). The quest for efficient algorithms is about finding clever ways to bypass this process of exhaustive search, using clues from the input in order to dramatically narrow down the search space.

So far in this book we have seen the most brilliant successes of this quest, algorithmic techniques that defeat the specter of exponentiality: greedy algorithms, dynamic programming, linear programming (while divide-and-conquer typically yields faster algorithms for problems we can already solve in polynomial time). Now the time has come to meet the quest’s most embarrassing and persistent failures. We shall see some other search problems, in which again we are seeking a solution with particular properties among an exponential chaos of alternatives. But for these new problems no shortcut seems possible. The fastest algorithms we know for them are all exponential, not substantially better than an exhaustive search. We now introduce some important examples.

The story of Sissa and Moore

According to the legend, the game of chess was invented by the Brahmin Sissa to amuse and teach his king.
Asked by the grateful monarch what he wanted in return, the wise man requested that the king place one grain of rice in the first square of the chessboard, two in the second, four in the third, and so on, doubling the amount of rice up to the 64th square. The king agreed on the spot, and as a result he was the first person to learn the valuable, albeit humbling, lesson of exponential growth. Sissa’s request amounted to $2^{64} - 1 = 18{,}446{,}744{,}073{,}709{,}551{,}615$ grains of rice, enough rice to pave all of India several times over!

All over nature, from colonies of bacteria to cells in a fetus, we see systems that grow exponentially for a while. In 1798, the British philosopher T. Robert Malthus published an essay in which he predicted that the exponential growth (he called it “geometric growth”) of the human population would soon deplete linearly growing resources, an argument that influenced Charles Darwin deeply. Malthus knew the fundamental fact that an exponential sooner or later takes over any polynomial.

In 1965, computer chip pioneer Gordon E. Moore noticed that transistor density in chips had doubled every year in the early 1960s, and he predicted that this trend would continue. This prediction, moderated to a doubling every 18 months and extended to computer speed, is known as Moore’s law. It has held remarkably well for 40 years. And these are the two root causes of the explosion of information technology in the past decades: Moore’s law and efficient algorithms.

It would appear that Moore’s law provides a disincentive for developing polynomial algorithms. After all, if an algorithm is exponential, why not wait it out until Moore’s law makes it feasible? But in reality the exact opposite happens: Moore’s law is a huge incentive for developing efficient algorithms, because such algorithms are needed in order to take advantage of the exponential increase in computer speed.

Here is why.
If, for example, an $O(2^n)$ algorithm for Boolean satisfiability (SAT) were given an hour to run, it would have solved instances with 25 variables back in 1975, 31 variables on the faster computers available in 1985, 38 variables in 1995, and about 45 variables with today’s machines. Quite a bit of progress, except that each extra variable requires a year and a half’s wait, while the appetite of applications (many of which are, ironically, related to computer design) grows much faster. In contrast, the size of the instances solved by an $O(n)$ or $O(n \log n)$ algorithm would be multiplied by a factor of about 100 each decade. In the case of an $O(n^2)$ algorithm, the instance size solvable in a fixed time would be multiplied by about 10 each decade. Even an $O(n^6)$ algorithm, polynomial yet unappetizing, would more than double the size of the instances solved each decade. When it comes to the growth of the size of problems we can attack with an algorithm, we have a reversal: exponential algorithms make polynomially slow progress, while polynomial algorithms advance exponentially fast! For Moore’s law to be reflected in the world we need efficient algorithms.

As Sissa and Malthus knew very well, exponential expansion cannot be sustained indefinitely in our finite world. Bacterial colonies run out of food; chips hit the atomic scale. Moore’s law will stop doubling the speed of our computers within a decade or two. And then progress will depend on algorithmic ingenuity, or otherwise perhaps on novel ideas such as quantum computation, explored in Chapter 10.

Satisfiability

SATISFIABILITY, or SAT (recall Exercise 3.28 and Section 5.3), is a problem of great practical importance, with applications ranging from chip testing and computer design to image analysis and software engineering. It is also a canonical hard problem.
Here’s what an instance of SAT looks like:

$$(x \lor y \lor z)\,(x \lor \overline{y})\,(y \lor \overline{z})\,(z \lor \overline{x})\,(\overline{x} \lor \overline{y} \lor \overline{z}).$$

This is a Boolean formula in conjunctive normal form (CNF). It is a collection of clauses (the parentheses), each consisting of the disjunction (logical or, denoted $\lor$) of several literals, where a literal is either a Boolean variable (such as $x$) or the negation of one (such as $\overline{x}$). A satisfying truth assignment is an assignment of false or true to each variable so that every clause contains a literal whose value is true. The SAT problem is the following: given a Boolean formula in conjunctive normal form, either find a satisfying truth assignment or else report that none exists.

In the instance shown previously, setting all variables to true, for example, satisfies every clause except the last. Is there a truth assignment that satisfies all clauses?

With a little thought, it is not hard to argue that in this particular case no such truth assignment exists. (Hint: The three middle clauses constrain all three variables to have the same value.) But how do we decide this in general? Of course, we can always search through all truth assignments, one by one, but for formulas with n variables, the number of possible assignments is exponential, $2^n$.

SAT is a typical search problem. We are given an instance I (that is, some input data specifying the problem at hand, in this case a Boolean formula in conjunctive normal form), and we are asked to find a solution S (an object that meets a particular specification, in this case an assignment that satisfies each clause). If no such solution exists, we must say so.

More specifically, a search problem must have the property that any proposed solution S to an instance I can be quickly checked for correctness. What does this entail? For one thing, S must at least be concise (quick to read), with length polynomially bounded by that of I. This is clearly true in the case of SAT, for which S is an assignment to the variables.
To formalize the notion of quick checking, we will say that there is a polynomial-time algorithm that takes as input I and S and decides whether or not S is a solution of I. For SAT, this is easy as it just involves checking whether the assignment specified by S indeed satisfies every clause in I.

Later in this chapter it will be useful to shift our vantage point and to think of this efficient algorithm for checking proposed solutions as defining the search problem. Thus:

A search problem is specified by an algorithm C that takes two inputs, an instance I and a proposed solution S, and runs in time polynomial in $|I|$. We say S is a solution to I if and only if $C(I, S) = \text{true}$.

Given the importance of the SAT search problem, researchers over the past 50 years have tried hard to find efficient ways to solve it, but without success. The fastest algorithms we have are still exponential on their worst-case inputs.

[Figure 8.1: The optimal traveling salesman tour, shown in bold, has length 18.]

Yet, interestingly, there are two natural variants of SAT for which we do have good algorithms. If all clauses contain at most one positive literal, then the Boolean formula is called a Horn formula, and a satisfying truth assignment, if one exists, can be found by the greedy algorithm of Section 5.3. Alternatively, if all clauses have only two literals, then graph theory comes into play, and SAT can be solved in linear time by finding the strongly connected components of a particular graph constructed from the instance (recall Exercise 3.28). In fact, in Chapter 9, we’ll see a different polynomial algorithm for this same special case, which is called 2SAT.

On the other hand, if we are just a little more permissive and allow clauses to contain three literals, then the resulting problem, known as 3SAT (an example of which we saw earlier), once again becomes hard to solve!
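The contrast between checking and searching can be made concrete with a small Python sketch. The encoding below (a literal as a pair of variable name and sign, a clause as a list of literals) and the function names `check` and `brute_force_sat` are assumptions of this illustration, not notation from the text. The instance is the five-clause formula discussed above.

```python
from itertools import product

# Assumed encoding: a literal is (variable, is_positive); a clause is a list
# of literals; a CNF formula is a list of clauses.
FORMULA = [
    [("x", True), ("y", True), ("z", True)],
    [("x", True), ("y", False)],
    [("y", True), ("z", False)],
    [("z", True), ("x", False)],
    [("x", False), ("y", False), ("z", False)],
]


def check(formula, assignment):
    """Checking algorithm C(I, S): True iff every clause has a true literal.

    Runs in time linear in the size of the formula.
    """
    return all(
        any(assignment[var] == positive for var, positive in clause)
        for clause in formula
    )


def brute_force_sat(formula):
    """Exhaustive search over all 2^n assignments.

    Returns a satisfying assignment, or None if none exists.
    """
    variables = sorted({var for clause in formula for var, _ in clause})
    for values in product([False, True], repeat=len(variables)):
        assignment = dict(zip(variables, values))
        if check(formula, assignment):
            return assignment
    return None


if __name__ == "__main__":
    all_true = {"x": True, "y": True, "z": True}
    print(check(FORMULA, all_true))   # the last clause fails
    print(brute_force_sat(FORMULA))   # no assignment works
```

The checker touches each literal once, while the search loop may call it $2^n$ times; only the former is the polynomial-time requirement in the definition of a search problem.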
The traveling salesman problem

In the traveling salesman problem (TSP) we are given n vertices $1, \ldots, n$ and all $n(n-1)/2$ distances between them, as well as a budget b. We are asked to find a tour, a cycle that passes through every vertex exactly once, of total cost b or less, or to report that no such tour exists. That is, we seek a permutation $\tau(1), \ldots, \tau(n)$ of the vertices such that when they are toured in this order, the total distance covered is at most b:

$$d_{\tau(1),\tau(2)} + d_{\tau(2),\tau(3)} + \cdots + d_{\tau(n),\tau(1)} \le b.$$

See Figure 8.1 for an example (only some of the distances are shown; assume the rest are very large).

Notice how we have defined the TSP as a search problem: given an instance, find a tour within the budget (or report that none exists). But why are we expressing the traveling salesman problem in this way, when in reality it is an optimization problem, in which the shortest possible tour is sought? Why dress it up as something else?

For a good reason. Our plan in this chapter is to compare and relate problems. The framework of search problems is helpful in this regard, because it encompasses optimization problems like the TSP in addition to true search problems like SAT.

Turning an optimization problem into a search problem does not change its difficulty at all, because the two versions reduce to one another. Any algorithm that solves the optimization TSP also readily solves the search problem: find the optimum tour and if it is within budget, return it; if not, there is no solution.

Conversely, an algorithm for the search problem can also be used to solve the optimization problem. To see why, first suppose that we somehow knew the cost of the optimum tour; then we could find this tour by calling the algorithm for the search problem, using the optimum cost as the budget. Fine, but how do we find the optimum cost? Easy: By binary search! (See Exercise 8.1.)

Incidentally, there is a subtlety here: Why do we have to introduce a budget?
Isn’t any optimization problem also a search problem in the sense that we are searching for a solution that has the property of being optimal? The catch is that the solution to a search problem should be easy to recognize, or as we put it earlier, polynomial-time checkable. Given a potential solution to the TSP, it is easy to check the properties “is a tour” (just check that each vertex is visited exactly once) and “has total length $\le b$.” But how could one check the property “is optimal”?

As with SAT, there are no known polynomial-time algorithms for the TSP, despite much effort by researchers over nearly a century. Of course, there is an exponential algorithm for solving it, by trying all $(n-1)!$ tours, and in Section 6.6 we saw a faster, yet still exponential, dynamic programming algorithm.

The minimum spanning tree (MST) problem, for which we do have efficient algorithms, provides a stark contrast here. To phrase it as a search problem, we are again given a distance matrix and a bound b, and are asked to find a tree T with total weight $\sum_{(i,j) \in T} d_{ij} \le b$. The TSP can be thought of as a tough cousin of the MST problem, in which the tree is not allowed to branch and is therefore a path.[1] This extra restriction on the structure of the tree results in a much harder problem.

Euler and Rudrata

In the summer of 1735 Leonhard Euler (pronounced “Oiler”), the famous Swiss mathematician, was walking the bridges of the East Prussian town of Königsberg. After a while, he noticed in frustration that, no matter where he started his walk, no matter how cleverly he continued, it was impossible to cross each bridge exactly once. And from this silly ambition, the field of graph theory was born.

Euler identified at once the roots of the park’s deficiency.
First, you turn the map of the park into a graph whose vertices are the four land masses (two islands, two banks) and whose edges are the seven bridges:

[Figure: the Königsberg graph, with vertices Northern bank, Southern bank, Small island, and Big island, joined by the seven bridges.]

[1] Actually the TSP demands a cycle, but one can define an alternative version that seeks a path, and it is not hard to see that this is just as hard as the TSP itself.

This graph has multiple edges between two vertices, a feature we have not been allowing so far in this book, but one that is meaningful for this particular problem, since each bridge must be accounted for separately. We are looking for a path that goes through each edge exactly once (the path is allowed to repeat vertices). In other words, we are asking this question: When can a graph be drawn without lifting the pencil from the paper?

The answer discovered by Euler is simple, elegant, and intuitive: If and only if (a) the graph is connected and (b) every vertex, with the possible exception of two vertices (the start and final vertices of the walk), has even degree (Exercise 3.26). This is why Königsberg’s park was impossible to traverse: all four vertices have odd degree.

To put it in terms of our present concerns, let us define a search problem called EULER PATH: Given a graph, find a path that contains each edge exactly once. It follows from Euler’s observation, and a little more thinking, that this search problem can be solved in polynomial time.

Almost a millennium before Euler’s fateful summer in East Prussia, a Kashmiri poet named Rudrata had asked this question: Can one visit all the squares of the chessboard, without repeating any square, in one long walk that ends at the starting square and at each step makes a legal knight move? This is again a graph problem: the graph now has 64 vertices, and two squares are joined by an edge if a knight can go from one to the other in a single move (that is, if their coordinates differ by 2 in one dimension and by 1 in the other).
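The graph behind Rudrata’s question is easy to build explicitly. The following Python sketch constructs it and checks a proposed closed walk; the function names `knight_graph` and `is_closed_tour` and the (row, column) vertex encoding are assumptions of this illustration.

```python
def knight_graph(n=8):
    """Adjacency lists of the knight's-move graph on an n x n board.

    Vertices are (row, column) pairs; two squares are adjacent when their
    coordinates differ by 2 in one dimension and by 1 in the other.
    """
    jumps = [(1, 2), (2, 1), (-1, 2), (-2, 1),
             (1, -2), (2, -1), (-1, -2), (-2, -1)]
    graph = {}
    for r in range(n):
        for c in range(n):
            graph[(r, c)] = [
                (r + dr, c + dc)
                for dr, dc in jumps
                if 0 <= r + dr < n and 0 <= c + dc < n
            ]
    return graph


def is_closed_tour(graph, cycle):
    """Check a proposed solution: every vertex appears exactly once, and
    consecutive vertices (including last back to first) are adjacent."""
    if len(cycle) != len(graph) or set(cycle) != set(graph):
        return False
    return all(
        cycle[(i + 1) % len(cycle)] in graph[cycle[i]]
        for i in range(len(cycle))
    )


if __name__ == "__main__":
    board = knight_graph()
    edges = sum(len(nbrs) for nbrs in board.values()) // 2
    print(len(board), "vertices,", edges, "edges")
```

Building the graph and checking a candidate walk both take time polynomial in the number of squares; the sketch deliberately says nothing about how to find such a walk.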
See Figure 8.2 for the portion of the graph corresponding to the upper left corner of the board. Can you find a knight’s tour on your chessboard?

[Figure 8.2: Knight’s moves on a corner of a chessboard.]

This is a different kind of search problem in graphs: we want a cycle that goes through all vertices (as opposed to all edges in Euler’s problem), without repeating any vertex. And there is no reason to stick to chessboards; this question can be asked of any graph. Let us define the RUDRATA CYCLE search problem to be the following: given a graph, find a cycle that visits each vertex exactly once, or report that no such cycle exists.[2] This problem is ominously reminiscent of the TSP, and indeed no polynomial algorithm is known for it.

There are two differences between the definitions of the Euler and Rudrata problems. The first is that Euler’s problem visits all edges while Rudrata’s visits all vertices. But there is also the issue that one of them demands a path while the other requires a cycle. Which of these differences accounts for the huge disparity in computational complexity between the two problems? It must be the first, because the second difference can be shown to be purely cosmetic. Indeed, define the RUDRATA PATH problem to be just like RUDRATA CYCLE, except that the goal is now to find a path that goes through each vertex exactly once. As we will soon see, there is a precise equivalence between the two versions of the Rudrata problem.

[2] In the literature this problem is known as the Hamilton cycle problem, after the great Irish mathematician who rediscovered it in the 19th century.

Cuts and bisections

A cut is a set of edges whose removal leaves a graph disconnected. It is often of interest to find small cuts, and the MINIMUM CUT problem is, given a graph and a budget b, to find a cut with at most b edges. For example, the smallest cut in Figure 8.3 is of size 3.
This problem can be solved in polynomial time by $n - 1$ max-flow computations: give each edge a capacity of 1, and find the maximum flow between some fixed node and every single other node. The smallest such flow will correspond (via the max-flow min-cut theorem) to the smallest cut. Can you see why? We’ve also seen a very different, randomized algorithm for this problem (page 150).

[Figure 8.3: What is the smallest cut in this graph?]

In many graphs, such as the one in Figure 8.3, the smallest cut leaves just a singleton vertex on one side: it consists of all edges adjacent to this vertex. Far more interesting are small cuts that partition the vertices of the graph into nearly equal-sized sets. More precisely, the BALANCED CUT problem is this: given a graph with n vertices and a budget b, partition the vertices into two sets S and T such that $|S|, |T| \ge n/3$ and such that there are at most b edges between S and T. Another hard problem.

Balanced cuts arise in a variety of important applications, such as clustering. Consider for example the problem of segmenting an image into its constituent components (say, an elephant standing in a grassy plain with a clear blue sky above). A good way of doing this is to create a graph with a node for each pixel of the image and to put an edge between nodes whose corresponding pixels are spatially close together and are also similar in color. A single object in the image (like the elephant, say) then corresponds to a set of highly connected vertices in the graph. A balanced cut is therefore likely to divide the pixels into two clusters without breaking apart any of the primary constituents of the image. The first cut might, for instance, separate the elephant on the one hand from the sky and from grass on the other. A further cut would then be needed to separate the sky from the grass.
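The max-flow recipe for MINIMUM CUT described at the start of this discussion can be sketched in a few lines of Python. The flow routine here is a plain breadth-first augmenting-path method, chosen only for brevity; the function names and the edge-list input format are assumptions of this sketch, and an undirected edge is modeled as one unit of capacity in each direction.

```python
from collections import deque


def max_flow(capacity, s, t):
    """Augment along shortest residual paths (found by BFS) until none remain."""
    residual = {u: dict(nbrs) for u, nbrs in capacity.items()}
    flow = 0
    while True:
        # Breadth-first search for a path from s to t in the residual graph.
        parent = {s: None}
        queue = deque([s])
        while queue and t not in parent:
            u = queue.popleft()
            for v, cap in residual[u].items():
                if cap > 0 and v not in parent:
                    parent[v] = u
                    queue.append(v)
        if t not in parent:
            return flow
        # Find the bottleneck capacity along the path, then push that much flow.
        bottleneck = float("inf")
        v = t
        while parent[v] is not None:
            bottleneck = min(bottleneck, residual[parent[v]][v])
            v = parent[v]
        v = t
        while parent[v] is not None:
            u = parent[v]
            residual[u][v] -= bottleneck
            residual[v][u] = residual[v].get(u, 0) + bottleneck
            v = u
        flow += bottleneck


def minimum_cut_size(vertices, edges):
    """Size of the smallest cut of an undirected graph, via n - 1 max-flows."""
    capacity = {v: {} for v in vertices}
    for u, v in edges:
        capacity[u][v] = capacity[u].get(v, 0) + 1
        capacity[v][u] = capacity[v].get(u, 0) + 1
    fixed, *others = vertices
    return min(max_flow(capacity, fixed, other) for other in others)


if __name__ == "__main__":
    # Two triangles joined by a single edge: removing that edge disconnects them.
    edges = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]
    print(minimum_cut_size(list(range(6)), edges))
```

Fixing one endpoint is enough because any cut separates the fixed node from at least one other node, so the smallest cut shows up as the smallest of the $n - 1$ flows.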
Integer linear programming

Even though the simplex algorithm is not polynomial time, we mentioned in Chapter 7 that there is a different, polynomial algorithm for linear programming. Therefore, linear programming is efficiently solvable both in practice and in theory. But the situation changes completely if, in addition to specifying a linear objective function and linear inequalities, we also constrain the solution (the values for the variables) to be integer. This latter problem is called INTEGER LINEAR PROGRAMMING (ILP). Let’s see how we might formulate it as a search problem. We are given a set of linear inequalities $Ax \le b$, where A is an $m \times n$ matrix and b is an m-vector; an objective function specified by an n-vector c; and finally, a goal g (the counterpart of a budget in maximization problems). We want to find a nonnegative integer n-vector x such that $Ax \le b$ and $c \cdot x \ge g$.

But there is a redundancy here: the last constraint $c \cdot x \ge g$ is itself a linear inequality and can be absorbed into $Ax \le b$. So, we define ILP to be the following search problem: given A and b, find a nonnegative integer vector x satisfying the inequalities $Ax \le b$, or report that none exists. Despite the many crucial applications of this problem, and intense interest by researchers, no efficient algorithm is known for it.

There is a particularly clean special case of ILP that is very hard in and of itself: the goal is to find a vector x of 0’s and 1’s satisfying $Ax = \mathbf{1}$, where A is an $m \times n$ matrix with 0-1 entries and $\mathbf{1}$ is the m-vector of all 1’s. It should be apparent from the reductions in Section 7.1.4 that this is indeed a special case of ILP. We call it ZERO-ONE EQUATIONS (ZOE).

We have now introduced a number of important search problems, some of which are familiar from earlier chapters and for which there are efficient algorithms, and others which are different in small but crucial ways that make them very hard computational problems. To complete our story we will introduce a few more hard problems, which will play a role later in the chapter, when we relate the computational difficulty of all these problems. The reader is invited to skip ahead to Section 8.2 and then return to the definitions of these problems as required.

[Figure 8.4: A more elaborate matchmaking scenario. Each triple is shown as a triangular-shaped node joining boy, girl, and pet. The boys are Al, Bob, and Chet; the girls are Alice, Beatrice, and Carol; the pets are Armadillo, Bobcat, and Canary.]

Three-dimensional matching

Recall the BIPARTITE MATCHING problem: given a bipartite graph with n nodes on each side (the boys and the girls), find a set of n disjoint edges, or decide that no such set exists. In Section 7.3, we saw how to efficiently solve this problem by a reduction to maximum flow. However, there is an interesting generalization, called 3D MATCHING, for which no polynomial algorithm is known. In this new setting, there are n boys and n girls, but also n pets, and the compatibilities among them are specified by a set of triples, each containing a boy, a girl, and a pet. Intuitively, a triple $(b, g, p)$ means that boy b, girl g, and pet p get along well together. We want to find n disjoint triples and thereby create n harmonious households.

Can you spot a solution in Figure 8.4?

Independent set, vertex cover, and clique

In the INDEPENDENT SET problem (recall Section 6.7) we are given a graph and an integer g, and the aim is to find g vertices that are independent, that is, no two of which have an edge between them. Can you find an independent set of three vertices in Figure 8.5? How about four vertices? We saw in Section 6.7 that this problem can be solved efficiently on trees, but for general graphs no polynomial algorithm is known.

There are many other search problems about graphs. In VERTEX COVER, for example, the input is a graph and a budget b, and the idea is to find b vertices that cover (touch) every edge. Can you cover all edges of Figure 8.5 with seven vertices? With six?
(And do you see the intimate connection to the INDEPENDENT SET problem?)

[Figure 8.5: What is the size of the largest independent set in this graph?]

VERTEX COVER is a special case of SET COVER, which we encountered in Chapter 5. In that problem, we are given a set E and several subsets of it, $S_1, \ldots, S_m$, along with a budget b. We are asked to select b of these subsets so that their union is E. VERTEX COVER is the special case in which E consists of the edges of a graph, and there is a subset $S_i$ for each vertex, containing the edges adjacent to that vertex. Can you see why 3D MATCHING is also a special case of SET COVER?

And finally there is the CLIQUE problem: given a graph and a goal g, find a set of g vertices such that all possible edges between them are present. What is the largest clique in Figure 8.5?

Longest path

We know the shortest-path problem can be solved very efficiently, but how about the LONGEST PATH problem? Here we are given a graph G with nonnegative edge weights and two distinguished vertices s and t, along with a goal g. We are asked to find a path from s to t with total weight at least g. Naturally, to avoid trivial solutions we require that the path be simple, containing no repeated vertices.

No efficient algorithm is known for this problem (which sometimes also goes by the name of TAXICAB RIP-OFF).

Knapsack and subset sum

Recall the KNAPSACK problem (Section 6.4): we are given integer weights $w_1, \ldots, w_n$ and integer values $v_1, \ldots, v_n$ for n items. We are also given a weight capacity W and a goal g (the former is present in the original optimization problem, the latter is added to make it a search problem). We seek a set of items whose total weight is at most W and whose total value is at least g. As always, if no such set exists, we should say so.

In Section 6.4, we developed a dynamic programming scheme for KNAPSACK with running time $O(nW)$, which we noted is exponential in the input size, since it involves W rather than $\log W$. And we have the usual exhaustive algorithm as well, which looks at all subsets of items, all $2^n$ of them. Is there a polynomial algorithm for KNAPSACK? Nobody knows of one.

But suppose that we are interested in the variant of the knapsack problem in which the integers are coded in unary, for instance, by writing IIIIIIIIIIII for 12. This is admittedly an exponentially wasteful way to represent integers, but it does define a legitimate problem, which we could call UNARY KNAPSACK. It follows from our discussion that this somewhat artificial problem does have a polynomial algorithm.

A different variation: suppose now that each item’s value is equal to its weight (all given in binary), and to top it off, the goal g is the same as the capacity W. (To adapt the silly break-in story whereby we first introduced the knapsack problem, the items are all gold nuggets, and the burglar wants to fill his knapsack to the hilt.) This special case is tantamount to finding a subset of a given set of integers that adds up to exactly W. Since it is a special case of KNAPSACK, it cannot be any harder. But could it be polynomial? As it turns out, this problem, called SUBSET SUM, is also very hard.

At this point one could ask: If SUBSET SUM is a special case that happens to be as hard as the general KNAPSACK problem, why are we interested in it? The reason is simplicity. In the complicated calculus of reductions between search problems that we shall develop in this chapter, conceptually simple problems like SUBSET SUM and 3SAT are invaluable.

8.2 NP-complete problems

Hard problems, easy problems

In short, the world is full of search problems, some of which can be solved efficiently, while others seem to be very hard. This is depicted in the following table.
Hard problems (NP-complete)     Easy problems (in P)
---------------------------     ------------------------
3SAT                            2SAT, HORN SAT
TRAVELING SALESMAN PROBLEM      MINIMUM SPANNING TREE
LONGEST PATH                    SHORTEST PATH
3D MATCHING                     BIPARTITE MATCHING
KNAPSACK                        UNARY KNAPSACK
INDEPENDENT SET                 INDEPENDENT SET on trees
INTEGER LINEAR PROGRAMMING      LINEAR PROGRAMMING
RUDRATA PATH                    EULER PATH
BALANCED CUT                    MINIMUM CUT

This table is worth contemplating. On the right we have problems that can be solved efficiently. On the left, we have a bunch of hard nuts that have escaped efficient solution over many decades or centuries.

The various problems on the right can be solved by algorithms that are specialized and diverse: dynamic programming, network flow, graph search, greedy. These problems are easy for a variety of different reasons.

In stark contrast, the problems on the left are all difficult for the same reason! At their core, they are all the same problem, just in different disguises! They are all equivalent: as we shall see in Section 8.3, each of them can be reduced to any of the others and back.

P and NP

It's time to introduce some important concepts. We know what a search problem is: its defining characteristic is that any proposed solution can be quickly checked for correctness, in the sense that there is an efficient checking algorithm $C$ that takes as input the given instance $I$ (the data specifying the problem to be solved), as well as the proposed solution $S$, and outputs true if and only if $S$ really is a solution to instance $I$. Moreover the running time of $C(I, S)$ is bounded by a polynomial in $|I|$, the length of the instance. We denote the class of all search problems by NP.

We've seen many examples of NP search problems that are solvable in polynomial time. In such cases, there is an algorithm that takes as input an instance $I$ and has a running time polynomial in $|I|$. If $I$ has a solution, the algorithm returns such a solution; and if $I$ has no solution, the algorithm correctly reports so.
The class of all search problems that can be solved in polynomial time is denoted P. Hence, all the search problems on the right-hand side of the table are in P.

Why P and NP?

Okay, P must stand for "polynomial." But why use the initials NP (the common chatroom abbreviation for "no problem") to describe the class of search problems, some of which are terribly hard?

NP stands for "nondeterministic polynomial time," a term going back to the roots of complexity theory. Intuitively, it means that a solution to any search problem can be found and verified in polynomial time by a special (and quite unrealistic) sort of algorithm, called a nondeterministic algorithm. Such an algorithm has the power of guessing correctly at every step.

Incidentally, the original definition of NP (and its most common usage to this day) was not as a class of search problems, but as a class of decision problems: algorithmic questions that can be answered by yes or no. Example: "Is there a truth assignment that satisfies this Boolean formula?" But this too reflects a historical reality: At the time the theory of NP-completeness was being developed, researchers in the theory of computation were interested in formal languages, a domain in which such decision problems are of central importance.

Are there search problems that cannot be solved in polynomial time? In other words, is $P \neq NP$? Most algorithms researchers think so. It is hard to believe that exponential search can always be avoided, that a simple trick will crack all these hard problems, famously unsolved for decades and centuries. And there is a good reason for mathematicians to believe that $P \neq NP$: the task of finding a proof for a given mathematical assertion is a search problem and is therefore in NP (after all, when a formal proof of a mathematical statement is written out in excruciating detail, it can be checked mechanically, line by line, by an efficient algorithm).
So if $P = NP$, there would be an efficient method to prove any theorem, thus eliminating the need for mathematicians! All in all, there are a variety of reasons why it is widely believed that $P \neq NP$. However, proving this has turned out to be extremely difficult, one of the deepest and most important unsolved puzzles of mathematics.

Reductions, again

Even if we accept that $P \neq NP$, what about the specific problems on the left side of the table? On the basis of what evidence do we believe that these particular problems have no efficient algorithm (besides, of course, the historical fact that many clever mathematicians and computer scientists have tried hard and failed to find any)? Such evidence is provided by reductions, which translate one search problem into another. What they demonstrate is that the problems on the left side of the table are all, in some sense, exactly the same problem, except that they are stated in different languages. What's more, we will also use reductions to show that these problems are the hardest search problems in NP: if even one of them has a polynomial time algorithm, then every problem in NP has a polynomial time algorithm. Thus if we believe that $P \neq NP$, then all these search problems are hard.

We defined reductions in Chapter 7 and saw many examples of them. Let's now specialize this definition to search problems. A reduction from search problem $A$ to search problem $B$ is a polynomial-time algorithm $f$ that transforms any instance $I$ of $A$ into an instance $f(I)$ of $B$, together with another polynomial-time algorithm $h$ that maps any solution $S$ of $f(I)$ back into a solution $h(S)$ of $I$; see the following diagram. If $f(I)$ has no solution, then neither does $I$. These two translation procedures $f$ and $h$ imply that any algorithm for $B$ can be converted into an algorithm for $A$ by bracketing it between $f$ and $h$.
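
The bracketing of a solver for $B$ between $f$ and $h$ can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the names `solve_via_reduction`, `brute_force_knapsack`, `subset_sum_to_knapsack`, and `identity` are hypothetical, the KNAPSACK solver is an exponential-time stand-in, and the example reduction is the observation from Section 8.1 that SUBSET SUM is the special case of KNAPSACK in which values equal weights and $g = W$.

```python
from itertools import combinations


def solve_via_reduction(instance_a, f, solve_b, h):
    """Solve an instance of A by bracketing a solver for B between f and h."""
    solution_b = solve_b(f(instance_a))
    if solution_b is None:
        # f(I) has no solution, so neither does I.
        return None
    return h(solution_b)


def brute_force_knapsack(instance):
    """Exponential-time stand-in for a KNAPSACK solver (illustration only).

    Returns a tuple of item indices with total weight <= capacity and
    total value >= goal, or None if no such set of items exists.
    """
    weights, values, capacity, goal = instance
    n = len(weights)
    for size in range(n + 1):
        for items in combinations(range(n), size):
            total_weight = sum(weights[i] for i in items)
            total_value = sum(values[i] for i in items)
            if total_weight <= capacity and total_value >= goal:
                return items
    return None


def subset_sum_to_knapsack(instance):
    """f: values equal weights, and the goal equals the capacity."""
    numbers, target = instance
    return (numbers, numbers, target, target)


def identity(solution):
    """h: a KNAPSACK solution is already a SUBSET SUM solution."""
    return solution
```

For example, `solve_via_reduction(([3, 34, 4, 12, 5, 2], 9), subset_sum_to_knapsack, brute_force_knapsack, identity)` returns indices of numbers that add up to exactly 9, and the same call reports `None` when no subset reaches the target.
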
[Diagram: an algorithm for A built from an algorithm for B. Instance $I$ is mapped by $f$ to instance $f(I)$, which is given to the algorithm for B. A solution $S$ of $f(I)$ is mapped by $h$ to a solution $h(S)$ of $I$; "no solution to $f(I)$" becomes "no solution to $I$."]

And now we can finally define the class of the hardest search problems.

A search problem is NP-complete if all other search problems reduce to it.

This is a very strong requirement indeed. For a problem to be NP-complete, it must be useful in solving every search problem in the world! It is remarkable that such problems exist. But they do, and the first column of the table we saw earlier is filled with the most famous examples. In Section 8.3 we shall see how all these problems reduce to one another, and also why all other search problems reduce to them.

[Figure 8.6: The space NP of all search problems, assuming $P \neq NP$. Labels: P, NP-complete, increasing difficulty.]

The two ways to use reductions

So far in this book the purpose of a reduction from a problem $A$ to a problem $B$ has been straightforward and honorable: We know how to solve $B$ efficiently, and we want to use this knowledge to solve $A$. In this chapter, however, reductions from $A$ to $B$ serve a somewhat perverse goal: we know $A$ is hard, and we use the reduction to prove that $B$ is hard as well!

If we denote a reduction from $A$ to $B$ by

$A \rightarrow B$

then we can say that difficulty flows in the direction of the arrow, while efficient algorithms move in the opposite direction. It is through this propagation of difficulty that we know NP-complete problems are hard: all other search problems reduce to them, and thus each NP-complete problem contains the complexity of all search problems. If even one NP-complete problem is in P, then $P = NP$.

Reductions also have the convenient property that they compose.

If $A \rightarrow B$ and $B \rightarrow C$, then $A \rightarrow C$.

To see this, observe first of all that any reduction is completely specified by the pre- and postprocessing functions $f$ and $h$ (see the reduction diagram).
If $(f_{AB}, h_{AB})$ and $(f_{BC}, h_{BC})$ define the reductions from $A$ to $B$ and from $B$ to $C$, respectively, then a reduction from $A$ to $C$ is given by compositions of these functions: $f_{BC} \circ f_{AB}$ maps an instance of $A$ to an instance of $C$ and $h_{AB} \circ h_{BC}$ sends a solution of $C$ back to a solution of $A$.

This means that once we know a problem $A$ is NP-complete, we can use it to prove that a new search problem $B$ is also NP-complete, simply by reducing $A$ to $B$. Such a reduction establishes that all problems in NP reduce to $B$, via $A$.

Factoring

One last point: we started off this book by introducing another famously hard search problem: FACTORING, the task of finding all prime factors of a given integer. But the difficulty of FACTORING is of a different nature than that of the other hard search problems we have just seen. For example, nobody believes that FACTORING is NP-complete. One major difference is that, in the case of FACTORING, the definition does not contain the now familiar clause "or report that none exists." A number can always be factored into primes.

Another difference (possibly not completely unrelated) is this: as we shall see in Chapter 10, FACTORING succumbs to the power of quantum computation, while SAT, TSP and the other NP-complete problems do not seem to.

[Figure 8.7: Reductions between search problems. Nodes: All of NP, SAT, 3SAT, INDEPENDENT SET, VERTEX COVER, CLIQUE, 3D MATCHING, ZOE, SUBSET SUM, ILP, RUDRATA CYCLE, TSP.]

8.3 The reductions

We shall now see that the search problems of Section 8.1 can be reduced to one another as depicted in Figure 8.7. As a consequence, they are all NP-complete.

Before we tackle the specific reductions in the tree, let's warm up by relating two versions of the Rudrata problem.

RUDRATA (s,t)-PATH → RUDRATA CYCLE

Recall the RUDRATA CYCLE problem: given a graph, is there a cycle that passes through each vertex exactly once?
We can also formulate the closely related RUDRATA (s,t)-PATH problem, in which two vertices $s$ and $t$ are specified, and we want a path starting at $s$ and ending at $t$ that goes through each vertex exactly once. Is it possible that RUDRATA CYCLE is easier than RUDRATA (s,t)-PATH? We will show by a reduction that the answer is no.

The reduction maps an instance $(G = (V, E), s, t)$ of RUDRATA (s,t)-PATH into an instance $G' = (V', E')$ of RUDRATA CYCLE as follows: $G'$ is simply $G$ with an additional vertex $x$ and two new edges $\{s, x\}$ and $\{x, t\}$. For instance:

[Figure: a graph $G$ with vertices $s$ and $t$, and the graph $G'$ obtained from it by adding the vertex $x$.]

So $V' = V \cup \{x\}$, and $E' = E \cup \{\{s, x\}, \{x, t\}\}$. How do we recover a Rudrata (s,t)-path in $G$ given any Rudrata cycle in $G'$? Easy, we just delete the edges $\{s, x\}$ and $\{x, t\}$ from the cycle.

[Diagram: the RUDRATA (s,t)-PATH instance ($G = (V, E)$ with nodes $s, t$) is turned into the RUDRATA CYCLE instance $G' = (V', E')$ by adding node $x$ and edges $\{s, x\}, \{x, t\}$. A solution cycle is turned into a solution path by deleting the edges $\{s, x\}, \{x, t\}$; "no solution" is passed back as "no solution."]

To confirm the validity of this reduction, we have to show that it works in the case of either outcome depicted.

1. When the instance of RUDRATA CYCLE has a solution.
Since the new vertex $x$ has only two neighbors, $s$ and $t$, any Rudrata cycle in $G'$ must consecutively traverse the edges $\{t, x\}$ and $\{x, s\}$. The rest of the cycle then traverses every other vertex en route from $s$ to $t$. Thus deleting the two edges $\{t, x\}$ and $\{x, s\}$ from the Rudrata cycle gives a Rudrata path from $s$ to $t$ in the original graph $G$.

2. When the instance of RUDRATA CYCLE does not have a solution.
In this case we must show that the original instance of RUDRATA (s,t)-PATH cannot have a solution either. It is usually easier to prove the contrapositive, that is, to show that if there is a Rudrata (s,t)-path in $G$, then there is also a Rudrata cycle in $G'$. But this is easy: just add the two edges $\{t, x\}$ and $\{x, s\}$ to the Rudrata path to close the cycle.
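
The two translation steps of this reduction can be sketched in Python as follows. The function names and the label `"x"` for the new vertex are assumptions made for illustration, and `brute_force_rudrata_cycle` is only an exponential-time stand-in for whatever RUDRATA CYCLE solver one might have; a cycle is represented as a list of vertices in the order visited.

```python
from itertools import permutations


def path_instance_to_cycle_instance(vertices, edges, s, t, x="x"):
    """f: build G' from G by adding a new vertex x with edges {s,x} and {x,t}.

    The label x is hypothetical and assumed not to occur in G already.
    """
    assert x not in vertices
    new_vertices = list(vertices) + [x]
    new_edges = {frozenset(e) for e in edges}
    new_edges |= {frozenset((s, x)), frozenset((x, t))}
    return new_vertices, new_edges


def cycle_to_path(cycle, s, t, x="x"):
    """h: delete the edges {s,x} and {x,t} from the cycle, leaving an s-t path."""
    i = cycle.index(x)
    path = cycle[i + 1:] + cycle[:i]  # rotate so the walk starts right after x
    if path[0] != s:                  # the cycle ran from t to s; flip it
        path.reverse()
    return path


def brute_force_rudrata_cycle(vertices, edges):
    """Exponential-time stand-in for a RUDRATA CYCLE solver (illustration only)."""
    first, rest = vertices[0], vertices[1:]
    for perm in permutations(rest):
        cycle = [first, *perm]
        n = len(cycle)
        if all(frozenset((cycle[i], cycle[(i + 1) % n])) in edges for i in range(n)):
            return cycle
    return None


def rudrata_path(vertices, edges, s, t):
    """Solve RUDRATA (s,t)-PATH by bracketing the cycle solver between f and h."""
    new_vertices, new_edges = path_instance_to_cycle_instance(vertices, edges, s, t)
    cycle = brute_force_rudrata_cycle(new_vertices, new_edges)
    return None if cycle is None else cycle_to_path(cycle, s, t)
```

Both translation functions do only a constant amount of work per vertex or edge; all of the exponential cost sits in the stand-in solver.
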

One last detail, crucial but typically easy to check, is that the pre- and postprocessing functions take time polynomial in the size of the instance $(G, s, t)$.

It is also possible to go in the other direction and reduce RUDRATA CYCLE to RUDRATA (s,t)-PATH. Together, these reductions demonstrate that the two Rudrata variants are in essence the same problem, which is not too surprising, given that their descriptions are almost the same. But most of the other reductions we will see are between pairs of problems that, on the face of it, look quite different. To show that they are essentially the same, our reductions will have to cleverly translate between them.

3SAT → INDEPENDENT SET

One can hardly think of two more different problems. In 3SAT the input is a set of clauses, each with three or fewer literals, for example

$(\bar{x} \vee y \vee \bar{z}) \; (x \vee \bar{y} \vee z) \; (x \vee y \vee z) \; (\bar{x} \vee \bar{y}),$

and the aim is to find a satisfying truth assignment. In INDEPENDENT SET the input is a graph and a number $g$, and the problem is to find a set of $g$ pairwise non-adjacent vertices. We must somehow relate Boolean logic with graphs!

[Figure 8.8: The graph corresponding to $(\bar{x} \vee y \vee \bar{z}) \; (x \vee \bar{y} \vee z) \; (x \vee y \vee z) \; (\bar{x} \vee \bar{y})$.]

Let us think. To form a satisfying truth assignment we must pick one literal from each clause and give it the value true. But our choices must be consistent: if we choose $\bar{x}$ in one clause, we cannot choose $x$ in another. Any consistent choice of literals, one from each clause, specifies a truth assignment (variables for which neither literal has been chosen can take on either value).

So, let us represent a clause, say $(x \vee \bar{y} \vee z)$, by a triangle, with vertices labeled $x, \bar{y}, z$. Why triangle? Because a triangle has its three vertices maximally connected, and thus forces us to pick only one of them for the independent set. Repeat this construction for all clauses; a clause with two literals will be represented simply by an edge joining the literals.
(A clause with one literal is silly and can be removed in a preprocessing step, since the value of the variable is determined.) In the resulting graph, an independent set has to pick at most one literal from each group (clause). To force exactly one choice from each clause, take the goal $g$ to be the number of clauses; in our example, $g = 4$.

All that is missing now is a way to prevent us from choosing opposite literals (that is, both $x$ and $\bar{x}$) in different clauses. But this is easy: put an edge between any two vertices that correspond to opposite literals. The resulting graph for our example is shown in Figure 8.8.

Let's recap the construction. Given an instance $I$ of 3SAT, we create an instance $(G, g)$ of INDEPENDENT SET as follows.

Graph $G$ has a triangle for each clause (or just an edge, if the clause has two literals), with vertices labeled by the clause's literals, and has additional edges between any two vertices that represent opposite literals.

The goal $g$ is set to the number of clauses.

Clearly, this construction takes polynomial time. However, recall that for a reduction we do not just need an efficient way to map instances of the first problem to instances of the second (the function $f$ in the reduction diagram), but also a way to reconstruct a solution to the first instance from any solution of the second (the function $h$). As always, there are two things to show.

1. Given an independent set $S$ of $g$ vertices in $G$, it is possible to efficiently recover a satisfying truth assignment to $I$.
For any variable $x$, the set $S$ cannot contain vertices labeled both $x$ and $\bar{x}$, because any such pair of vertices is connected by an edge. So assign $x$ a value of true if $S$ contains a vertex labeled $x$, and a value of false if $S$ contains a vertex labeled $\bar{x}$ (if $S$ contains neither, then assign either value to $x$).
Since $S$ has $g$ vertices, it must have one vertex per clause; this truth assignment satisfies those particular literals, and thus satisfies all clauses.

2. If graph $G$ has no independent set of size $g$, then the Boolean formula $I$ is unsatisfiable.
It is usually cleaner to prove the contrapositive, that if $I$ has a satisfying assignment then $G$ has an independent set of size $g$. This is easy: for each clause, pick any literal whose value under the satisfying assignment is true (there must be at least one such literal), and add the corresponding vertex to $S$. Do you see why set $S$ must be independent?

SAT → 3SAT

This is an interesting and common kind of reduction, from a problem to a special case of itself. We want to show that the problem remains hard even if its inputs are restricted somehow; in the present case, even if all clauses are restricted to have $\leq 3$ literals. Such reductions modify the given instance so as to get rid of the forbidden feature (clauses with $\geq 4$ literals) while keeping the instance essentially the same, in that we can read off a solution to the original instance from any solution of the modified one.

Here's the trick for reducing SAT to 3SAT: given an instance $I$ of SAT, use exactly the same instance for 3SAT, except that any clause with more than three literals, $(a_1 \vee a_2 \vee \cdots \vee a_k)$ (where the $a_i$'s are literals and $k > 3$), is replaced by a set of clauses,

$(a_1 \vee a_2 \vee y_1) \; (\bar{y}_1 \vee a_3 \vee y_2) \; (\bar{y}_2 \vee a_4 \vee y_3) \; \cdots \; (\bar{y}_{k-3} \vee a_{k-1} \vee a_k),$

where the $y_i$'s are new variables. Call the resulting 3SAT instance $I'$. The conversion from $I$ to $I'$ is clearly polynomial time.

Why does this reduction work? $I'$ is equivalent to $I$ in terms of satisfiability, because for any assignment to the $a_i$'s,

$(a_1 \vee a_2 \vee \cdots \vee a_k)$ is satisfied $\iff$ there is a setting of the $y_i$'s for which $(a_1 \vee a_2 \vee y_1) \; (\bar{y}_1 \vee a_3 \vee y_2) \; \cdots \; (\bar{y}_{k-3} \vee a_{k-1} \vee a_k)$ are all satisfied.

To see this, first suppose that the clauses on the right are all satisfied.
Then at least one of the literals $a_1, \ldots, a_k$ must be true; otherwise $y_1$ would have to be true, which would in turn force $y_2$ to be true, and so on, eventually falsifying the last clause. But this means $(a_1 \vee a_2 \vee \cdots \vee a_k)$ is also satisfied.

Conversely, if $(a_1 \vee a_2 \vee \cdots \vee a_k)$ is satisfied, then some $a_i$ must be true. Set $y_1, \ldots, y_{i-2}$ to true and the rest to false. This ensures that the clauses on the right are all satisfied.

Thus, any instance of SAT can be transformed into an equivalent instance of 3SAT. In fact, 3SAT remains hard even under the further restriction that no variable appears in more than three clauses. To show this, we must somehow get rid of any variable that appears too many times.

Here's the reduction from 3SAT to its constrained version. Suppose that in the 3SAT instance, variable $x$ appears in $k > 3$ clauses. Then replace its first appearance by $x_1$, its second appearance by $x_2$, and so on, replacing each of its $k$ appearances by a different new variable. Finally, add the clauses

$(\bar{x}_1 \vee x_2) \; (\bar{x}_2 \vee x_3) \; \cdots \; (\bar{x}_k \vee x_1).$

And repeat for every variable that appears more than three times.

It is easy to see that in the new formula no variable appears more than three times (and in fact, no literal appears more than twice). Furthermore, the extra clauses involving $x_1, x_2, \ldots, x_k$ constrain these variables to have the same value; do you see why? Hence the original instance of 3SAT is satisfiable if and only if the constrained instance is satisfiable.

INDEPENDENT SET → VERTEX COVER

[Figure 8.9: $S$ is a vertex cover if and only if $V - S$ is an independent set.]

Some reductions rely on ingenuity to relate two very different problems. Others simply record the fact that one problem is a thin disguise of another. To reduce INDEPENDENT SET to VERTEX COVER we just need to notice that a set of nodes $S$ is a vertex cover of graph $G = (V, E)$ (that is, $S$ touches every edge in $E$) if and only if the remaining nodes, $V - S$, are an independent set of $G$ (Figure 8.9).
Therefore, to solve an instance $(G, g)$ of INDEPENDENT SET, simply look for a vertex cover of $G$ with $|V| - g$ nodes. If such a vertex cover exists, then take all nodes not in it. If no such vertex cover exists, then $G$ cannot possibly have an independent set of size $g$.

INDEPENDENT SET → CLIQUE

INDEPENDENT SET and CLIQUE are also easy to reduce to one another. Define the complement of a graph $G = (V, E)$ to be $\bar{G} = (V, \bar{E})$, where $\bar{E}$ contains precisely those unordered pairs of vertices that are not in $E$. Then a set of nodes $S$ is an independent set of $G$ if and only if $S$ is a clique of $\bar{G}$. To paraphrase, these nodes have no edges between them in $G$ if and only if they have all possible edges between them in $\bar{G}$.

Therefore, we can reduce INDEPENDENT SET to CLIQUE by mapping an instance $(G, g)$ of INDEPENDENT SET to the corresponding instance $(\bar{G}, g)$ of CLIQUE; the solution to both is identical.

3SAT → 3D MATCHING

Again, two very different problems. We must reduce 3SAT to the problem of finding, among a set of boy-girl-pet triples, a subset that contains each boy, each girl, and each pet exactly once. In short, we must design sets of boy-girl-pet triples that somehow behave like Boolean variables and gates!

Consider the following set of four triples, each represented by a triangular node joining a boy, girl, and pet:

[Figure: the gadget, with boys $b_0, b_1$, girls $g_0, g_1$, and pets $p_0, p_1, p_2, p_3$.]

Suppose that the two boys $b_0$ and $b_1$ and the two girls $g_0$ and $g_1$ are not involved in any other triples. (The four pets $p_0, \ldots, p_3$ will of course belong to other triples as well; for otherwise the instance would trivially have no solution.) Then any matching must contain either the two triples $(b_0, g_1, p_0), (b_1, g_0, p_2)$ or the two triples $(b_0, g_0, p_1), (b_1, g_1, p_3)$, because these are the only ways in which these two boys and girls can find any match. Therefore, this gadget has two possible states: it behaves like a Boolean variable!
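
The claim that the gadget has exactly two states is small enough to check exhaustively. The sketch below is only an illustration, with the hypothetical names `GADGET_TRIPLES` and `gadget_states`: it tries every subset of the four triples and keeps those that use each of the two boys and two girls exactly once.

```python
from itertools import combinations

# The four triples of the gadget, written as (boy, girl, pet).
GADGET_TRIPLES = [
    ("b0", "g1", "p0"),
    ("b1", "g0", "p2"),
    ("b0", "g0", "p1"),
    ("b1", "g1", "p3"),
]


def gadget_states(triples):
    """Return every subset of triples that matches b0, b1, g0, g1 exactly once.

    The pets are not constrained here, since in the full construction they
    also take part in triples outside the gadget.
    """
    people = sorted(["b0", "b1", "g0", "g1"])
    states = []
    for size in range(len(triples) + 1):
        for chosen in combinations(triples, size):
            used = sorted(person for (boy, girl, _pet) in chosen
                          for person in (boy, girl))
            if used == people:
                states.append(set(chosen))
    return states
```

Calling `gadget_states(GADGET_TRIPLES)` yields the two pairs of triples named in the text and nothing else, which is the sense in which the gadget stores one bit.
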

To then transform an instance of 3SAT to one of 3D MATCHING, we start by creating a copy of the preceding gadget for each variable $x$. Call the resulting nodes $p_{x1}, b_{x0}, g_{x1}$, and so on. The intended interpretation is that boy $b_{x0}$ is matched with girl $g_{x1}$ if $x$ = true, and with girl $g_{x0}$ if $x$ = false.

Next we must create triples that somehow mimic clauses. For each clause, say $c = (x \vee \bar{y} \vee z)$, introduce a new boy $b_c$ and a new girl $g_c$. They will be involved in three triples, one for each literal in the clause. And the pets in these triples must reflect the three ways whereby the clause can be satisfied: (1) $x$ = true, (2) $y$ = false, (3) $z$ = true. For (1), we have the triple $(b_c, g_c, p_{x1})$, where $p_{x1}$ is the pet $p_1$ in the gadget for $x$. Here is why we chose $p_1$: if $x$ = true, then $b_{x0}$ is matched with $g_{x1}$ and $b_{x1}$ with $g_{x0}$, and so pets $p_{x0}$ and $p_{x2}$ are taken. In which case $b_c$ and $g_c$ can be matched with $p_{x1}$. But if $x$ = false, then $p_{x1}$ and $p_{x3}$ are taken, and so $g_c$ and $b_c$ cannot be accommodated this way. We do the same thing for the other two literals of the clause, which yield triples involving $b_c$ and $g_c$ with either $p_{y0}$ or $p_{y2}$ (for the negated variable $y$) and with either $p_{z1}$ or $p_{z3}$ (for variable $z$).

We have to make sure that for every occurrence of a literal in a clause $c$ there is a different pet to match with $b_c$ and $g_c$. But this is easy: by an earlier reduction we can assume that no literal appears more than twice, and so each variable gadget has enough pets, two for negated occurrences and two for unnegated.

The reduction now seems complete: from any matching we can recover a satisfying truth assignment by simply looking at each variable gadget and seeing with which girl $b_{x0}$ was matched.
And from any satisfying truth assignment we can match the gadget corresponding to each variable $x$ so that triples $(b_{x0}, g_{x1}, p_{x0})$ and $(b_{x1}, g_{x0}, p_{x2})$ are chosen if $x$ = true and triples $(b_{x0}, g_{x0}, p_{x1})$ and $(b_{x1}, g_{x1}, p_{x3})$ are chosen if $x$ = false; and for each clause $c$ match $b_c$ and $g_c$ with the pet that corresponds to one of its satisfying literals.

But one last problem remains: in the matching defined at the end of the last paragraph, some pets may be left unmatched. In fact, if there are $n$ variables and $m$ clauses, then exactly $2n - m$ pets will be left unmatched (you can check that this number is sure to be positive, because we have at most three occurrences of every variable, and at least two literals in every clause). But this is easy to fix: Add $2n - m$ new boy-girl couples that are "generic animal-lovers," and match them by triples with all the pets!

3D MATCHING → ZOE

Recall that in ZOE we are given an $m \times n$ matrix $A$ with 0-1 entries, and we must find a 0-1 vector $x = (x_1, \ldots, x_n)$ such that the $m$ equations

$Ax = \mathbf{1}$

are satisfied, where by $\mathbf{1}$ we denote the column vector of all 1's. How can we express the 3D MATCHING problem in this framework?

ZOE and ILP are very useful problems precisely because they provide a format in which many combinatorial problems can be expressed. In such a formulation we think of the 0-1 variables as describing a solution, and we write equations expressing the constraints of the problem.

For example, here is how we express an instance of 3D MATCHING ($m$ boys, $m$ girls, $m$ pets, and $n$ boy-girl-pet triples) in the language of ZOE. We have 0-1 variables $x_1, \ldots, x_n$, one per triple, where $x_i = 1$ means that the $i$th triple is chosen for the matching, and $x_i = 0$ means that it is not chosen.

Now all we have to do is write equations stating that the solution described by the $x_i$'s is a legitimate matching.
For each boy (or girl, or pet), suppose that the triples containing him (or her, or it) are those numbered $j_1, j_2, \ldots, j_k$; the appropriate equation is then

$x_{j_1} + x_{j_2} + \cdots + x_{j_k} = 1,$

which states that exactly one of these triples must be included in the matching. For example, here is the $A$ matrix for an instance of 3D MATCHING we saw earlier.

[Figure: the 3D MATCHING instance with boys Al, Bob, Chet; girls Alice, Beatrice, Carol; and pets Armadillo, Bobcat, Canary.]

\[
A = \begin{pmatrix}
1 & 0 & 0 & 0 & 0 \\
0 & 0 & 0 & 1 & 1 \\
0 & 1 & 1 & 0 & 0 \\
1 & 0 & 0 & 0 & 1 \\
0 & 1 & 0 & 0 & 0 \\
0 & 0 & 1 & 1 & 0 \\
1 & 0 & 0 & 0 & 1 \\
0 & 0 & 1 & 1 & 0 \\
0 & 1 & 0 & 0 & 0
\end{pmatrix}
\]

The five columns of $A$ correspond to the five triples, while the nine rows are for Al, Bob, Chet, Alice, Beatrice, Carol, Armadillo, Bobcat, and Canary, respectively.

It is straightforward to argue that solutions to the two instances translate back and forth.

ZOE → SUBSET SUM

This is a reduction between two special cases of ILP: one with many equations but only 0-1 coefficients, and the other with a single equation but arbitrary integer coefficients. The reduction is based on a simple and time-honored idea: 0-1 vectors can encode numbers!

For example, given this instance of ZOE:

\[
A = \begin{pmatrix}
1 & 0 & 0 & 0 \\
0 & 0 & 0 & 1 \\
0 & 1 & 1 & 0 \\
1 & 0 & 0 & 0 \\
0 & 1 & 0 & 0
\end{pmatrix},
\]

we are looking for a set of columns of $A$ that, added together, make up the all-1's vector. But if we think of the columns as binary integers (read from top to bottom), we are looking for a subset of the integers 18, 5, 4, 8 that add up to the binary integer $11111_2 = 31$. And this is an instance of SUBSET SUM. The reduction is complete!

Except for one detail, the one that usually spoils the close connection between 0-1 vectors and binary integers: carry. Because of carry, 5-bit binary integers can add up to 31 (for example, $5 + 6 + 20 = 31$ or, in binary, $00101_2 + 00110_2 + 10100_2 = 11111_2$) even when the sum of the corresponding vectors is not $(1, 1, 1, 1, 1)$.
But this is easy to fix: Think of the column vectors not as integers in base 2, but as integers in base $n + 1$, one more than the number of columns. This way, since at most $n$ integers are added, and all their digits are 0 and 1, there can be no carry, and our reduction works.

ZOE → ILP

3SAT is a special case of SAT (or, SAT is a generalization of 3SAT). By special case we mean that the instances of 3SAT are a subset of the instances of SAT (in particular, the ones with no long clauses), and the definition of solution is the same in both problems (an assignment satisfying all clauses). Consequently, there is a reduction from 3SAT to SAT, in which the input undergoes no transformation, and the solution to the target instance is also kept unchanged. In other words, functions $f$ and $h$ from the reduction diagram are both the identity.

This sounds trivial enough, but it is a very useful and common way of establishing that a problem is NP-complete: Simply notice that it is a generalization of a known NP-complete problem. For example, the SET COVER problem is NP-complete because it is a generalization of VERTEX COVER (and also, incidentally, of 3D MATCHING). See Exercise 8.10 for more examples.

Often it takes a little work to establish that one problem is a special case of another. The reduction from ZOE to ILP is a case in point. In ILP we are looking for an integer vector $x$ that satisfies $Ax \leq b$, for given matrix $A$ and vector $b$. To write an instance of ZOE in this precise form, we need to rewrite each equation of the ZOE instance as two inequalities (recall the transformations of Section 7.1.4), and to add for each variable $x_i$ the inequalities $x_i \leq 1$ and $-x_i \leq 0$.

ZOE → RUDRATA CYCLE

[Figure 8.10: Rudrata cycle with paired edges: $C = \{(e_1, e_3), (e_5, e_6), (e_4, e_5), (e_3, e_7), (e_3, e_8)\}$.]

In the RUDRATA CYCLE problem we seek a cycle in a graph that visits every vertex exactly once.
We shall prove it NP-complete in two stages: first we will reduce ZOE to a generalization of RUDRATA CYCLE, called RUDRATA CYCLE WITH PAIRED EDGES, and then we shall see how to get rid of the extra features of that problem and reduce it to the plain RUDRATA CYCLE problem.

In an instance of RUDRATA CYCLE WITH PAIRED EDGES we are given a graph $G = (V, E)$ and a set $C \subseteq E \times E$ of pairs of edges. We seek a cycle that (1) visits all vertices once, like a Rudrata cycle should, and (2) for every pair of edges $(e, e')$ in $C$, traverses either edge $e$ or edge $e'$, exactly one of them. In the simple example of Figure 8.10 a solution is shown in bold. Notice that we allow two or more parallel edges between two nodes (a feature that doesn't make sense in most graph problems), since now the different copies of an edge can be paired with other copies of edges in ways that do make a difference.

[Figure 8.11: Reducing ZOE to RUDRATA CYCLE WITH PAIRED EDGES. Labels: equations, variables.]

Now for the reduction of ZOE to RUDRATA CYCLE WITH PAIRED EDGES. Given an instance of ZOE, $Ax = \mathbf{1}$ (where $A$ is an $m \times n$ matrix with 0-1 entries, and thus describes $m$ equations in $n$ variables), the graph we construct has the very simple structure shown in Figure 8.11: a cycle that connects $m + n$ collections of parallel edges. For each variable $x_i$ we have two parallel edges (corresponding to $x_i = 1$ and $x_i = 0$). And for each equation $x_{j_1} + \cdots + x_{j_k} = 1$ involving $k$ variables we have $k$ parallel edges, one for every variable appearing in the equation. This is the whole graph. Evidently, any Rudrata cycle in this graph must traverse the $m + n$ collections of parallel edges one by one, choosing one edge from each collection. This way, the cycle "chooses" for each variable a value, 0 or 1, and, for each equation, a variable appearing in it.

The whole reduction can't be this simple, of course.
The structure of the matrix $A$ (and not just its dimensions) must be reflected somewhere, and there is one place left: the set $C$ of pairs of edges such that exactly one edge in each pair is traversed. For every equation (recall there are $m$ in total), and for every variable $x_i$ appearing in it, we add to $C$ the pair $(e, e')$ where $e$ is the edge corresponding to the appearance of $x_i$ in that particular equation (on the left-hand side of Figure 8.11), and $e'$ is the edge corresponding to the variable assignment $x_i = 0$ (on the right side of the figure). This completes the construction.

Take any solution of this instance of RUDRATA CYCLE WITH PAIRED EDGES. As discussed before, it picks a value for each variable and a variable for every equation. We claim that the values thus chosen are a solution to the original instance of ZOE. If a variable $x_i$ has value 1, then the edge $x_i = 0$ is not traversed, and thus all edges associated with $x_i$ on the equation side must be traversed (since they are paired in $C$ with the $x_i = 0$ edge). So, in each equation exactly one of the variables appearing in it has value 1, which is the same as saying that all equations are satisfied. The other direction is straightforward as well: from a solution to the instance of ZOE one easily obtains an appropriate Rudrata cycle.

Getting Rid of the Edge Pairs. So far we have a reduction from ZOE to RUDRATA CYCLE WITH PAIRED EDGES; but we are really interested in RUDRATA CYCLE, which is a special case of the problem with paired edges: the one in which the set of pairs $C$ is empty. To accomplish our goal, we need, as usual, to find a way of getting rid of the unwanted feature, in this case the edge pairs.

Consider the graph shown in Figure 8.12, and suppose that it is a part of a larger graph $G$ in such a way that only the four endpoints $a, b, c, d$ touch the rest of the graph.
We claim that this graph has the following important property: in any Rudrata cycle of G the subgraph shown must be traversed in one of the two ways shown in bold in Figure 8.12(b) and (c). Here is why. Suppose that the cycle first enters the subgraph from vertex a continuing to f. Then it must continue to vertex g, because g has degree 2 and so it must be visited immediately after one of its adjacent nodes is visited; otherwise there is no way to include it in the cycle. Hence we must go on to node h, and here we seem to have a choice. We could continue on to j, or return to c. But if we take the second option, how are we going to visit the rest of the subgraph? (A Rudrata cycle must leave no vertex unvisited.) It is easy to see that this would be impossible, and so from h we have no choice but to continue to j and from there to visit the rest of the graph as shown in Figure 8.12(b). By symmetry, if the Rudrata cycle enters this subgraph at c, it must traverse it as in Figure 8.12(c). And these are the only two ways.

But this property tells us something important: this gadget behaves just like two edges {a, b} and {c, d} that are paired up in the RUDRATA CYCLE WITH PAIRED EDGES problem (see Figure 8.12(d)).

The rest of the reduction is now clear: to reduce RUDRATA CYCLE WITH PAIRED EDGES to RUDRATA CYCLE we go through the pairs in C one by one. To get rid of each pair ({a, b}, {c, d}) we replace the two edges with the gadget in Figure 8.12(a). For any other pair in C that involves {a, b}, we replace the edge {a, b} with the new edge {a, f}, where f is from the gadget: the traversal of {a, f} is from now on an indication that edge {a, b} in the old graph would be traversed. Similarly, {c, h} replaces {c, d}.
After |C| such replacements (performed in polynomial time, since each replacement adds only 12 vertices to the graph) we are done, and the Rudrata cycles in the resulting graph will be in one-to-one correspondence with the Rudrata cycles in the original graph that conform to the constraints in C.

[Figure 8.12: A gadget for enforcing paired behavior. Panel (a): the gadget, attached to the rest of the graph at a, b, c, d. Panels (b) and (c): its two possible traversals. Panel (d): the pair of edges it simulates, C = {({a, b}, {c, d})}.]

RUDRATA CYCLE → TSP

Given a graph G = (V, E), construct the following instance of the TSP: the set of cities is the same as V, and the distance between cities u and v is 1 if {u, v} is an edge of G and $1 + \alpha$ otherwise, for some $\alpha \ge 1$ to be determined. The budget of the TSP instance is equal to the number of nodes, |V|.

Write n = |V|. It is easy to see that if G has a Rudrata cycle, then the same cycle is also a tour within the budget of the TSP instance; and that conversely, if G has no Rudrata cycle, then there is no solution: the cheapest possible TSP tour has cost at least $n + \alpha$ (it must use at least one edge of length $1 + \alpha$, and the total length of all $n - 1$ others is at least $n - 1$). Thus RUDRATA CYCLE reduces to TSP.

In this reduction, we introduced the parameter $\alpha$ because by varying it, we can obtain two interesting results. If $\alpha = 1$, then all distances are either 1 or 2, and so this instance of the TSP satisfies the triangle inequality: if i, j, k are cities, then $d_{ij} + d_{jk} \ge d_{ik}$ (proof: $a + b \ge c$ holds for any numbers $1 \le a, b, c \le 2$). This is a special case of the TSP which is of practical importance and which, as we shall see in Chapter 9, is in a certain sense easier, because it can be efficiently approximated. If on the other hand $\alpha$ is large, then the resulting instance of the TSP may not satisfy the triangle inequality, but has another important property: either it has a solution of cost n or less, or all its solutions have cost at least $n + \alpha$ (which now can be arbitrarily larger than n).
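The construction of the TSP instance can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the names `rudrata_to_tsp` and `cheapest_tour_cost` are hypothetical, vertices are assumed to be labeled 0 to n - 1 with n at least 3, and the tour search is plain brute force, usable only on tiny graphs.

```python
from itertools import permutations


def rudrata_to_tsp(n, edges, alpha=1.0):
    """Build the TSP instance for a graph on vertices 0..n-1.

    Distance is 1 across an edge of the graph and 1 + alpha otherwise;
    the budget is the number of vertices.
    """
    edge_set = {frozenset(e) for e in edges}
    dist = [[0.0] * n for _ in range(n)]
    for u in range(n):
        for v in range(n):
            if u != v:
                is_edge = frozenset((u, v)) in edge_set
                dist[u][v] = 1.0 if is_edge else 1.0 + alpha
    return dist, n


def cheapest_tour_cost(dist):
    """Brute-force optimum over all tours starting at city 0 (tiny n only)."""
    n = len(dist)
    best = float("inf")
    for perm in permutations(range(1, n)):
        tour = (0,) + perm
        cost = sum(dist[tour[i]][tour[(i + 1) % n]] for i in range(n))
        best = min(best, cost)
    return best


# A 4-cycle has a Rudrata cycle; a star on 4 vertices does not.
cycle_dist, cycle_budget = rudrata_to_tsp(4, [(0, 1), (1, 2), (2, 3), (3, 0)], alpha=5.0)
star_dist, star_budget = rudrata_to_tsp(4, [(0, 1), (0, 2), (0, 3)], alpha=5.0)
print(cheapest_tour_cost(cycle_dist) <= cycle_budget)
print(cheapest_tour_cost(star_dist) >= star_budget + 5.0)
```

On the 4-cycle the optimal tour meets the budget n; on the star every tour must use a non-edge, so its cost is at least $n + \alpha$. The optimum is therefore either at most n or at least $n + \alpha$.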
There can be nothing in between! As we shall see in Chapter 9, this important gap property implies that, unless P = NP, no approximation algorithm is possible.

ANY PROBLEM IN NP → SAT

We have reduced SAT to the various search problems in Figure 8.7. Now we come full circle and argue that all these problems, and in fact all problems in NP, reduce to SAT.

In particular, we shall show that all problems in NP can be reduced to a generalization of SAT which we call CIRCUIT SAT. In CIRCUIT SAT we are given a (Boolean) circuit (see Figure 8.13, and recall Section 7.7), a dag whose vertices are gates of five different types:

• AND gates and OR gates have indegree 2.
• NOT gates have indegree 1.
• Known input gates have no incoming edges and are labeled false or true.
• Unknown input gates have no incoming edges and are labeled "?".

One of the sinks of the dag is designated as the output gate.

Given an assignment of values to the unknown inputs, we can evaluate the gates of the circuit in topological order, using the rules of Boolean logic (such as $\text{false} \lor \text{true} = \text{true}$), until we obtain the value at the output gate. This is the value of the circuit for the particular assignment to the inputs. For instance, the circuit in Figure 8.13 evaluates to false under the assignment true, false, true (from left to right).

CIRCUIT SAT is then the following search problem: Given a circuit, find a truth assignment for the unknown inputs such that the output gate evaluates to true, or report that no such assignment exists. For example, if presented with the circuit in Figure 8.13 we could have returned the assignment (false, true, true) because, if we substitute these values to the unknown inputs (from left to right), the output becomes true.

[Figure 8.13: An instance of CIRCUIT SAT.]

CIRCUIT SAT is a generalization of SAT.
To see why, notice that SAT asks for a satisfying truth assignment for a circuit that has this simple structure: a bunch of AND gates at the top join the clauses, and the result of this big AND is the output. Each clause is the OR of its literals. And each literal is either an unknown input gate or the NOT of one. There are no known input gates.

Going in the other direction, CIRCUIT SAT can also be reduced to SAT. Here is how we can rewrite any circuit in conjunctive normal form (the AND of clauses): for each gate g in the circuit we create a variable g, and we model the effect of the gate using a few clauses:

    Gate g                    Clauses
    true                      $(g)$
    false                     $(\overline{g})$
    OR of $h_1$, $h_2$        $(g \lor \overline{h_2})$, $(g \lor \overline{h_1})$, $(\overline{g} \lor h_1 \lor h_2)$
    AND of $h_1$, $h_2$       $(\overline{g} \lor h_1)$, $(\overline{g} \lor h_2)$, $(g \lor \overline{h_1} \lor \overline{h_2})$
    NOT of $h$                $(g \lor h)$, $(\overline{g} \lor \overline{h})$

(Do you see that these clauses do, in fact, force exactly the desired effect?) And to finish up, if g is the output gate, we force it to be true by adding the clause $(g)$. The resulting instance of SAT is equivalent to the given instance of CIRCUIT SAT: the satisfying truth assignments of this conjunctive normal form are in one-to-one correspondence with those of the circuit.

Now that we know CIRCUIT SAT reduces to SAT, we turn to our main job, showing that all search problems reduce to CIRCUIT SAT. So, suppose that A is a problem in NP. We must discover a reduction from A to CIRCUIT SAT. This sounds very difficult, because we know almost nothing about A!

All we know about A is that it is a search problem, so we must put this knowledge to work. The main feature of a search problem is that any solution to it can quickly be checked: there is an algorithm $\mathcal{C}$ that checks, given an instance I and a proposed solution S, whether or not S is a solution of I. Moreover, $\mathcal{C}$ makes this decision in time polynomial in the length of I (we can assume that S is itself encoded as a binary string, and we know that the length of this string is polynomial in the length of I).
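The gate-by-gate rewriting of a circuit into conjunctive normal form, described earlier in this section, can be sketched as follows. Everything about the encoding is an assumption made for illustration: a circuit is a dictionary from gate names to tuples such as `("AND", h1, h2)`, a literal is a pair `(name, polarity)`, and the helper names `circuit_to_cnf`, `satisfies`, and `cnf_models` are hypothetical. The brute-force `cnf_models` is there only to check small examples.

```python
from itertools import product


def circuit_to_cnf(gates, output):
    """Return clauses (lists of (variable, polarity) literals) for a circuit.

    gates maps each gate name to ("INPUT",), ("TRUE",), ("FALSE",),
    ("NOT", h), ("AND", h1, h2) or ("OR", h1, h2).
    """
    clauses = []
    for g, (kind, *ins) in gates.items():
        if kind == "TRUE":
            clauses.append([(g, True)])
        elif kind == "FALSE":
            clauses.append([(g, False)])
        elif kind == "OR":
            h1, h2 = ins
            clauses += [[(g, True), (h2, False)],
                        [(g, True), (h1, False)],
                        [(g, False), (h1, True), (h2, True)]]
        elif kind == "AND":
            h1, h2 = ins
            clauses += [[(g, False), (h1, True)],
                        [(g, False), (h2, True)],
                        [(g, True), (h1, False), (h2, False)]]
        elif kind == "NOT":
            (h,) = ins
            clauses += [[(g, True), (h, True)],
                        [(g, False), (h, False)]]
        elif kind != "INPUT":  # unknown inputs contribute no clauses
            raise ValueError(f"unknown gate type: {kind}")
    clauses.append([(output, True)])  # force the output gate to be true
    return clauses


def satisfies(clauses, assignment):
    """True if every clause contains a literal made true by the assignment."""
    return all(any(assignment[v] == pol for v, pol in clause) for clause in clauses)


def cnf_models(clauses, names):
    """All satisfying assignments over the given variables, by brute force."""
    models = []
    for values in product([False, True], repeat=len(names)):
        assignment = dict(zip(names, values))
        if satisfies(clauses, assignment):
            models.append(assignment)
    return models


# Example circuit: out = ((NOT x) AND y) OR false.
example = {
    "x": ("INPUT",), "y": ("INPUT",),
    "nx": ("NOT", "x"),
    "a": ("AND", "nx", "y"),
    "f": ("FALSE",),
    "out": ("OR", "a", "f"),
}
for model in cnf_models(circuit_to_cnf(example, "out"), list(example)):
    print(model["x"], model["y"])
```

In this small example the formula has a single satisfying assignment, and its values for x and y are exactly the one input setting that makes the circuit output true, in line with the one-to-one correspondence claimed above.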
Recall now our argument in Section 7.7 that any polynomial algorithm can be rendered as a circuit, whose input gates encode the input to the algorithm. Naturally, for any input length (number of input bits) the circuit will be scaled to the appropriate number of inputs, but the total number of gates of the circuit will be polynomial in the number of inputs. If the polynomial algorithm in question solves a problem that requires a yes or no answer (as is the situation with $\mathcal{C}$: "Does S encode a solution to the instance encoded by I?"), then this answer is given at the output gate.

We conclude that, given any instance I of problem A, we can construct in polynomial time a circuit whose known inputs are the bits of I, and whose unknown inputs are the bits of S, such that the output is true if and only if the unknown inputs spell a solution S of I. In other words, the satisfying truth assignments to the unknown inputs of the circuit are in one-to-one correspondence with the solutions of instance I of A. The reduction is complete.

Unsolvable problems

At least an NP-complete problem can be solved by some algorithm; the trouble is that this algorithm will be exponential. But it turns out there are perfectly decent computational problems for which no algorithms exist at all!

One famous problem of this sort is an arithmetical version of SAT. Given a polynomial equation in many variables, perhaps

$x^3yz + 2y^4z^2 - 7xy^5z = 6$,

are there integer values of x, y, z that satisfy it? There is no algorithm that solves this problem. No algorithm at all, polynomial, exponential, doubly exponential, or worse! Such problems are called unsolvable.

The first unsolvable problem was discovered in 1936 by Alan M. Turing, then a student of mathematics at Cambridge, England.
When Turing came up with it, there were no computers or programming languages (in fact, it can be argued that these things came about later exactly because this brilliant thought occurred to Turing). But today we can state it in familiar terms.

Suppose that you are given a program in your favorite programming language, along with a particular input. Will the program ever terminate, once started on this input? This is a very reasonable question. Many of us would be ecstatic if we had an algorithm, call it terminates(p,x), that took as input a file containing a program p, and a file of data x, and after grinding away, finally told us whether or not p would ever stop if started on x.

But how would you go about writing the program terminates? (If you haven't seen this before, it's worth thinking about it for a while, to appreciate the difficulty of writing such a "universal infinite-loop detector.")

Well, you can't. Such an algorithm does not exist!

And here is the proof: Suppose we actually had such a program terminates(p,x). Then we could use it as a subroutine of the following evil program:

    function paradox(z:file)
    1: if terminates(z,z) goto 1

Notice what paradox does: it terminates if and only if program z does not terminate when given its own code as input.

You should smell trouble. What if we put this program in a file named paradox and we executed paradox(paradox)? Would this execution ever stop? Or not? Neither answer is possible. If the execution stops, then terminates(paradox,paradox) must have answered no, even though the execution did stop; if it runs forever, then terminates(paradox,paradox) must have answered yes, even though the execution never stops. Since we arrived at this contradiction by assuming that there is an algorithm for telling whether programs terminate, we must conclude that this problem cannot be solved by any algorithm.

By the way, all this tells us something important about programming: It will never be automated, it will forever depend on discipline, ingenuity, and hackery. We now know that you can't tell whether a program has an infinite loop. But can you tell if it has a buffer overrun?
Do you see how to use the unsolvability of the "halting problem" to show that this, too, is unsolvable?

Exercises

8.1. Optimization versus search. Recall the traveling salesman problem:

    TSP
    Input: A matrix of distances; a budget b
    Output: A tour which passes through all the cities and has length $\le b$, if such a tour exists.

The optimization version of this problem asks directly for the shortest tour.

    TSP-OPT
    Input: A matrix of distances
    Output: The shortest tour which passes through all the cities.

Show that if TSP can be solved in polynomial time, then so can TSP-OPT.

8.2. Search versus decision. Suppose you have a procedure which runs in polynomial time and tells you whether or not a graph has a Rudrata path. Show that you can use it to develop a polynomial-time algorithm for RUDRATA PATH (which returns the actual path, if it exists).

8.3. STINGY SAT is the following problem: given a set of clauses (each a disjunction of literals) and an integer k, find a satisfying assignment in which at most k variables are true, if such an assignment exists. Prove that STINGY SAT is NP-complete.

8.4. Consider the CLIQUE problem restricted to graphs in which every vertex has degree at most 3. Call this problem CLIQUE-3.

(a) Prove that CLIQUE-3 is in NP.

(b) What is wrong with the following proof of NP-completeness for CLIQUE-3? We know that the CLIQUE problem in general graphs is NP-complete, so it is enough to present a reduction from CLIQUE-3 to CLIQUE. Given a graph G with vertices of degree $\le 3$, and a parameter g, the reduction leaves the graph and the parameter unchanged: clearly the output of the reduction is a possible input for the CLIQUE problem. Furthermore, the answer to both problems is identical. This proves the correctness of the reduction and, therefore, the NP-completeness of CLIQUE-3.

(c) It is true that the VERTEX COVER problem remains NP-complete even when restricted to graphs in which every vertex has degree at most 3. Call this problem VC-3.
What is wrong with the following proof of NP-completeness for CLIQUE-3? We present a reduction from VC-3 to CLIQUE-3. Given a graph G = (V, E) with node degrees bounded by 3, and a parameter b, we create an instance of CLIQUE-3 by leaving the graph unchanged and switching the parameter to $|V| - b$. Now, a subset $C \subseteq V$ is a vertex cover in G if and only if the complementary set $V - C$ is a clique in G. Therefore G has a vertex cover of size $\le b$ if and only if it has a clique of size $\ge |V| - b$. This proves the correctness of the reduction and, consequently, the NP-completeness of CLIQUE-3.

(d) Describe an $O(|V|^4)$ algorithm for CLIQUE-3.

8.5. Give a simple reduction from 3D MATCHING to SAT, and another from RUDRATA CYCLE to SAT. (Hint: In the latter case you may use variables $x_{ij}$ whose intuitive meaning is "vertex i is the jth vertex of the Hamilton cycle"; you then need to write clauses that express the constraints of the problem.)

8.6. On page 266 we saw that 3SAT remains NP-complete even when restricted to formulas in which each literal appears at most twice.

(a) Show that if each literal appears at most once, then the problem is solvable in polynomial time.

(b) Show that INDEPENDENT SET remains NP-complete even in the special case when all the nodes in the graph have degree at most 4.

8.7. Consider a special case of 3SAT in which all clauses have exactly three literals, and each variable appears at most three times. Show that this problem can be solved in polynomial time. (Hint: create a bipartite graph with clauses on the left, variables on the right, and edges whenever a variable appears in a clause. Use Exercise 7.30 to show that this graph has a matching.)

8.8. In the EXACT 4SAT problem, the input is a set of clauses, each of which is a disjunction of exactly four literals, and such that each variable occurs at most once in each clause. The goal is to find a satisfying assignment, if one exists.
Prove that EXACT 4SAT is NP-complete.

8.9. In the HITTING SET problem, we are given a family of sets $\{S_1, S_2, \ldots, S_n\}$ and a budget b, and we wish to find a set H of size $\le b$ which intersects every $S_i$, if such an H exists. In other words, we want $H \cap S_i \ne \emptyset$ for all i. Show that HITTING SET is NP-complete.

8.10. Proving NP-completeness by generalization. For each of the problems below, prove that it is NP-complete by showing that it is a generalization of some NP-complete problem we have seen in this chapter.

(a) SUBGRAPH ISOMORPHISM: Given as input two undirected graphs G and H, determine whether G is a subgraph of H (that is, whether by deleting certain vertices and edges of H we obtain a graph that is, up to renaming of vertices, identical to G), and if so, return the corresponding mapping of V(G) into V(H).

(b) LONGEST PATH: Given a graph G and an integer g, find in G a simple path of length g.

(c) MAX SAT: Given a CNF formula and an integer g, find a truth assignment that satisfies at least g clauses.

(d) DENSE SUBGRAPH: Given a graph and two integers a and b, find a set of a vertices of G such that there are at least b edges between them.

(e) SPARSE SUBGRAPH: Given a graph and two integers a and b, find a set of a vertices of G such that there are at most b edges between them.

(f) SET COVER. (This problem generalizes two known NP-complete problems.)

(g) RELIABLE NETWORK: We are given two $n \times n$ matrices, a distance matrix $d_{ij}$ and a connectivity requirement matrix $r_{ij}$, as well as a budget b; we must find a graph $G = (\{1, 2, \ldots, n\}, E)$ such that (1) the total cost of all edges is b or less and (2) between any two distinct vertices i and j there are $r_{ij}$ vertex-disjoint paths. (Hint: Suppose that all $d_{ij}$'s are 1 or 2, b = n, and all $r_{ij}$'s are 2. Which well known NP-complete problem is this?)

8.11. There are many variants of Rudrata's problem, depending on whether the graph is undirected or directed, and whether a cycle or path is sought.
Reduce the DIRECTED RUDRATA PATH problem to each of the following.

(a) The (undirected) RUDRATA PATH problem.

(b) The undirected RUDRATA (s, t)-PATH problem, which is just like RUDRATA PATH except that the endpoints of the path are specified in the input.

8.12. The k-SPANNING TREE problem is the following.

    Input: An undirected graph G = (V, E)
    Output: A spanning tree of G in which each node has degree $\le k$, if such a tree exists.

Show that for any $k \ge 2$:

(a) k-SPANNING TREE is a search problem.

(b) k-SPANNING TREE is NP-complete. (Hint: Start with k = 2 and consider the relation between this problem and RUDRATA PATH.)

8.13. Determine which of the following problems are NP-complete and which are solvable in polynomial time. In each problem you are given an undirected graph G = (V, E), along with:

(a) A set of nodes $L \subseteq V$, and you must find a spanning tree such that its set of leaves includes the set L.

(b) A set of nodes $L \subseteq V$, and you must find a spanning tree such that its set of leaves is precisely the set L.

(c) A set of nodes $L \subseteq V$, and you must find a spanning tree such that its set of leaves is included in the set L.

(d) An integer k, and you must find a spanning tree with k or fewer leaves.

(e) An integer k, and you must find a spanning tree with k or more leaves.

(f) An integer k, and you must find a spanning tree with exactly k leaves.

(Hint: All the NP-completeness proofs are by generalization, except for one.)

8.14. Prove that the following problem is NP-complete: given an undirected graph G = (V, E) and an integer k, return a clique of size k as well as an independent set of size k, provided both exist.

8.15. Show that the following problem is NP-complete.

    MAXIMUM COMMON SUBGRAPH
    Input: Two graphs $G_1 = (V_1, E_1)$ and $G_2 = (V_2, E_2)$; a budget b.
    Output: Two sets of nodes $V_1' \subseteq V_1$ and $V_2' \subseteq V_2$ whose deletion leaves at least b nodes in each graph, and makes the two graphs identical.

8.16. We are feeling experimental and want to create a new dish.
There are various ingredients we can choose from and we'd like to use as many of them as possible, but some ingredients don't go well with others. If there are n possible ingredients (numbered 1 to n), we write down an $n \times n$ matrix giving the discord between any pair of ingredients. This discord is a real number between 0.0 and 1.0, where 0.0 means "they go together perfectly" and 1.0 means "they really don't go together." Here's an example matrix when there are five possible ingredients.

          1     2     3     4     5
    1    0.0   0.4   0.2   0.9   1.0
    2    0.4   0.0   0.1   1.0   0.2
    3    0.2   0.1   0.0   0.8   0.5
    4    0.9   1.0   0.8   0.0   0.2
    5    1.0   0.2   0.5   0.2   0.0

In this case, ingredients 2 and 3 go together pretty well whereas 1 and 5 clash badly. Notice that this matrix is necessarily symmetric; and that the diagonal entries are always 0.0. Any set of ingredients incurs a penalty which is the sum of all discord values between pairs of ingredients. For instance, the set of ingredients {1, 3, 5} incurs a penalty of 0.2 + 1.0 + 0.5 = 1.7. We want this penalty to be small.

    EXPERIMENTAL CUISINE
    Input: n, the number of ingredients to choose from; D, the $n \times n$ "discord" matrix; some number $p \ge 0$
    Output: The maximum number of ingredients we can choose with penalty $\le p$.

Show that if EXPERIMENTAL CUISINE is solvable in polynomial time, then so is 3SAT.

8.17. Show that for any problem $\Pi$ in NP, there is an algorithm which solves $\Pi$ in time $O(2^{p(n)})$, where n is the size of the input instance and p(n) is a polynomial (which may depend on $\Pi$).

8.18. Show that if P = NP then the RSA cryptosystem (Section 1.4.2) can be broken in polynomial time.

8.19. A kite is a graph on an even number of vertices, say 2n, in which n of the vertices form a clique and the remaining n vertices are connected in a "tail" that consists of a path joined to one of the vertices of the clique. Given a graph and a goal g, the KITE problem asks for a subgraph which is a kite and which contains 2g nodes. Prove that KITE is NP-complete.
8.20. In an undirected graph G = (V, E), we say $D \subseteq V$ is a dominating set if every $v \in V$ is either in D or adjacent to at least one member of D. In the DOMINATING SET problem, the input is a graph and a budget b, and the aim is to find a dominating set in the graph of size at most b, if one exists. Prove that this problem is NP-complete.

8.21. Sequencing by hybridization. One experimental procedure for identifying a new DNA sequence repeatedly probes it to determine which k-mers (substrings of length k) it contains. Based on these, the full sequence must then be reconstructed.

Let's now formulate this as a combinatorial problem. For any string x (the DNA sequence), let $\Gamma(x)$ denote the multiset of all of its k-mers. In particular, $\Gamma(x)$ contains exactly $|x| - k + 1$ elements.

The reconstruction problem is now easy to state: given a multiset of k-length strings, find a string x such that $\Gamma(x)$ is exactly this multiset.

(a) Show that the reconstruction problem reduces to RUDRATA PATH. (Hint: Construct a directed graph with one node for each k-mer, and with an edge from a to b if the last $k - 1$ characters of a match the first $k - 1$ characters of b.)

(b) But in fact, there is much better news. Show that the same problem also reduces to EULER PATH. (Hint: This time, use one directed edge for each k-mer.)

8.22. In task scheduling, it is common to use a graph representation with a node for each task and a directed edge from task i to task j if i is a precondition for j. This directed graph depicts the precedence constraints in the scheduling problem. Clearly, a schedule is possible if and only if the graph is acyclic; if it isn't, we'd like to identify the smallest number of constraints that must be dropped so as to make it acyclic.

Given a directed graph G = (V, E), a subset $E' \subseteq E$ is called a feedback arc set if the removal of edges E' renders G acyclic.

FEEDBACK ARC SET (FAS): Given a directed graph G = (V, E) and a budget b, find a feedback arc set of $\le b$ edges, if one exists.
(a) Show that FAS is in NP.

FAS can be shown to be NP-complete by a reduction from VERTEX COVER. Given an instance (G, b) of VERTEX COVER, where G is an undirected graph and we want a vertex cover of size $\le b$, we construct an instance (G', b) of FAS as follows. If G = (V, E) has n vertices $v_1, \ldots, v_n$, then make G' = (V', E') a directed graph with 2n vertices $w_1, w_1', \ldots, w_n, w_n'$, and $n + 2|E|$ (directed) edges:

• $(w_i, w_i')$ for all $i = 1, 2, \ldots, n$.
• $(w_i', w_j)$ and $(w_j', w_i)$ for every $(v_i, v_j) \in E$.

(b) Show that if G contains a vertex cover of size b, then G' contains a feedback arc set of size b.

(c) Show that if G' contains a feedback arc set of size b, then G contains a vertex cover of size (at most) b. (Hint: given a feedback arc set of size b in G', you may need to first modify it slightly to obtain another one which is of a more convenient form, but is of the same size or smaller. Then, argue that G must contain a vertex cover of the same size as the modified feedback arc set.)

8.23. In the NODE-DISJOINT PATHS problem, the input is an undirected graph in which some vertices have been specially marked: a certain number of "sources" $s_1, s_2, \ldots, s_k$ and an equal number of "destinations" $t_1, t_2, \ldots, t_k$. The goal is to find k node-disjoint paths (that is, paths which have no nodes in common) where the ith path goes from $s_i$ to $t_i$. Show that this problem is NP-complete.

Here is a sequence of progressively stronger hints.

(a) Reduce from 3SAT.

(b) For a 3SAT formula with m clauses and n variables, use k = m + n sources and destinations. Introduce one source/destination pair $(s_x, t_x)$ for each variable x, and one source/destination pair $(s_c, t_c)$ for each clause c.

(c) For each 3SAT clause, introduce 6 new intermediate vertices, one for each literal occurring in that clause and one for its complement.

(d) Notice that if the path from $s_c$ to $t_c$ goes through some intermediate vertex representing, say, an occurrence of variable x, then no other path can go through that vertex.
What vertex would you like the other path to be forced to go through instead?

Chapter 9

Coping with NP-completeness

You are the junior member of a seasoned project team. Your current task is to write code for solving a simple-looking problem involving graphs and numbers. What are you supposed to do?

If you are very lucky, your problem will be among the half-dozen problems concerning graphs with weights (shortest path, minimum spanning tree, maximum flow, etc.), that we have solved in this book. Even if this is the case, recognizing such a problem in its natural habitat, grungy and obscured by reality and context, requires practice and skill. It is more likely that you will need to reduce your problem to one of these lucky ones, or to solve it using dynamic programming or linear programming.

But chances are that nothing like this will happen. The world of search problems is a bleak landscape. There are a few spots of light (brilliant algorithmic ideas), each illuminating a small area around it (the problems that reduce to it; two of these areas, linear and dynamic programming, are in fact decently large). But the remaining vast expanse is pitch dark: NP-complete. What are you to do?

You can start by proving that your problem is actually NP-complete. Often a proof by generalization (recall the discussion on page 270 and Exercise 8.10) is all that you need; and sometimes a simple reduction from 3SAT or ZOE is not too difficult to find. This sounds like a theoretical exercise, but, if carried out successfully, it does bring some tangible rewards: now your status in the team has been elevated, you are no longer the kid who can't do, and you have become the noble knight with the impossible quest.

But, unfortunately, a problem does not go away when proved NP-complete. The real question is, What do you do next?

This is the subject of the present chapter and also the inspiration for some of the most important modern research on algorithms and complexity.
NP-completeness is not a death certificate; it is only the beginning of a fascinating adventure.

Your problem's NP-completeness proof probably constructs graphs that are complicated and weird, very much unlike those that come up in your application. For example, even though SAT is NP-complete, satisfying assignments for HORN SAT (the instances of SAT that come up in logic programming) can be found efficiently (recall Section 5.3). Or, suppose the graphs that arise in your application are trees. In this case, many NP-complete problems, such as INDEPENDENT SET, can be solved in linear time by dynamic programming (recall Section 6.7).

Unfortunately, this approach does not always work. For example, we know that 3SAT is NP-complete. And the INDEPENDENT SET problem, along with many other NP-complete problems, remains so even for planar graphs (graphs that can be drawn in the plane without crossing edges). Moreover, often you cannot neatly characterize the instances that come up in your application. Instead, you will have to rely on some form of intelligent exponential search: procedures such as backtracking and branch and bound, which are exponential time in the worst case, but, with the right design, could be very efficient on typical instances that come up in your application. We discuss these methods in Section 9.1.

Or you can develop an algorithm for your NP-complete optimization problem that falls short of the optimum but never by too much. For example, in Section 5.4 we saw that the greedy algorithm always produces a set cover that is no more than $\log n$ times the optimal set cover. An algorithm that achieves such a guarantee is called an approximation algorithm. As we will see in Section 9.2, such algorithms are known for many NP-complete optimization problems, and they are some of the most clever and sophisticated algorithms around.
And the theory of NP-completeness can again be used as a guide in this endeavor, by showing that, for some problems, there are even severe limits to how well they can be approximated, unless of course P = NP.

Finally, there are heuristics, algorithms with no guarantees on either the running time or the degree of approximation. Heuristics rely on ingenuity, intuition, a good understanding of the application, meticulous experimentation, and often insights from physics or biology, to attack a problem. We see some common kinds in Section 9.3.

9.1 Intelligent exhaustive search

9.1.1 Backtracking

Backtracking is based on the observation that it is often possible to reject a solution by looking at just a small portion of it. For example, if an instance of SAT contains the clause $(x_1 \lor x_2)$, then all assignments with $x_1 = x_2 = 0$ (i.e., false) can be instantly eliminated. To put it differently, by quickly checking and discrediting this partial assignment, we are able to prune a quarter of the entire search space. A promising direction, but can it be systematically exploited?

Here's how it is done. Consider the Boolean formula $\phi(w, x, y, z)$ specified by the set of clauses

$(w \lor x \lor y \lor z),\ (w \lor \overline{x}),\ (x \lor \overline{y}),\ (y \lor \overline{z}),\ (z \lor \overline{w}),\ (\overline{w} \lor \overline{z}).$

We will incrementally grow a tree of partial solutions. We start by branching on any one variable, say w:

[Search tree: the root, labeled "Initial formula $\phi$", has two children, w = 0 and w = 1.]

Plugging w = 0 and w = 1 into $\phi$, we find that no clause is immediately violated and thus neither of these two partial assignments can be eliminated outright. So we need to keep branching. We can expand either of the two available nodes, and on any variable of our choice. Let's try this one:

[Search tree: as before, with the node w = 0 now expanded into two children, x = 0 and x = 1.]

This time, we are in luck. The partial assignment w = 0, x = 1 violates the clause $(w \lor \overline{x})$ and can be terminated, thereby pruning a good chunk of the search space.
We backtrack out of this cul-de-sac and continue our explorations at one of the two remaining active nodes.

In this manner, backtracking explores the space of assignments, growing the tree only at nodes where there is uncertainty about the outcome, and stopping if at any stage a satisfying assignment is encountered.

In the case of Boolean satisfiability, each node of the search tree can be described either by a partial assignment or by the clauses that remain when those values are plugged into the original formula. For instance, if w = 0 and x = 0 then any clause with $\overline{w}$ or $\overline{x}$ is instantly satisfied and any literal w or x is not satisfied and can be removed. What's left is

$(y \lor z),\ (\overline{y}),\ (y \lor \overline{z}).$

Likewise, w = 0 and x = 1 leaves

$(),\ (y \lor \overline{z}),$

with the "empty clause" ( ) ruling out satisfiability. Thus the nodes of the search tree, representing partial assignments, are themselves SAT subproblems.

This alternative representation is helpful for making the two decisions that repeatedly arise: which subproblem to expand next, and which branching variable to use. Since the benefit of backtracking lies in its ability to eliminate portions of the search space, and since this happens only when an empty clause is encountered, it makes sense to choose the subproblem that contains the smallest clause and to then branch on a variable in that clause. If this clause happens to be a singleton, then at least one of the resulting branches will be terminated. (If there is a tie in choosing subproblems, one reasonable policy is to pick the one lowest in the tree, in the hope that it is close to a satisfying assignment.) See Figure 9.1 for the conclusion of our earlier example.

[Figure 9.1: Backtracking reveals that $\phi$ is not satisfiable.]
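The search just described can be sketched as a short recursive solver. Several things here are assumptions made for illustration rather than part of the text: literals are encoded as `(variable, polarity)` pairs, the names `simplify`, `backtrack`, and `PHI` are hypothetical, the recursion explores subproblems depth-first instead of keeping an explicit pool of active nodes, and `PHI` encodes the example formula with negations placed so that they agree with the residual clauses worked out above.

```python
def simplify(clauses, var, value):
    """Plug var = value into a list of clauses (frozensets of literals).

    Clauses containing a literal made true are dropped; literals made
    false are deleted from the clauses that remain.
    """
    remaining = []
    for clause in clauses:
        if (var, value) in clause:
            continue  # clause is satisfied
        remaining.append(clause - {(var, not value)})
    return remaining


def backtrack(clauses, assignment=None):
    """Return a (possibly partial) satisfying assignment, or None."""
    assignment = dict(assignment or {})
    if not clauses:
        return assignment  # success: no clauses left
    if any(len(clause) == 0 for clause in clauses):
        return None  # failure: an empty clause cannot be satisfied
    # Uncertainty: branch on a variable from a smallest clause.
    smallest = min(clauses, key=len)
    var = min(v for v, _ in smallest)
    for value in (False, True):
        result = backtrack(simplify(clauses, var, value),
                           {**assignment, var: value})
        if result is not None:
            return result
    return None


def pos(v):
    return (v, True)


def neg(v):
    return (v, False)


# The example formula, with overbars written as neg(...).
PHI = [
    frozenset({pos("w"), pos("x"), pos("y"), pos("z")}),
    frozenset({pos("w"), neg("x")}),
    frozenset({pos("x"), neg("y")}),
    frozenset({pos("y"), neg("z")}),
    frozenset({pos("z"), neg("w")}),
    frozenset({neg("w"), neg("z")}),
]

print(simplify(simplify(PHI, "w", False), "x", True))
print(backtrack(PHI))
```

Setting w = 0 and x = 1 leaves an empty clause, as in the text. Variables that never get branched on are simply absent from a returned assignment and may take either value.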
More abstractly, a backtracking algorithm requires a test that looks at a subproblem and quickly declares one of three outcomes:

1. Failure: the subproblem has no solution.
2. Success: a solution to the subproblem is found.
3. Uncertainty.

In the case of SAT, this test declares failure if there is an empty clause, success if there are no clauses, and uncertainty otherwise. The backtracking procedure then has the following format.

    Start with some problem P_0
    Let S = {P_0}, the set of active subproblems
    Repeat while S is nonempty:
        choose a subproblem P ∈ S and remove it from S
        expand it into smaller subproblems P_1, P_2, ..., P_k
        For each P_i:
            If test(P_i) succeeds: halt and announce this solution
            If test(P_i) fails: discard P_i
            Otherwise: add P_i to S
    Announce that there is no solution

For SAT, the choose procedure picks a clause, and expand picks a variable within that clause. We have already discussed some reasonable ways of making such choices.

With the right test, expand, and choose routines, backtracking can be remarkably effective in practice. The backtracking algorithm we showed for SAT is the basis of many successful satisfiability programs. Another sign of quality is this: if presented with a 2SAT instance, it will always find a satisfying assignment, if one exists, in polynomial time (Exercise 9.1)!

9.1.2 Branch-and-bound

The same principle can be generalized from search problems such as SAT to optimization problems. For concreteness, let's say we have a minimization problem; maximization will follow the same pattern.

As before, we will deal with partial solutions, each of which represents a subproblem, namely, what is the (cost of the) best way to complete this solution? And as before, we need a basis for eliminating partial solutions, since there is no other source of efficiency in our method. To reject a subproblem, we must be certain that its cost exceeds that of some other solution we have already encountered.
But its exact cost is unknown to us and is generally not efficiently computable. So instead we use a quick lower bound on this cost.

    Start with some problem P_0
    Let S = {P_0}, the set of active subproblems
    bestsofar = ∞
    Repeat while S is nonempty:
        choose a subproblem (partial solution) P ∈ S and remove it from S
        expand it into smaller subproblems P_1, P_2, ..., P_k
        For each P_i:
            If P_i is a complete solution: update bestsofar
            else if lowerbound(P_i) < bestsofar: add P_i to S
    return bestsofar

Let's see how this works for the traveling salesman problem on a graph $G = (V, E)$ with edge lengths $d_e > 0$. A partial solution is a simple path $a \leadsto b$ passing through some vertices $S \subseteq V$, where $S$ includes the endpoints $a$ and $b$. We can denote such a partial solution by the tuple $[a, S, b]$ (in fact, $a$ will be fixed throughout the algorithm). The corresponding subproblem is to find the best completion of the tour, that is, the cheapest complementary path $b \leadsto a$ with intermediate nodes $V - S$. Notice that the initial problem is of the form $[a, \{a\}, a]$ for any $a \in V$ of our choosing.

At each step of the branch-and-bound algorithm, we extend a particular partial solution $[a, S, b]$ by a single edge $(b, x)$, where $x \in V - S$. There can be up to $|V - S|$ ways to do this, and each of these branches leads to a subproblem of the form $[a, S \cup \{x\}, x]$.

How can we lower-bound the cost of completing a partial tour $[a, S, b]$? Many sophisticated methods have been developed for this, but let's look at a rather simple one. The remainder of the tour consists of a path through $V - S$, plus edges from $a$ and $b$ to $V - S$. Therefore, its cost is at least the sum of the following:

1. The lightest edge from $a$ to $V - S$.
2. The lightest edge from $b$ to $V - S$.
3. The minimum spanning tree of $V - S$.

(Do you see why?) And this lower bound can be computed quickly by a minimum spanning tree algorithm. Figure 9.2 runs through an example: each node of the tree represents a partial tour (specifically, the path from the root to that node) that at some stage is considered by the branch-and-bound procedure. Notice how just 28 partial solutions are considered, instead of the 7!
= 5,040 that would arise in a brute-force search.

Figure 9.2 (a) A graph and its optimal traveling salesman tour. (b) The branch-and-bound search tree, explored left to right. Boxed numbers indicate lower bounds on cost. [Figure: panel (a) shows a weighted graph on vertices A through H next to its optimal tour; panel (b) shows the search tree rooted at A, in which the complete tours found have cost 11 and cost 8.]

9.2 Approximation algorithms

In an optimization problem we are given an instance $I$ and are asked to find the optimum solution: the one with the maximum gain if we have a maximization problem like INDEPENDENT SET, or the minimum cost if we are dealing with a minimization problem such as the TSP. For every instance $I$, let us denote by $\mathrm{OPT}(I)$ the value (benefit or cost) of the optimum solution. It makes the math a little simpler (and is not too far from the truth) to assume that $\mathrm{OPT}(I)$ is always a positive integer.

We have already seen an example of a (famous) approximation algorithm in Section 5.4: the greedy scheme for SET COVER. For any instance $I$ of size $n$, we showed that this greedy algorithm is guaranteed to quickly find a set cover of cardinality at most $\mathrm{OPT}(I) \log n$. This $\log n$ factor is known as the approximation guarantee of the algorithm.

More generally, consider any minimization problem. Suppose now that we have an algorithm $\mathcal{A}$ for our problem which, given an instance $I$, returns a solution with value $\mathcal{A}(I)$. The approximation ratio of algorithm $\mathcal{A}$ is defined to be

$$\alpha_{\mathcal{A}} = \max_I \frac{\mathcal{A}(I)}{\mathrm{OPT}(I)}.$$

In other words, $\alpha_{\mathcal{A}}$ measures the factor by which the output of algorithm $\mathcal{A}$ exceeds the optimal solution, on the worst-case input. The approximation ratio can also be defined for maximization problems, such as INDEPENDENT SET, in the same way, except that to get a number larger than 1 we take the reciprocal.
So, when faced with an NP-complete optimization problem, a reasonable goal is to look for an approximation algorithm $\mathcal{A}$ whose $\alpha_{\mathcal{A}}$ is as small as possible. But this kind of guarantee might seem a little puzzling: How can we come close to the optimum if we cannot determine the optimum? Let's look at a simple example.

9.2.1 Vertex cover

We already know the VERTEX COVER problem is NP-hard.

VERTEX COVER
Input: An undirected graph $G = (V, E)$.
Output: A subset of the vertices $S \subseteq V$ that touches every edge.
Goal: Minimize $|S|$.

See Figure 9.3 for an example.

Figure 9.3 A graph whose optimal vertex cover, shown shaded, is of size 8.

Since VERTEX COVER is a special case of SET COVER, we know from Chapter 5 that it can be approximated within a factor of $O(\log n)$ by the greedy algorithm: repeatedly delete the vertex of highest degree and include it in the vertex cover. And there are graphs on which the greedy algorithm returns a vertex cover that is indeed $\log n$ times the optimum.

A better approximation algorithm for VERTEX COVER is based on the notion of a matching, a subset of edges that have no vertices in common (Figure 9.4). A matching is maximal if no more edges can be added to it. Maximal matchings will help us find good vertex covers, and moreover, they are easy to generate: repeatedly pick edges that are disjoint from the ones chosen already, until this is no longer possible.

Figure 9.4 (a) A matching, (b) its completion to a maximal matching, and (c) the resulting vertex cover.

What is the relationship between matchings and vertex covers? Here is the crucial fact: any vertex cover of a graph $G$ must be at least as large as the number of edges in any matching in $G$; that is, any matching provides a lower bound on OPT. This is simply because each edge of the matching must be covered by one of its endpoints in any vertex cover!
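The greedy recipe for a maximal matching, and the lower bound it gives on OPT, can be checked on a tiny graph. The following Python sketch is only an illustration: the function names `maximal_matching` and `min_vertex_cover_size` are hypothetical, edges are assumed to be pairs of distinct integers, and the exhaustive search for the optimal cover takes exponential time, so it is there purely to verify the inequality on small inputs.

```python
from itertools import combinations
from typing import List, Set, Tuple

Edge = Tuple[int, int]


def maximal_matching(edges: List[Edge]) -> List[Edge]:
    """Greedily pick edges that share no endpoint with the edges picked so far."""
    matched: Set[int] = set()
    matching = []
    for u, v in edges:
        if u not in matched and v not in matched:
            matching.append((u, v))
            matched.update((u, v))
    return matching


def min_vertex_cover_size(vertices: List[int], edges: List[Edge]) -> int:
    """Exhaustive search over vertex subsets, smallest first; only for tiny graphs."""
    for size in range(len(vertices) + 1):
        for subset in combinations(vertices, size):
            chosen = set(subset)
            if all(u in chosen or v in chosen for u, v in edges):
                return size
    return len(vertices)


# A path on four vertices: 0 - 1 - 2 - 3.
vertices = [0, 1, 2, 3]
edges = [(0, 1), (1, 2), (2, 3)]
matching = maximal_matching(edges)
print(matching)                                # [(0, 1), (2, 3)]
print(min_vertex_cover_size(vertices, edges))  # 2
```

On this path the matching has two edges and the optimal cover has two vertices, so the lower bound happens to be tight here; in general it need not be.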
Finding such a lower bound is a key step in designing an approximation algorithm, because we must compare the quality of the solution found by our algorithm to OPT, which is NP-complete to compute.

One more observation completes the design of our approximation algorithm: let $S$ be a set that contains both endpoints of each edge in a maximal matching $M$. Then $S$ must be a vertex cover: if it isn't, that is, if it doesn't touch some edge $e \in E$, then $M$ could not possibly be maximal since we could still add $e$ to it. But our cover $S$ has $2|M|$ vertices. And from the previous paragraph we know that any vertex cover must have size at least $|M|$. So we're done.

Here's the algorithm for VERTEX COVER.

    Find a maximal matching M ⊆ E
    Return S = {all endpoints of edges in M}

This simple procedure always returns a vertex cover whose size is at most twice optimal!

In summary, even though we have no way of finding the best vertex cover, we can easily find another structure, a maximal matching, with two key properties:

1. Its size gives us a lower bound on the optimal vertex cover.
2. It can be used to build a vertex cover, whose size can be related to that of the optimal cover using property 1.

Thus, this simple algorithm has an approximation ratio of $\alpha_{\mathcal{A}} \le 2$. In fact, it is not hard to find examples on which it does make a 100% error; hence $\alpha_{\mathcal{A}} = 2$.

9.2.2 Clustering

We turn next to a clustering problem, in which we have some data (text documents, say, or images, or speech samples) that we want to divide into groups. It is often useful to define distances between these data points, numbers that capture how close or far they are from one another. Often the data are true points in some high-dimensional space and the distances are the usual Euclidean distance; in other cases, the distances are the result of some similarity tests to which we have subjected the data points. Assume that we have such distances and that they satisfy the usual metric properties:

1. $d(x, y) \ge 0$ for all $x, y$.
2.
$d(x, y) = 0$ if and only if $x = y$.
3. $d(x, y) = d(y, x)$.
4. (Triangle inequality) $d(x, y) \le d(x, z) + d(z, y)$.

We would like to partition the data points into groups that are compact in the sense of having small diameter.

k-CLUSTER
Input: Points $X = \{x_1, \ldots, x_n\}$ with underlying distance metric $d(\cdot, \cdot)$; integer $k$.
Output: A partition of the points into $k$ clusters $C_1, \ldots, C_k$.
Goal: Minimize the diameter of the clusters,

$$\max_j \max_{x_a, x_b \in C_j} d(x_a, x_b).$$

One way to visualize this task is to imagine $n$ points in space, which are to be covered by $k$ spheres of equal size. What is the smallest possible diameter of the spheres? Figure 9.5 shows an example.

Figure 9.5 Some data points and the optimal $k = 4$ clusters.

This problem is NP-hard, but has a very simple approximation algorithm. The idea is to pick $k$ of the data points as cluster centers and to then assign each of the remaining points to the center closest to it, thus creating $k$ clusters. The centers are picked one at a time, using an intuitive rule: always pick the next center to be as far as possible from the centers chosen so far (see Figure 9.6).

Figure 9.6 (a) Four centers chosen by farthest-first traversal. (b) The resulting clusters.

    Pick any point μ_1 ∈ X as the first cluster center
    for i = 2 to k:
        Let μ_i be the point in X that is farthest from μ_1, ..., μ_{i-1}
        (i.e., that maximizes min_{j<i} d(·, μ_j))
    [...]

[...] $C > 0$ produces an instance $I(G, C)$ of the TSP such that:

(i) If $G$ has a Rudrata path, then $\mathrm{OPT}(I(G, C)) = n$, the number of vertices in $G$.
(ii) If $G$ has no Rudrata path, then $\mathrm{OPT}(I(G, C)) \ge n + C$.

This means that even an approximate solution to TSP would enable us to solve RUDRATA PATH! Let's work out the details.

Consider an approximation algorithm $\mathcal{A}$ for TSP and let $\alpha_{\mathcal{A}}$ denote its approximation ratio. From any instance $G$ of RUDRATA PATH, we will create an instance $I(G, C)$ of TSP using the specific constant $C = n \cdot \alpha_{\mathcal{A}}$. What happens when algorithm $\mathcal{A}$ is run on this TSP instance?
In case (i), it must output a tour of length at most $\alpha_{\mathcal{A}} \cdot \mathrm{OPT}(I(G, C)) = n \cdot \alpha_{\mathcal{A}}$, whereas in case (ii) it must output a tour of length at least $\mathrm{OPT}(I(G, C)) > n \cdot \alpha_{\mathcal{A}}$. Thus we can figure out whether $G$ has a Rudrata path! Here is the resulting procedure:

    Given any graph G:
        compute I(G, C) (with C = n · α_A) and run algorithm A on it
        if the resulting tour has length ≤ n · α_A:
            conclude that G has a Rudrata path
        else:
            conclude that G has no Rudrata path

This tells us whether or not $G$ has a Rudrata path; by calling the procedure a polynomial number of times, we can find the actual path (Exercise 8.2).

We've shown that if TSP has a polynomial-time approximation algorithm, then there is a polynomial algorithm for the NP-complete RUDRATA PATH problem. So, unless P = NP, there cannot exist an efficient approximation algorithm for the TSP.

9.2.4 Knapsack

Our last approximation algorithm is for a maximization problem and has a very impressive guarantee: given any $\epsilon > 0$, it will return a solution of value at least $(1 - \epsilon)$ times the optimal value, in time that scales only polynomially in the input size and in $1/\epsilon$.

The problem is KNAPSACK, which we first encountered in Chapter 6. There are $n$ items, with weights $w_1, \ldots, w_n$ and values $v_1, \ldots, v_n$ (all positive integers), and the goal is to pick the most valuable combination of items subject to the constraint that their total weight is at most $W$.

Earlier we saw a dynamic programming solution to this problem with running time $O(nW)$. Using a similar technique, a running time of $O(nV)$ can also be achieved, where $V$ is the sum of the values. Neither of these running times is polynomial, because $W$ and $V$ can be very large, exponential in the size of the input.

Let's consider the $O(nV)$ algorithm. In the bad case when $V$ is large, what if we simply scale down all the values in some way? For instance, if

$$v_1 = 117{,}586{,}003, \quad v_2 = 738{,}493{,}291, \quad v_3 = 238{,}827{,}453,$$

we could simply knock off some precision and instead use 117, 738, and 238.
This doesn't change the problem all that much and will make the algorithm much, much faster!

Now for the details. Along with the input, the user is assumed to have specified some approximation factor $\epsilon > 0$.

    Discard any item with weight > W
    Let v_max = max_i v_i
    Rescale values v̂_i = ⌊v_i · n / (ε · v_max)⌋
    Run the dynamic programming algorithm with values {v̂_i}
    Output the resulting choice of items

Let's see why this works. First of all, since the rescaled values $\hat{v}_i$ are all at most $n/\epsilon$, the dynamic program is efficient, running in time $O(n^3/\epsilon)$.

Now suppose the optimal solution to the original problem is to pick some subset of items $S$, with total value $K^*$. The rescaled value of this same assignment is

$$\sum_{i \in S} \hat{v}_i = \sum_{i \in S} \left\lfloor v_i \cdot \frac{n}{\epsilon v_{\max}} \right\rfloor \ge \sum_{i \in S} \left( v_i \cdot \frac{n}{\epsilon v_{\max}} - 1 \right) \ge K^* \cdot \frac{n}{\epsilon v_{\max}} - n.$$

Therefore, the optimal assignment for the shrunken problem, call it $\hat{S}$, has a rescaled value of at least this much. In terms of the original values, assignment $\hat{S}$ has a value of at least

$$\sum_{i \in \hat{S}} v_i \ge \sum_{i \in \hat{S}} \hat{v}_i \cdot \frac{\epsilon v_{\max}}{n} \ge \left( K^* \cdot \frac{n}{\epsilon v_{\max}} - n \right) \cdot \frac{\epsilon v_{\max}}{n} = K^* - \epsilon v_{\max} \ge K^* (1 - \epsilon).$$

9.2.5 The approximability hierarchy

Given any NP-complete optimization problem, we seek the best approximation algorithm possible. Failing this, we try to prove lower bounds on the approximation ratios that are achievable in polynomial time (we just carried out such a proof for the general TSP). All told, NP-complete optimization problems are classified as follows:

• Those for which, like the TSP, no finite approximation ratio is possible.
• Those for which an approximation ratio is possible, but there are limits to how small this can be. VERTEX COVER, k-CLUSTER, and the TSP with triangle inequality belong here. (For these problems we have not established limits to their approximability, but these limits do exist, and their proofs constitute some of the most sophisticated results in this field.)
• Down below we have a more fortunate class of NP-complete problems for which approximability has no limits, and polynomial approximation algorithms with error ratios arbitrarily close to zero exist. KNAPSACK resides here.
• Finally, there is another class of problems, between the first two given here, for which the approximation ratio is about $\log n$. SET COVER is an example.

(A humbling reminder: All this is contingent upon the assumption P ≠ NP. Failing this, this hierarchy collapses down to P, and all NP-complete optimization problems can be solved exactly in polynomial time.)

A final point on approximation algorithms: often these algorithms, or their variants, perform much better on typical instances than their worst-case approximation ratio would have you believe.

9.3 Local search heuristics

Our next strategy for coping with NP-completeness is inspired by evolution (which is, after all, the world's best-tested optimization procedure), specifically by its incremental process of introducing small mutations, trying them out, and keeping them if they work well. This paradigm is called local search and can be applied to any optimization task. Here's how it looks for a minimization problem.

    let s be any initial solution
    while there is some solution s' in the neighborhood of s
          for which cost(s') < cost(s) [...]

    [...] w(S) do:
        set S = S'

(a) Show that this is an approximation algorithm for MAX CUT with ratio 2.
(b) But is it a polynomial-time algorithm?

9.10. Let us call a local search algorithm exact when it always produces the optimum solution. For example, the local search algorithm for the minimum spanning tree problem introduced in Problem 9.5 is exact. For another example, simplex can be considered an exact local search algorithm for linear programming.

(a) Show that the 2-change local search algorithm for the TSP is not exact.
(b) Repeat for the $\lceil n/2 \rceil$-change local search algorithm, where $n$ is the number of cities.
(c) Show that the $(n - 1)$-change local search algorithm is exact.
(d) If $A$ is an optimization problem, define $A$-IMPROVEMENT to be the following search problem: Given an instance $x$ of $A$ and a solution $s$ of $A$, find another solution of $x$ with better cost (or report that none exists, and thus $s$ is optimum). For example, in TSP IMPROVEMENT we are given a distance matrix and a tour, and we are asked to find a better tour. It turns out that TSP IMPROVEMENT is NP-complete, and so is SET COVER IMPROVEMENT. (Can you prove this?)
(e) We say that a local search algorithm has polynomial iteration if each execution of the loop requires polynomial time. For example, the obvious implementations of the $(n - 1)$-change local search algorithm for the TSP defined above do not have polynomial iteration. Show that, unless P = NP, there is no exact local search algorithm with polynomial iteration for the TSP and SET COVER problems.

Chapter 10 Quantum algorithms

This book started with the world's oldest and most widely used algorithms (the ones for adding and multiplying numbers) and an ancient hard problem (FACTORING). In this last chapter the tables are turned: we present one of the latest algorithms, and it is an efficient algorithm for FACTORING! There is a catch, of course: this algorithm needs a quantum computer to execute.

Quantum physics is a beautiful and mysterious theory that describes Nature in the small, at the level of elementary particles. One of the major discoveries of the nineties was that quantum computers (computers based on quantum physics principles) are radically different from those that operate according to the more familiar principles of classical physics. Surprisingly, they can be exponentially more powerful: as we shall see, quantum computers can solve FACTORING in polynomial time! As a result, in a world with quantum computers, the systems that currently safeguard business transactions on the Internet (and are based on the RSA cryptosystem) will no longer be secure.
10.1 Qubits, superposition, and measurement

In this section we introduce the basic features of quantum physics that are necessary for understanding how quantum computers work.¹

In ordinary computer chips, bits are physically represented by low and high voltages on wires. But there are many other ways a bit could be stored, for instance, in the state of a hydrogen atom. The single electron in this atom can either be in the ground state (the lowest energy configuration) or it can be in an excited state (a high energy configuration). We can use these two states to encode for bit values 0 and 1, respectively.

Let us now introduce some quantum physics notation. We denote the ground state of our electron by $|0\rangle$, since it encodes for bit value 0, and likewise the excited state by $|1\rangle$. These are the two possible states of the electron in classical physics.

Figure 10.1 An electron can be in a ground state or in an excited state. In the Dirac notation used in quantum physics, these are denoted $|0\rangle$ and $|1\rangle$. But the superposition principle says that, in fact, the electron is in a state that is a linear combination of these two: $\alpha_0|0\rangle + \alpha_1|1\rangle$. This would make immediate sense if the $\alpha$'s were probabilities, nonnegative real numbers adding to 1. But the superposition principle insists that they can be arbitrary complex numbers, as long as the squares of their norms add up to 1! [Figure: the ground state $|0\rangle$, the excited state $|1\rangle$, and the superposition $\alpha_0|0\rangle + \alpha_1|1\rangle$.]

¹ This field is so strange that the famous physicist Richard Feynman is quoted as having said, "I think I can safely say that no one understands quantum physics." So there is little chance you will understand the theory in depth after reading this section! But if you are interested in learning more, see the recommended reading at the book's end.
Many of the most counterintuitive aspects of quantum physics arise from the superposition principle, which states that if a quantum system can be in one of two states, then it can also be in any linear superposition of those two states. For instance, the state of the electron could well be $\frac{1}{\sqrt{2}}|0\rangle + \frac{1}{\sqrt{2}}|1\rangle$ or $\frac{1}{\sqrt{2}}|0\rangle - \frac{1}{\sqrt{2}}|1\rangle$, or an infinite number of other combinations of the form $\alpha_0|0\rangle + \alpha_1|1\rangle$. The coefficient $\alpha_0$ is called the amplitude of state $|0\rangle$, and similarly with $\alpha_1$. And, if things aren't already strange enough, the $\alpha$'s can be complex numbers, as long as they are normalized so that $|\alpha_0|^2 + |\alpha_1|^2 = 1$. For example, $\frac{1}{\sqrt{5}}|0\rangle + \frac{2i}{\sqrt{5}}|1\rangle$ (where $i$ is the imaginary unit, $\sqrt{-1}$) is a perfectly valid quantum state! Such a superposition, $\alpha_0|0\rangle + \alpha_1|1\rangle$, is the basic unit of encoded information in quantum computers (Figure 10.1). It is called a qubit (pronounced "cubit").

The whole concept of a superposition suggests that the electron does not make up its mind about whether it is in the ground or excited state, and the amplitude $\alpha_0$ is a measure of its inclination toward the ground state. Continuing along this line of thought, it is tempting to think of $\alpha_0$ as the probability that the electron is in the ground state. But then how are we to make sense of the fact that $\alpha_0$ can be negative, or even worse, imaginary? This is one of the most mysterious aspects of quantum physics, one that seems to extend beyond our intuitions about the physical world.

This linear superposition, however, is the private world of the electron. For us to get a glimpse of the electron's state we must make a measurement, and when we do so, we get a single bit of information: 0 or 1. If the state of the electron is $\alpha_0|0\rangle + \alpha_1|1\rangle$, then the outcome of the measurement is 0 with probability $|\alpha_0|^2$ and 1 with probability $|\alpha_1|^2$ (luckily we normalized so $|\alpha_0|^2 + |\alpha_1|^2 = 1$).
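The measurement rule is easy to try out numerically. The Python sketch below is an ordinary classical simulation, offered only as an illustration: the function names `outcome_probabilities` and `measure` are hypothetical, amplitudes are represented as Python complex numbers, and the fixed random seed is there solely to make the run repeatable.

```python
import math
import random
from typing import Tuple


def outcome_probabilities(a0: complex, a1: complex) -> Tuple[float, float]:
    """Return (Pr[0], Pr[1]) for the qubit a0|0> + a1|1>."""
    p0, p1 = abs(a0) ** 2, abs(a1) ** 2
    if not math.isclose(p0 + p1, 1.0):
        raise ValueError("amplitudes must satisfy |a0|^2 + |a1|^2 = 1")
    return p0, p1


def measure(a0: complex, a1: complex, rng: random.Random) -> int:
    """Sample one measurement outcome, 0 or 1."""
    p0, _ = outcome_probabilities(a0, a1)
    return 0 if rng.random() < p0 else 1


# The state (1/sqrt 5)|0> + (2i/sqrt 5)|1> from the text.
a0, a1 = 1 / math.sqrt(5), 2j / math.sqrt(5)
print(outcome_probabilities(a0, a1))  # approximately (0.2, 0.8)

rng = random.Random(0)                # fixed seed only for repeatability
samples = [measure(a0, a1, rng) for _ in range(10_000)]
print(sum(samples) / len(samples))    # fraction of 1s, close to 0.8
```

Note that the imaginary amplitude causes no difficulty: only the squared magnitudes $|\alpha_0|^2$ and $|\alpha_1|^2$ enter the outcome probabilities.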
Moreover, the act of measurement causes the system to change its state: if the outcome of the measurement is 0, then the new state of the system is $|0\rangle$ (the ground state), and if the outcome is 1, the new state is $|1\rangle$ (the excited state). This feature of quantum physics, that a measurement disturbs the system and forces it to choose (in this case ground or excited state), is another strange phenomenon with no classical analog.

Figure 10.2 Measurement of a superposition has the effect of forcing the system to decide on a particular state, with probabilities determined by the amplitudes. [Figure: the state $\alpha_0|0\rangle + \alpha_1|1\rangle$ becomes state $|0\rangle$ with probability $|\alpha_0|^2$ and state $|1\rangle$ with probability $|\alpha_1|^2$.]

The superposition principle holds not just for 2-level systems like the one we just described, but in general for $k$-level systems. For example, in reality the electron in the hydrogen atom can be in one of many energy levels, starting with the ground state, the first excited state, the second excited state, and so on. So we could consider a $k$-level system consisting of the ground state and the first $k - 1$ excited states, and we could denote these by $|0\rangle, |1\rangle, |2\rangle, \ldots, |k-1\rangle$. The superposition principle would then say that the general quantum state of the system is $\alpha_0|0\rangle + \alpha_1|1\rangle + \cdots + \alpha_{k-1}|k-1\rangle$, where $\sum_{j=0}^{k-1} |\alpha_j|^2 = 1$. Measuring the state of the system would now reveal a number between 0 and $k - 1$, and outcome $j$ would occur with probability $|\alpha_j|^2$. As before, the measurement would disturb the system, and the new state would actually become $|j\rangle$ or the $j$th excited state.

How do we encode $n$ bits of information? We could choose $k = 2^n$ levels of the hydrogen atom. But a more promising option is to use $n$ qubits.

Let us start by considering the case of two qubits, that is, the state of the electrons of two hydrogen atoms.
Since each electron can be in either the ground or excited state, in classical physics the two electrons have a total of four possible states (00, 01, 10, or 11) and are therefore suitable for storing 2 bits of information. But in quantum physics, the superposition principle tells us that the quantum state of the two electrons is a linear combination of the four classical states,

$$|\alpha\rangle = \alpha_{00}|00\rangle + \alpha_{01}|01\rangle + \alpha_{10}|10\rangle + \alpha_{11}|11\rangle,$$

normalized so that $\sum_{x \in \{0,1\}^2} |\alpha_x|^2 = 1$.² Measuring the state of the system now reveals 2 bits of information, and the probability of outcome $x \in \{0,1\}^2$ is $|\alpha_x|^2$. Moreover, as before, if the outcome of measurement is $jk$, then the new state of the system is $|jk\rangle$: if $jk = 10$, for example, then the first electron is in the excited state and the second electron is in the ground state.

² Recall that $\{0,1\}^2$ denotes the set consisting of the four 2-bit binary strings and in general $\{0,1\}^n$ denotes the set of all $n$-bit binary strings.

Entanglement
Suppose we have two qubits, the first in the state $\alpha_0|0\rangle + \alpha_1|1\rangle$ and the second in the state $\beta_0|0\rangle + \beta_1|1\rangle$. What is the joint state of the two qubits? The answer is, the (tensor) product of the two: $\alpha_0\beta_0|00\rangle + \alpha_0\beta_1|01\rangle + \alpha_1\beta_0|10\rangle + \alpha_1\beta_1|11\rangle$.
Given an arbitrary state of two qubits, can we specify the state of each individual qubit in this way? No, in general the two qubits are entangled and cannot be decomposed into the states of the individual qubits. For example, consider the state $|\psi\rangle = \frac{1}{\sqrt{2}}|00\rangle + \frac{1}{\sqrt{2}}|11\rangle$, which is one of the famous Bell states. It cannot be decomposed into states of the two individual qubits (see Exercise 10.1). Entanglement is one of the most mysterious aspects of quantum mechanics and is ultimately the source of the power of quantum computation.

An interesting question comes up here: what if we make a partial measurement? For instance, if we measure just the first qubit, what is the probability that the outcome is 0? This is simple.
It is exactly the same as it would have been had we measured both qubits, namely, $\Pr\{\text{1st bit} = 0\} = \Pr\{00\} + \Pr\{01\} = |\alpha_{00}|^2 + |\alpha_{01}|^2$. Fine, but how much does this partial measurement disturb the state of the system?

The answer is elegant. If the outcome of measuring the first qubit is 0, then the new superposition is obtained by crossing out all terms of $|\alpha\rangle$ that are inconsistent with this outcome (that is, whose first bit is 1). Of course the sum of the squares of the amplitudes is no longer 1, so we must renormalize. In our example, this new state would be

$$|\alpha_{\text{new}}\rangle = \frac{\alpha_{00}}{\sqrt{|\alpha_{00}|^2 + |\alpha_{01}|^2}}|00\rangle + \frac{\alpha_{01}}{\sqrt{|\alpha_{00}|^2 + |\alpha_{01}|^2}}|01\rangle.$$

Finally, let us consider the general case of $n$ hydrogen atoms. Think of $n$ as a fairly small number of atoms, say $n = 500$. Classically the states of the 500 electrons could be used to store 500 bits of information in the obvious way. But the quantum state of the 500 qubits is a linear superposition of all $2^{500}$ possible classical states:

$$\sum_{x \in \{0,1\}^n} \alpha_x |x\rangle.$$

It is as if Nature has $2^{500}$ scraps of paper on the side, each with a complex number written on it, just to keep track of the state of this system of 500 hydrogen atoms! Moreover, at each moment, as the state of the system evolves in time, it is as though Nature crosses out the complex number on each scrap of paper and replaces it with its new value.

Let us consider the effort involved in doing all this. The number $2^{500}$ is much larger than estimates of the number of elementary particles in the universe. Where, then, does Nature store this information? How could microscopic quantum systems of a few hundred atoms contain more information than we can possibly store in the entire classical universe? Surely this is a most extravagant theory about the amount of effort put in by Nature just to keep a tiny system evolving in time.

Figure 10.3 A quantum algorithm takes $n$ classical bits as its input, manipulates them so as to create a superposition of their $2^n$ possible states, manipulates this exponentially large superposition to obtain the final quantum result, and then measures the result to get (with the appropriate probability distribution) the $n$ output bits. For the middle phase, there are elementary operations which count as one step and yet manipulate all the exponentially many amplitudes of the superposition. [Figure: an $n$-bit input string $x$ is turned into an exponential superposition, which is then measured to give an $n$-bit output string $y$.]

In this phenomenon lies the basic motivation for quantum computation. After all, if Nature is so extravagant at the quantum level, why should we base our computers on classical physics? Why not tap into this massive amount of effort being expended at the quantum level?

But there is a fundamental problem: this exponentially large linear superposition is the private world of the electrons. Measuring the system only reveals $n$ bits of information. As before, the probability that the outcome is a particular 500-bit string $x$ is $|\alpha_x|^2$. And the new state after measurement is just $|x\rangle$.

10.2 The plan

A quantum algorithm is unlike any you have seen so far. Its structure reflects the tension between the exponential private workspace of an $n$-qubit system and the mere $n$ bits that can be obtained through measurement.

The input to a quantum algorithm consists of $n$ classical bits, and the output also consists of $n$ classical bits. It is while the quantum system is not being watched that the quantum effects take over and we have the benefit of Nature working exponentially hard on our behalf.

If the input is an $n$-bit string $x$, then the quantum computer takes as input $n$ qubits in state $|x\rangle$. Then a series of quantum operations are performed, by the end of which the state of the $n$ qubits has been transformed to some superposition $\sum_y \alpha_y |y\rangle$. Finally, a measurement is made, and the output is the $n$-bit string $y$ with probability $|\alpha_y|^2$. Observe that this output is random.
But this is not a problem, as we have seen before with randomized algorithms such as the one for primality testing. As long as $y$ corresponds to the right answer with high enough probability, we can repeat the whole process a few times to make the chance of failure minuscule.

Now let us look more closely at the quantum part of the algorithm. Some of the key quantum operations (which we will soon discuss) can be thought of as looking for certain kinds of patterns in a superposition of states. Because of this, it is helpful to think of the algorithm as having two stages. In the first stage, the $n$ classical bits of the input are unpacked into an exponentially large superposition, which is expressly set up so as to have an underlying pattern or regularity that, if detected, would solve the task at hand. The second stage then consists of a suitable set of quantum operations, followed by a measurement, which reveals the hidden pattern.

All this probably sounds quite mysterious at the moment, but more details are on the way. In Section 10.3 we will give a high-level description of the most important operation that can be efficiently performed by a quantum computer: a quantum version of the fast Fourier transform (FFT). We will then describe certain patterns that this quantum FFT is ideally suited to detect, and will show how to recast the problem of factoring an integer $N$ in terms of detecting precisely such a pattern. Finally we will see how to set up the initial stage of the quantum algorithm, which converts the input $N$ into an exponentially large superposition with the right kind of pattern.

The algorithm to factor a large integer $N$ can be viewed as a sequence of reductions (and everything shown here in italics will be defined in good time):

• FACTORING is reduced to finding a nontrivial square root of 1 modulo $N$.
• Finding such a root is reduced to computing the order of a random integer modulo $N$.
The order of an integer is precisely the period of a particular periodic superposition. Finally, periods of superpositions can be found by the quantum FFT.

We begin with the last step.

10.3 The quantum Fourier transform

Recall the fast Fourier transform (FFT) from Chapter 2. It takes as input an $M$-dimensional, complex-valued vector $\boldsymbol{\alpha}$ (where $M$ is a power of 2, say $M = 2^m$), and outputs an $M$-dimensional complex-valued vector $\boldsymbol{\beta}$:

$$
\begin{bmatrix} \beta_0 \\ \beta_1 \\ \beta_2 \\ \vdots \\ \beta_{M-1} \end{bmatrix}
= \frac{1}{\sqrt{M}}
\begin{bmatrix}
1 & 1 & 1 & \cdots & 1 \\
1 & \omega & \omega^2 & \cdots & \omega^{M-1} \\
1 & \omega^2 & \omega^4 & \cdots & \omega^{2(M-1)} \\
\vdots & & & & \vdots \\
1 & \omega^j & \omega^{2j} & \cdots & \omega^{(M-1)j} \\
\vdots & & & & \vdots \\
1 & \omega^{M-1} & \omega^{2(M-1)} & \cdots & \omega^{(M-1)(M-1)}
\end{bmatrix}
\begin{bmatrix} \alpha_0 \\ \alpha_1 \\ \alpha_2 \\ \vdots \\ \alpha_{M-1} \end{bmatrix},
$$

where $\omega$ is a complex $M$th root of unity (the extra factor of $1/\sqrt{M}$ is new and has the effect of ensuring that if the $|\alpha_i|^2$ add up to 1, then so do the $|\beta_i|^2$). Although the preceding equation suggests an $O(M^2)$ algorithm, the classical FFT is able to perform this calculation in just $O(M \log M)$ steps, and it is this speedup that has had the profound effect of making digital signal processing practically feasible. We will now see that quantum computers can implement the FFT exponentially faster, in $O(\log^2 M)$ time!

But wait, how can any algorithm take time less than $M$, the length of the input? The point is that we can encode the input in a superposition of just $m = \log M$ qubits: after all, this superposition consists of $2^m$ amplitude values. In the notation we introduced earlier, we would write the superposition as $|\boldsymbol{\alpha}\rangle = \sum_{j=0}^{M-1} \alpha_j |j\rangle$, where $\alpha_i$ is the amplitude of the $m$-bit binary string corresponding to the number $i$ in the natural way. This brings up an important point: the $|j\rangle$ notation is really just another way of writing a vector, where the index of each entry of the vector is written out explicitly in the special bracket symbol.

Starting from this input superposition $|\boldsymbol{\alpha}\rangle$, the quantum Fourier transform (QFT) manipulates it appropriately in $m = \log M$ stages.
At each stage the superposition evolves so that it encodes the intermediate results at the same stage of the classical FFT (whose circuit, with $m = \log M$ stages, is reproduced from Chapter 2 in Figure 10.4). As we will see in Section 10.5, this can be achieved with $m$ quantum operations per stage. Ultimately, after $m$ such stages and $m^2 = \log^2 M$ elementary operations, we obtain the superposition $|\boldsymbol{\beta}\rangle$ that corresponds to the desired output of the QFT.

Figure 10.4 The classical FFT circuit from Chapter 2. Input vectors of $M$ bits are processed in a sequence of $m = \log M$ levels. [Circuit diagram not reproduced.]

So far we have only considered the good news about the QFT: its amazing speed. Now it is time to read the fine print. The classical FFT algorithm actually outputs the $M$ complex numbers $\beta_0, \ldots, \beta_{M-1}$. In contrast, the QFT only prepares a superposition $|\boldsymbol{\beta}\rangle = \sum_{j=0}^{M-1} \beta_j |j\rangle$. And, as we saw earlier, these amplitudes are part of the private world of this quantum system.

Thus the only way to get our hands on this result is by measuring it! And measuring the state of the system only yields $m = \log M$ classical bits: specifically, the output is index $j$ with probability $|\beta_j|^2$.

So, instead of QFT, it would be more accurate to call this algorithm quantum Fourier sampling. Moreover, even though we have confined our attention to the case $M = 2^m$ in this section, the algorithm can be implemented for arbitrary values of $M$, and can be summarized as follows:

Input: A superposition of $m = \log M$ qubits, $|\boldsymbol{\alpha}\rangle = \sum_{j=0}^{M-1} \alpha_j |j\rangle$.
Method: Using $O(m^2) = O(\log^2 M)$ quantum operations, perform the quantum FFT to obtain the superposition $|\boldsymbol{\beta}\rangle = \sum_{j=0}^{M-1} \beta_j |j\rangle$.
Output: A random $m$-bit number $j$ (that is, $0 \le j \le M-1$), from the probability distribution $\Pr[j] = |\beta_j|^2$.
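The input/method/output summary can be mimicked classically on small vectors. The following is only an illustrative sketch, not a quantum implementation: it computes the normalized transform by the direct $O(M^2)$ matrix product and then draws one index with probability $|\beta_j|^2$. The function names `normalized_dft` and `fourier_sample` are hypothetical, and the choice $\omega = e^{2\pi i/M}$ is an assumption (the text only requires a complex $M$th root of unity).

```python
import cmath
import math
import random


def normalized_dft(alpha):
    """Return beta with beta_j = (1/sqrt(M)) * sum_l omega^(j*l) * alpha_l."""
    M = len(alpha)
    # Assumption: omega = e^(2*pi*i/M), one particular primitive Mth root of unity.
    omega = cmath.exp(2j * cmath.pi / M)
    scale = 1 / math.sqrt(M)
    return [
        scale * sum(omega ** (j * l) * alpha[l] for l in range(M))
        for j in range(M)
    ]


def fourier_sample(alpha, rng=random):
    """Return a single index j, drawn with probability |beta_j|^2.

    Only the index is returned, mirroring the fact that a measurement
    reveals m = log M classical bits and never the amplitudes themselves.
    """
    beta = normalized_dft(alpha)
    weights = [abs(b) ** 2 for b in beta]
    return rng.choices(range(len(beta)), weights=weights, k=1)[0]


if __name__ == "__main__":
    alpha = [0.5, 0.5j, -0.5, 0.5]  # squared magnitudes add up to 1
    beta = normalized_dft(alpha)
    # The 1/sqrt(M) factor keeps the squared magnitudes summing to 1.
    print(sum(abs(b) ** 2 for b in beta))
    # Each call simulates preparing the state again and measuring once.
    print([fourier_sample(alpha) for _ in range(10)])
```

Each call to `fourier_sample` recomputes the whole transform, which takes time polynomial in $M$; the point of the quantum algorithm is that the same sampling distribution is obtained with only $O(\log^2 M)$ quantum operations.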
Quantum Fourier sampling is basically a quick way of getting a very rough idea about the output of the classical FFT, just detecting one of the larger components of the answer vector. In fact, we don't even see the value of that component; we only see its index. How can we use such meager information? In which applications of the FFT is just the index of the large components enough? This is what we explore next.

10.4 Periodicity

Suppose that the input to the QFT, $|\boldsymbol{\alpha}\rangle = (\alpha_0, \alpha_1, \ldots, \alpha_{M-1})$, is such that $\alpha_i = \alpha_j$ whenever $i \equiv j \bmod k$, where $k$ is a particular integer that divides $M$. That is, the array $\boldsymbol{\alpha}$ consists of $M/k$ repetitions of some sequence $(\alpha_0, \alpha_1, \ldots, \alpha_{k-1})$ of length $k$. Moreover, suppose that exactly one of the $k$ numbers $\alpha_0, \ldots, \alpha_{k-1}$ is nonzero, say $\alpha_j$. Then we say that $|\boldsymbol{\alpha}\rangle$ is periodic with period $k$ and offset $j$.

Figure 10.5 Examples of periodic superpositions. [Two panels, labeled "period 3" and "period 4"; plots not reproduced.]

It turns out that if the input vector is periodic, we can use quantum Fourier sampling to compute its period! This is based on the following fact, proved in the next box:

Suppose the input to quantum Fourier sampling is periodic with period $k$, for some $k$ that divides $M$. Then the output will be a multiple of $M/k$, and it is equally likely to be any of the $k$ multiples of $M/k$.

Now a little thought tells us that by repeating the sampling a few times (repeatedly preparing the periodic superposition and doing Fourier sampling), and then taking the greatest common divisor of all the indices returned, we will with very high probability get the number $M/k$ and from it the period $k$ of the input!

The Fourier transform of a periodic vector

Suppose the vector $|\boldsymbol{\alpha}\rangle = (\alpha_0, \alpha_1, \ldots, \alpha_{M-1})$ is periodic with period $k$ and with no offset (that is, the nonzero terms are $\alpha_0, \alpha_k, \alpha_{2k}, \ldots$). Thus,

$$|\boldsymbol{\alpha}\rangle = \sum_{j=0}^{M/k-1} \sqrt{\frac{k}{M}}\, |jk\rangle.$$

We will show that its Fourier transform $|\boldsymbol{\beta}\rangle = (\beta_0, \beta_1, \ldots, \beta_{M-1})$ is also periodic, with period $M/k$ and no offset.
Claim $|\boldsymbol{\beta}\rangle = \frac{1}{\sqrt{k}} \sum_{j=0}^{k-1} \left| \frac{jM}{k} \right\rangle$.

Proof. In the input vector, the coefficient $\alpha_\ell$ is $\sqrt{k/M}$ if $k$ divides $\ell$, and is zero otherwise. We can plug this into the formula for the $j$th coefficient of $|\boldsymbol{\beta}\rangle$:

$$\beta_j = \frac{1}{\sqrt{M}} \sum_{\ell=0}^{M-1} \omega^{j\ell} \alpha_\ell = \frac{\sqrt{k}}{M} \sum_{i=0}^{M/k-1} \omega^{jik}.$$

The summation is a geometric series, $1 + \omega^{jk} + \omega^{2jk} + \omega^{3jk} + \cdots$, containing $M/k$ terms and with ratio $\omega^{jk}$ (recall that $\omega$ is a complex $M$th root of unity). There are two cases. If the ratio is exactly 1, which happens if $jk \equiv 0 \bmod M$, then the sum of the series is simply the number of terms, $M/k$. If the ratio isn't 1, we can apply the usual formula for geometric series to find that the sum is

$$\frac{1 - \omega^{jk(M/k)}}{1 - \omega^{jk}} = \frac{1 - \omega^{Mj}}{1 - \omega^{jk}} = 0.$$

Therefore $\beta_j$ is $1/\sqrt{k}$ if $M$ divides $jk$, and is zero otherwise.

More generally, we can consider the original superposition to be periodic with period k, but with some offset l