Thoughts on finding the essential elements of a set

Raymond Chen

Suppose you have a set of n items, two of which are essential, and the rest are superfluous. You can pass any subset of these items to an oracle, and the oracle will tell you whether the set contains all of the essential elements. The objective is to identify those essential elements.

You might run into this problem if the elements are package dependencies, and you want to figure out which ones are actually necessary for your project to build, and which ones are just cargo cult.

If the problem had been formulated with just one essential element, then it would be a simply binary search: Divide the set into equal-sized subsets and ask the oracle which subset contains the essential element. Recurse on that subset, and you can find the essential element in O(log n) steps.

But what if there are two essential elements? You could try the same thing and divide in half, but if the oracle says “Neither half contains both of the essential elements,” then you’re in a bit of a pickle because you don’t know which pieces of the two halves need to be combined.

One option is to try to peel off the essential elements one at a time. For example, an inefficient algorithm would be to remove one element and ask the oracle of the remaining elements include all the essential elements. If it says yes, then you can recurse with the smaller set. if it says no, then you know that the element you removed is one of the essential elements, and you can now use the “find one essential element” algorithm on the rest. (Just remember to add the essential element you already found to each query you pass to the oracle.)

Now that we have an inefficient algorithm, we can try to make it more efficient: Instead of removing one element at a time, you can use a binary search to find the “highest-numbered essential element”: At the start, you know that the (zero-based) index of the highest-numbered essential element is somewhere between 1 and n − 1,¹ inclusive. At each step, find the midpoint between the low and high boundaries of the range and ask the oracle whether all the elements up to that midpoint element include all the essential elements. If so, then you can move the upper boundary of the range down to the midpoint; if not, then you can move the lower boundary of the range up to the midpoint. In this way, you can do a binary search on “the highest-numbered essential elements.”

And then once you’ve found one essential element, you can use a regular binary search to find the other one.

We can generalize this to the case where there are m essential elements: Start with a known range of (m − 1) to (n − 1), and use binary search to find the highest-numbered essential element. Once you’ve done that, you’ve reduced the problem to finding the m − 1 essential elements below the highest-numbered essential element, and so on, for a total complexity of O(m log n).

¹ You know that it cannot be zero, because there are two essential elements, and the earliest you can get both of them is if one of them is at index 0 and the other is at index 1.

Author

Raymond Chen

Raymond has been involved in the evolution of Windows for more than 30 years. In 2003, he began a Web site known as The Old New Thing which has grown in popularity far beyond his wildest imagination, a development which still gives him the heebie-jeebies. The Web site spawned a book, coincidentally also titled The Old New Thing (Addison Wesley 2007). He occasionally appears on the Windows Dev Docs Twitter account to tell stories which convey no useful information.

9 comments

Discussion is closed. Login to edit/delete existing comments.

Ian Yates August 31, 2024

Reminds me of a YouTube video I just watched last night. It's an efficient way to solve the flashlight & 4 good vs 4 bad batteries problem. Flashlight only works if it's given 2 good batteries simultaneously - so 1 good + 1 bad, or 2 bad, means no light. You need to find 2 good batteries - how many steps are required in worst case?
Surprisingly it's 7 (including the final "test" where you know it'll work). It comes down to exploiting that you *know* there are 4 good and 4 bad, which is not the same as...
Read more
Reminds me of a YouTube video I just watched last night. It’s an efficient way to solve the flashlight & 4 good vs 4 bad batteries problem. Flashlight only works if it’s given 2 good batteries simultaneously – so 1 good + 1 bad, or 2 bad, means no light. You need to find 2 good batteries – how many steps are required in worst case?
Surprisingly it’s 7 (including the final “test” where you know it’ll work). It comes down to exploiting that you *know* there are 4 good and 4 bad, which is not the same as “at least 4 good”. Naive approach has you doing 20+ tests in the worst case as that’s not using all provided knowledge.
Search for “The famous batteries and flashlight logic puzzle” to find it (or at least the version I watched, as there are a few) on a channel called MindYourDecisions.

Read less
Jonas Barklund August 27, 2024 · Edited

Sorry if I’m repeating myself, I cannot see my previous reply so trying one more time.
If one thinks of this as a sorting problem where element A < element B if A is essential and B is not, then it seems fairly obvious that one can find the essential elements in an array of n elements in O(n log n) time.
alan robinson August 27, 2024

Depends on who's efficiency you care about, the programmer or the CPU. This if it's the programmer, a genetic algorithm (GA) inspired approach where you are evolving which part to discard would do the job nicely with minimal problem-specific code. All you do is randomly discard half the set and submit it to the oracle; if it passes restart with that as the new set to search. Otherwise, submit a new random permutation of the current set. Also has the upside of generalizing to unknown (n), and should be much faster than the simplest discard 1 solution that...
Read more
Depends on who’s efficiency you care about, the programmer or the CPU. This if it’s the programmer, a genetic algorithm (GA) inspired approach where you are evolving which part to discard would do the job nicely with minimal problem-specific code. All you do is randomly discard half the set and submit it to the oracle; if it passes restart with that as the new set to search. Otherwise, submit a new random permutation of the current set. Also has the upside of generalizing to unknown (n), and should be much faster than the simplest discard 1 solution that Raymond starts out with, but barely more complex to implement, or could even be fully formed in your toolbox already as it’s a generic search method.

Read less
David Gershnik August 26, 2024

A different algorithm, though one unfortunately requiring exact knowledge of m (how many essential elements).
Partition the set into m+1 parts. I'll number them starting at 1. (For m=2, thirds)
Test the whole set except part 1 - If yes, eliminate part 1 and restart with your smaller set
If no, test the whole set except part 2. Continue like this until one of your tests says yes. (One of them will by the pigeonhole principle)
Each iteration eliminates 1/m+1 of the set, meaning that the algorithm is log(n) with the base of the log being (m+1)/m

As for the constant...
Read more
A different algorithm, though one unfortunately requiring exact knowledge of m (how many essential elements).
Partition the set into m+1 parts. I’ll number them starting at 1. (For m=2, thirds)
Test the whole set except part 1 – If yes, eliminate part 1 and restart with your smaller set
If no, test the whole set except part 2. Continue like this until one of your tests says yes. (One of them will by the pigeonhole principle)
Each iteration eliminates 1/m+1 of the set, meaning that the algorithm is log(n) with the base of the log being (m+1)/m

As for the constant coefficient, for each test the probability that no essential items are in that partition, i.e. that every essential item is NOT in that partition is ((m-1)/m)^m, or in the limit, 1/e. We should therefore expect to do e tests during each iteration. This is the part I am most shaky on, my probability/stats is rusty.

Unclear if this is more efficient than repeated binary searches but was the first thing that came to mind for me.

Read less
Stuart Ballard August 26, 2024

For the specific case of two essential elements, how about this approach:

Split the set in two. Test each half. If one half passes, that set contains both the essential elements; throw out the rest and recurse.
If neither half passes, you know you have one essential element in each half. That means you can use the algorithm for finding a single essential element on each half, by asking if (subset of A) u (whole of B) contains both essential elements.
Kevin Norris August 26, 2024

Harder generalization: What if we don't know how many essential elements there are? We just know that "some" elements are essential.

Raymond's naive algorithm does work for this case (with slight modifications to ensure it does not terminate until it has gone through the whole set). I would tend to assume that you can adapt the fast algorithm in a similar way... but that would seem to have a running time of O(n log n), which is too slow because the naive algorithm is O(n). Perhaps the naive algorithm is already as good as we can get - it's hard to...
Read more
Harder generalization: What if we don’t know how many essential elements there are? We just know that “some” elements are essential.

Raymond’s naive algorithm does work for this case (with slight modifications to ensure it does not terminate until it has gone through the whole set). I would tend to assume that you can adapt the fast algorithm in a similar way… but that would seem to have a running time of O(n log n), which is too slow because the naive algorithm is O(n). Perhaps the naive algorithm is already as good as we can get – it’s hard to believe you can go faster than O(n) if you don’t have a termination condition other than “we tried everything.”

Read less
- Raymond Chen Author August 26, 2024
  
  Yeah, if you extend this algorithm to an unknown number of essential elements, you degrade to O(n log n) in the worst case that there are n/2 essential elements spread out at every other position.
  - Akash Bagh August 26, 2024 · Edited
    
    There is a simple algorithm that doesn't go slower than O(n) in the case where the number of required elements is unknown.
    
    Instead of starting a binary search to find the first required element, try dropping the last 1, 2, 4, 8, 16 etc. unknown elements, until you get a negative response (meaning: you dropped a required element). Then you know that the last chunk contains a required element, and you can do a binary search to find which one it is, as described in the blog post. Repeat until all elements are classified.
    
    This way, you can identify the sizes of...
    Read more
    There is a simple algorithm that doesn’t go slower than O(n) in the case where the number of required elements is unknown.
    
    Instead of starting a binary search to find the first required element, try dropping the last 1, 2, 4, 8, 16 etc. unknown elements, until you get a negative response (meaning: you dropped a required element). Then you know that the last chunk contains a required element, and you can do a binary search to find which one it is, as described in the blog post. Repeat until all elements are classified.
    
    This way, you can identify the sizes of gaps between required elements in O(log g) time each, and since in the worst case the gaps are approximately equal sized, the overall algorithm runs in O(m×log(n/m)) time. When m is much smaller than n, that’s O(m log n), just as in the blog post. But when m is close to n, it’s O(n), which is exactly as fast as the naive algorithm.
    
    (This is similar to exponential search.)
    
    Read less
- GL August 26, 2024 · Edited
  
  If the only a priori information is that some elements are essential, say an arbitrary non-empty subset is essential, then there are 2^n-1>2^(n-1) possible answers (if n>1), so a (binary) decision tree of depth at least n is necessary to fully determine the answer, so the naive algorithm is the best algorithm already.