August 26th, 2024

Thoughts on finding the essential elements of a set

Raymond Chen

Suppose you have a set of n items, two of which are essential, and the rest are superfluous. You can pass any subset of these items to an oracle, and the oracle will tell you whether the set contains all of the essential elements. The objective is to identify those essential elements.

You might run into this problem if the elements are package dependencies, and you want to figure out which ones are actually necessary for your project to build, and which ones are just cargo cult.

If the problem had been formulated with just one essential element, then it would be a simple binary search: Divide the set into equal-sized subsets and ask the oracle which subset contains the essential element. Recurse on that subset, and you can find the essential element in O(log n) steps.
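Here is a minimal sketch of the one-element case in Python. The names `find_one_essential` and `oracle` are made up for illustration; the oracle is assumed to be a function that takes a set of items and returns whether it contains the essential element, and the items are assumed to be distinct and hashable.

```python
from typing import Callable, Hashable, Sequence, Set


def find_one_essential(
    items: Sequence[Hashable],
    oracle: Callable[[Set[Hashable]], bool],
) -> Hashable:
    """Binary search for the single essential element (hypothetical helper)."""
    candidates = list(items)
    while len(candidates) > 1:
        middle = len(candidates) // 2
        first_half = candidates[:middle]
        # One query per step is enough: if the first half does not
        # contain the essential element, the second half must.
        if oracle(set(first_half)):
            candidates = first_half
        else:
            candidates = candidates[middle:]
    return candidates[0]
```

Each pass halves the candidate list at the cost of one oracle query, which is where the O(log n) comes from.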

But what if there are two essential elements? You could try the same thing and divide in half, but if the oracle says “Neither half contains both of the essential elements,” then you’re in a bit of a pickle because you don’t know which pieces of the two halves need to be combined.

One option is to try to peel off the essential elements one at a time. For example, an inefficient algorithm would be to remove one element and ask the oracle if the remaining elements include all the essential elements. If it says yes, then you can recurse with the smaller set. If it says no, then you know that the element you removed is one of the essential elements, and you can now use the “find one essential element” algorithm on the rest. (Just remember to add the essential element you already found to each query you pass to the oracle.)
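The inefficient version might look something like this in Python. This is only a sketch: `find_two_essential_linear` and `oracle` are hypothetical names, the oracle is assumed to accept a set of items and return a boolean, and the items are assumed to be distinct and hashable.

```python
from typing import Callable, Hashable, Sequence, Set, Tuple


def find_two_essential_linear(
    items: Sequence[Hashable],
    oracle: Callable[[Set[Hashable]], bool],
) -> Tuple[Hashable, Hashable]:
    """Peel off elements one at a time until the oracle objects."""
    remaining = list(items)
    while True:
        removed = remaining.pop()  # take away the last element
        if not oracle(set(remaining)):
            break  # the element we just removed was essential
    first = removed

    # Ordinary binary search for the other essential element,
    # remembering to include the one we already found in every query.
    candidates = remaining
    while len(candidates) > 1:
        middle = len(candidates) // 2
        first_half = candidates[:middle]
        if oracle(set(first_half) | {first}):
            candidates = first_half
        else:
            candidates = candidates[middle:]
    return first, candidates[0]
```

In the worst case, the peeling phase asks the oracle about almost every element, so this costs O(n) queries before the binary search even starts.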

Now that we have an inefficient algorithm, we can try to make it more efficient: Instead of removing one element at a time, you can use a binary search to find the “highest-numbered essential element”: At the start, you know that the (zero-based) index of the highest-numbered essential element is somewhere between 1 and n − 1,¹ inclusive. At each step, find the midpoint between the low and high boundaries of the range and ask the oracle whether all the elements up to that midpoint element include all the essential elements. If so, then you can move the upper boundary of the range down to the midpoint; if not, then you can move the lower boundary of the range up to just past the midpoint. In this way, you can do a binary search on “the highest-numbered essential element.”

And then once you’ve found one essential element, you can use a regular binary search to find the other one.
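The two binary searches could be sketched in Python as follows. The function name `find_two_essential` and the shape of `oracle` (a function from a set of items to a boolean) are assumptions made for illustration, and the items are assumed to be distinct and hashable.

```python
from typing import Callable, Hashable, Sequence, Set, Tuple


def find_two_essential(
    items: Sequence[Hashable],
    oracle: Callable[[Set[Hashable]], bool],
) -> Tuple[Hashable, Hashable]:
    """Return (lower, higher) essential elements using two binary searches."""
    # Search for the index of the highest-numbered essential element.
    # It lies somewhere in [1, n - 1], inclusive.
    lo, hi = 1, len(items) - 1
    while lo < hi:
        mid = (lo + hi) // 2
        if oracle(set(items[: mid + 1])):
            hi = mid  # both essential elements are at or below mid
        else:
            lo = mid + 1  # the higher one is somewhere above mid
    high = lo

    # Regular binary search for the other one, which lies below `high`.
    # Every query includes the essential element already found.
    lo, hi = 0, high - 1
    while lo < hi:
        mid = (lo + hi) // 2
        if oracle(set(items[: mid + 1]) | {items[high]}):
            hi = mid
        else:
            lo = mid + 1
    return items[lo], items[high]
```

For example, with the items `a` through `h` and essential elements `c` and `g`, the first search narrows the range 1..7 down to index 6, and the second narrows 0..5 down to index 2.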

We can generalize this to the case where there are m essential elements: Start with a known range of (m − 1) to (n − 1), and use binary search to find the highest-numbered essential element. Once you’ve done that, you’ve reduced the problem to finding the m − 1 essential elements below the highest-numbered essential element, and so on, for a total complexity of O(m log n).
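A sketch of the generalized version in Python, under the same assumptions as before: `find_essential` is a hypothetical name, `oracle` is assumed to take a set of items and return a boolean, the items are distinct and hashable, and the caller knows m in advance.

```python
from typing import Callable, Hashable, List, Sequence, Set


def find_essential(
    items: Sequence[Hashable],
    oracle: Callable[[Set[Hashable]], bool],
    m: int,
) -> List[Hashable]:
    """Find all m essential elements, highest-numbered first."""
    found: List[Hashable] = []
    upper = len(items) - 1  # highest index that could still be essential
    for remaining in range(m, 0, -1):
        # With `remaining` essential elements at or below `upper`, the
        # highest of them has index at least remaining - 1.
        lo, hi = remaining - 1, upper
        while lo < hi:
            mid = (lo + hi) // 2
            # Include everything found so far so the oracle can say yes.
            if oracle(set(items[: mid + 1]) | set(found)):
                hi = mid
            else:
                lo = mid + 1
        found.append(items[lo])
        upper = lo - 1  # the rest are strictly below this one
    return found
```

Each of the m passes is a binary search over at most n elements, which matches the O(m log n) total.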

¹ You know that it cannot be zero, because there are two essential elements, and the earliest you can get both of them is if one of them is at index 0 and the other is at index 1.

Topics

Code

Author

Raymond Chen

Raymond has been involved in the evolution of Windows for more than 30 years. In 2003, he began a Web site known as The Old New Thing which has grown in popularity far beyond his wildest imagination, a development which still gives him the heebie-jeebies. The Web site spawned a book, coincidentally also titled The Old New Thing (Addison Wesley 2007). He occasionally appears on the Windows Dev Docs Twitter account to tell stories which convey no useful information.

9 comments

Ian Yates August 31, 2024

Reminds me of a YouTube video I just watched last night. It’s an efficient way to solve the flashlight & 4 good vs 4 bad batteries problem. Flashlight only works if it’s given 2 good batteries simultaneously – so 1 good + 1 bad, or 2 bad, means no light. You need to find 2 good batteries – how many steps are required in worst case?
Surprisingly it’s 7 (including the final “test” where you know it’ll work). It comes down to exploiting that you *know* there are 4 good and 4 bad, which is not the same as “at least 4 good”. Naive approach has you doing 20+ tests in the worst case as that’s not using all provided knowledge.
Search for “The famous batteries and flashlight logic puzzle” to find it (or at least the version I watched, as there are a few) on a channel called MindYourDecisions.

Jonas Barklund August 27, 2024 · Edited

Sorry if I’m repeating myself, I cannot see my previous reply so trying one more time.
If one thinks of this as a sorting problem where element A < element B if A is essential and B is not, then it seems fairly obvious that one can find the essential elements in an array of n elements in O(n log n) time.
alan robinson August 27, 2024

Depends on whose efficiency you care about, the programmer or the CPU. If it’s the programmer, a genetic algorithm (GA) inspired approach where you are evolving which part to discard would do the job nicely with minimal problem-specific code. All you do is randomly discard half the set and submit it to the oracle; if it passes, restart with that as the new set to search. Otherwise, submit a new random permutation of the current set. Also has the upside of generalizing to unknown (n), and should be much faster than the simplest discard 1 solution that Raymond starts out with, but barely more complex to implement, or could even be fully formed in your toolbox already as it’s a generic search method.

David Gershnik August 26, 2024

A different algorithm, though one unfortunately requiring exact knowledge of m (how many essential elements).
Partition the set into m+1 parts. I’ll number them starting at 1. (For m=2, thirds)
Test the whole set except part 1 – If yes, eliminate part 1 and restart with your smaller set
If no, test the whole set except part 2. Continue like this until one of your tests says yes. (One of them will by the pigeonhole principle)
Each iteration eliminates 1/(m+1) of the set, meaning that the algorithm is log(n) with the base of the log being (m+1)/m

As for the constant coefficient, for each test the probability that no essential items are in that partition, i.e. that every essential item is NOT in that partition is (m/(m+1))^m, or in the limit, 1/e. We should therefore expect to do e tests during each iteration. This is the part I am most shaky on, my probability/stats is rusty.

Unclear if this is more efficient than repeated binary searches but was the first thing that came to mind for me.

