A colleague told me that there was an O(N) algorithm for finding a duplicated item in an array of N integers in the range 1 to N − 1. There must be a duplicate due to the pigeonhole principle. There might be more than one duplicated value; you merely have to find any duplicate.¹
The backstory behind this puzzle is that my colleague had thought this problem was solvable in O(N log N), presumably by sorting the array and then scanning for the duplicate. They posed this as an interview question, and the interviewee found an even better linear-time algorithm!
My solution is to interpret the array as a linked list of 1-based indices, and borrow the sign bit of each integer as a flag to indicate that the slot has been visited. We start at index 1 and follow the indices until they either reach a value whose sign bit has already been set (which is our duplicate), or they return to index 1 (a cycle). If we find a cycle, then move to the next index which does not have the sign bit set, and repeat. At the end, you can restore the original values by clearing the sign bits.²
I figured that modifying the values was acceptable given that the O(N log N) solution also modifies the array. At least my version restores the original values when it’s done!
But it turns out the interview candidate found an even better O(N) algorithm, one that doesn’t modify the array at all.
Again, view the array values as indices. You are looking for two nodes that point to the same destination. You already know that no array entry has the value N, so the entry at index N cannot be part of a cycle. Therefore, the chain that starts at N must eventually join an existing cycle, and that join point is a duplicate. Start at index N and use Floyd’s cycle detector algorithm to find the start of the cycle in O(N) time.
¹ If you constrain the problem further to say that there is exactly one duplicate, then you can find the duplicate by summing all the values and then subtracting N(N−1)/2.
² I’m pulling a fast one. This is really O(N) space because I’m using the sign bit as a convenient “initially zero” flag bit.
I feel like I'm missing something. I would have approached this exercise as:
Make a separate array of length N-1 with values set to 0. Iterate through the input array and use its value as an index to the new array. If the value is 0 then increment it and continue to the next elements, but if the value is 1 then you've found your duplicate. Is that not O(N) time and O(N-1) space, since you're just iterating through the input array once and adding to memory an array of length N-1?
Chasing cycles is clever, sure, and it wouldn't shock...
I probably would have used an N-2 bitset if not permitted to modify the given elements. Each number corresponds to a bit, you’ll find the duplicate easily in O(N) since it will already be set. It does require allocating for large N though.
EDIT: Why does the blog software replace normal dashes with emdashes??