The Long Line Problem

The original problem can be found on the book The Art of Computer Programming (TAOCP), Volume 3, Second Edition, Section 5, Exercise 24. The book is famous for many reasons, one of which being that it provides a lot of interesting exercises; this one has a very simple description and yet it lends itself to a nice discussion.

Problem Description

A few billion people formed a line. They all received a piece of paper, and in it they wrote:

The first person threw away their piece of paper without writing anything. The other pieces of paper were collected, shuffled, and given to you in the form of a “vector of pieces of paper”.

Your goal is to reconstruct the original line by printing the names of each person, from last to first.

Programming Interface

struct Paper {
  Name author;
  Name in_front;
};

void reconstruct(vector<Paper> &entries) {
  // Print the names in the original order.
}

One Solution

Common solutions involve creating auxiliary data structures like sets or dictionaries; let’s explore how to do it without using extra memory.

Reconstructing Part of the Line

We can choose an arbitrary Name some_name and reconstruct the line in front of that person:

When the search in Step 1 fails, we’ve found a name who is not the author of any piece of paper: that’s the first person in the line, so we’re done.

For example, suppose we were given this input and want to reconstruct the line in front of Alice:

We look for Alice’s paper, then Fig’s, then Bob’s, then Gil’s and that’s when we fail.

Each search (dotted line) can potentially scan the entire input, leading to linear complexity on each search. By sorting the papers according to their author, we can binary search instead, making each search logarithmic in complexity.

void reconstruct_from(const Name &some_name, vector<Paper> &entries) {
  auto get_author = &Paper::author;

  ranges::sort(entries, std::less(), get_author);

  auto next_entry = [&](const Name& name) {
    return ranges::lower_bound(entries, name,
                               std::less(), get_author);
  };

  for (auto it = next_entry(some_name);
            it != entries.end();
            it = next_entry(it->in_front))
    std::cout << it->in_front << "\n";
}

If we have N pieces of paper, the complexity of this algorithm is O(N*logN):

Identifying the Last Person

We’ve shown how to print the line starting at an arbitrary person; if we call reconstruct_from with the last person as input, we’re done. But how do we identify the last person in line? Let’s look at the line again:

These properties suggest the following algorithm: for each author, search for that name in the in_front entries. If the search fails, that’s the last person in line. To make the searching step efficient, we sort the entries according to the in_front field.

Name find_last_person(vector<Paper> &entries) {
  auto get_in_front = &Paper::in_front;
  ranges::sort(entries, std::less(), get_in_front);

  auto is_last_person = [&](const Name &name) {
    return !ranges::binary_search(entries, name,
                                  std::less(), get_in_front);
  };

  auto get_author = &Paper::author;
  auto it = ranges::find_if(entries, is_last_person,
                            get_author);
  return it->author;
}

If we have N pieces of paper, the complexity of this algorithm is O(N*logN):

Putting It All Together

All that’s left to do is to combine the two main routines into a final solution:

void reconstruct(vector<Paper> &entries) {
  auto last = find_last_person(entries);
  std::cout << last << "\n";
  reconstruct_from(last, entries);
}

Mutability Gotcha

The implementation above mutates the input vector by sorting in-place in order to avoid unnecessary data copies. However, this requires extra care when using references to elements of the vector.

For example, let’s change the signature of find_last_person to return a reference inside entries:

const Name& find_last_person(vector<Paper> &entries);

void reconstruct(vector<Paper> &entries) {
  const auto &last = find_last_person(entries); // Reference!!
  reconstruct_from(last, entries);
}

This is going to fail for a subtle reason: reconstruct_from will sort entries, and this will change what last points to. You can experiment with this faulty program.

More Efficient Representations for Name

When looking for the last person in line, we’re trying to find the only name that does not appear in a list of names. If names are the N integers from 0 to N-1, we can use the “XOR trick” to find the missing number.

Let’s XOR all the N numbers from 0 to N-1 with all the N-1 in_front entries:

(0⊕in_front₀) ⊕ (1⊕in_front₁) ⊕ ... ⊕ (n−2⊕in_front_n − 2) ⊕ n − 1

Every number in the range from 0 to N-1 appears twice, with the exception of the missing number. Because XOR is commutative, the expression above is equivalent to:

Since the XOR of a number with itself is zero, the whole expression is equal to missing_number.

Name find_last_person(vector<Paper> &entries) {
  auto get_in_front = [](const Paper &p)
                      { return p.in_front; };
  auto in_fronts = entries |
                   std::views::transform(get_in_front);
  auto xor_in_fronts = std::accumulate(
      std::begin(in_fronts), std::end(in_fronts),
      0, std::bit_xor());

  auto indices = std::ranges::iota_view{
                 0,
                 1 + static_cast<int>(entries.size())};
  auto last_person = std::accumulate(
      std::begin(indices), std::end(indices),
      xor_in_fronts, std::bit_xor());

  return last_person
}

This algorithm is linear in the length of entries. The complexity of reconstruct_from remains the same.

The Donald Knuth Solution

What we discussed here is NOT the intended solution to the original problem. The book adds a few extra constraints to the problem: it states that the input is in a tape, i.e., storage without random access but available in great amounts. It also states we have some limited amount of random access memory, a fancy way of saying that we are allowed to sort the input as needed.

The reference solution is complex, juggling five lists at the same time. It starts like this: create a list with the authors and a list with the in_fronts, and sort both lists. With those, we can find the first and last person in line by zipping over both lists. The pairs produced by the zip should contain identical names, except when one of them is the first or last person in line. When this happens, how do we know whether we found the first or last person?

I will leave you with that question. Even this only scratches the surface of the full solution.