Monday, August 5, 2013

Longest increasing subsequence, considering contigous or not, starting with brute force and then optimizing

Introduction

 A way to solve problems like this is to keep a candidate list of subsolutions. You walk along the input and either add to an existing subsolution or start a new one. A brute force approach would be to keep track of all possible candidates and then verify to find which ones are best after the list of all of them is generated. However, a more efficient approach is to only add a candidate if you know that it may contribute, and prune away candidates if you know they will not contribute.

 

Longest Contiguous Increasing Subsequence


Say you want to find the longest contiguous increasing subsequence of an array.

(Using python syntax for description.)

For example:
 

X = [0, 8, 4, 9, 2, 3]


Brute force:
Walk across the list from left to right, and keep a list of possible solutions S you have found so far. This means that when you are at position i you do this:

Keep all lists already in S
Consider each list in S and add a new one that has X[i] appended to the end.

(This will also generate a new list that starts on X[i] because the empty list is in S.)

def longestContiguousSubsequence(X):



 # Staring with just one empty solution []. 

 S = [[]]

 for i in range(len(X)):
     for j in range(len(S)):
         newLists = []
         for s in S:
             copyOfList = list(s) 
             copyOfList.append(X[i])
             newLists.append(copyOfList)
         S += newLists


 max = 0
 maxList = []


 for s in S:
  if increasing(s):   # assume increasing returns true if the sequence is increasing
      if len(s) > max:
          max = len(s)
          maxList = s 

 return maxList

# test
print longestContiguousSubsequence(X)

This works but it does a lot of extra work.

For example, [0, 8] and [0, 4] are both put into S. However, you really don't need to keep both of them. The last element of [0, 4] is less than the last element of [0, 8], and there's no way that [0, 4] could do any better than [0, 8]. You can always add more increasing items to the end of [0, 4]. So, here's a rule: if you are about to add a new list it has to be "better" (it has to have more potential) than existing lists.

In here's what makes a list better:
(1) It's better if it has a lower number on the end.
(2) It's better if it's longer.

When you add a list, you are appending on new number. You're making a new list that's one longer, so it might better than the one that it comes from. But you want to compare it to other existing lists of equal length to see if it has a lower ending number, if it doesn't have a lower ending number, it's not useful. (Also if there are longer lists with lower numbers at the end, this one will not be useful, later on we can use this fact also.)

This means that really there can never be more than one useful list in S of a given length. The only list you would keep in S of a given length is one with the minimum ending value.

This means that we can have an array "Best" exactly the length of the list X that just stores a best list for each possible length. It will store a reference to a list at each element of the array. The first element is the best list of length 1, the second element is the best list of length 2, etc.

Since Python has zero based arrays, we'll actually use an array Best that has length of S + 1. The index of the array is equal to the lengh of the candidate sequence stored at that position.

Now, when you are adding a candidate list, and it turns out to be length n, you always look to the Best array to see if there's a list of length n that has a lower ending value, if so, you don't need to add it.

One more simplification: you don't need to store the whole lists in Best. You really only need to know where the list ends and the length of the list. That's enough to fully describe any subsequence. The index to Best is the length of the list, so if you just store where the list ends (one number) in each element of Best, that is enough.

Now, when you are about to add a new candidate list of length n by appending x, just look at Best[n] and it will give the the index i to the end of the best list of that length. If X[i] is smaller than x, you don't need to add your new list. If X[i] is bigger than x, you do need to add your new list, which just means replacing Best[n] with a new ending index.


 for i in range(len(X)):
   for lengthOfList in range(len(Best)):
     newLengthOfList = legthOfList + 1
     newEndOfList = X[i]
     if newEndOfList < Best[newLengthOfList]:
       Best[newLengthOfList] = X[i]


There's another optimization. If you can put x at the end of some subsequence s as above, then you don't need to put s at the end of any shorter subsequence. The shorter one would just end up having the exact same ending value and a shorter length, which can't possible be useful. So you really you only need to append to one of the candidate subsequences at most.

Walk along Best from highest index to lowest to find where to add to create a new list with one added. Then skip the others because they won't contribute to the final solution.

However, you don't really have to walk down the list one by one. The list represents the set of best solutions. Any solution A in the list that is shorter than solution B in the list will always have a lower number on the end (otherwise it would have been eliminated). So in a list of best solutions, the longer ones have higher values at the end. Because our Best array is indexed by length, it's also in order lowest last value to highest last value.

Longest Increasing Subsequence (Does not need to be contiguous)

This section is examining a different problem, where the goal is to find the longest increasing subsequence, and it does not need to be contiguous. Now I'm assuming that each candidates are being stored as a lists. (Other sources like Wikipedia have more efficient ways, but this is for explanation purposes.)

The approach to this is similar as described in the last section. You keep a list of best lists, indexed by length. You encounter a new number and you can create a new list with it appended to an existing candidate if it will create a new list that is "better" than those existing so far. (This includes the possibility of it creating a list longer than all of those existing so far and the possibility of creating a list that's just the same length as an existing but has a lower end value.)

Candidate solutions end up looking like this for example:
Best[1]:  0
Best[2]:  0, 4
Best[3]:  0, 4, 12
Best[4]:  0, 4, 12, 16

Notice that the ending number is always increasing. It must be, because if it was less than the previous, then the previous would not be useful. (Previous checks ensure that each list is useful, it has to be either longer or have a lower value at the end to get added.)

You could walk down Best from longest to shortest, one by one, checking the end value at each and append when you find the sufficiently small end value.

For example, if you found an 8, you could create a new list where it's appended to 0, 4 giving:
Best[1]:  0
Best[2]:  0, 4
Best[3]:  0, 4, 8
Best[4]:  0, 4, 12, 16

Or, if you found a 14, you'd want to make a new list with it appended to 0,4,12
Best[1]:  0
Best[2]:  0, 4
Best[3]:  0, 4, 12
Best[4]:  0, 4, 12, 14

In the examples, you can see that you really want to try adding it to the longest list with an ending value that's lower. (The only one that will work is the longest with an ending value lower. The ending values has to be lower for you to add it at all. And, if you tried one that wasn't longest, you'd always be trying to make a new list that's no better than one you already have based on ending value... because the one after it would have a lower ending value.) For example, if ending values are 0, 4, 12, 14, then 8 can only be extending 0, 4, there is not other option.


Considering ending values:
0 4 12 14
       ^
You can find where you'd want to put 8 with a binary search of the list's end values.
Binary search for 8, which will put you between value 4 and 12 (index 2 and index 3), so you'd want a new sublist extending the list shown at Best[2] and replacing Best[3].

In summary the algorithm is effectively maintaining a list of candidates like this for example:

Best[1]:  0
Best[2]:  0, 4
Best[3]:  0, 4, 12
Best[4]:  0, 4, 12, 14

At each iteration, walking down the input:
  It can put in a new best (Best[5] in this example) by appending

An optimization: Making new lists by copying is expensive. Here's more efficient way to get the job done. Represent the list as linked lists. For example:
0 4 12
(0) <- 3="" array="" p="" position="">
Now when you want to add 14. Create node with 14 and just set its point to (12) of this list (0) <- p="">



Above essentially creates an array of nodes where each of the nodes's points certain other previously created node.


Instead of using nodes, standard algorithms accomplish the same with a "parent" array.


You were creating one node at each iteration i (which each check of an input value). You can just put an index in an array p instead. Create an array of parents, each points to a child index (Very much like a node. Each element of the array is an index which is associated with another index. Just like each node is a reference to a number that's associated with a reference to another node.) For examples: At the addition of 4, the pointer p[1] would be set to point to the position of 0, which is i=0. At the addition of 12 (at i=2) the pointer (just a number) p[2] would point to position of the array that holds 4 (which is i=1).

N is length of array (assume its indices start at 1)
Note that each index that p stores can point you to a number in the input (if that's what you want) or to a previous node in p (if that's what you want). (Just like each node will give you either the data of the node or the next node.)
So p[N] points to the position in the array of the last value. (It also points to a position in the parent array.)
p[p[N]] points to the position in the array of the second to last value. (It also points to a position in the parent array.)











No comments: