R@M3$H.NBlog

[Dynamic Prog] Longest Common SubSeq - LCS

11 October, 2013 - 17 min read

Introduction

Dynamic programming have been a very interesting topic and technique to solve problems which exhibit a specific set of behaviors. We will try to understand the behaviors and the way it can be implemented to solve complex problem without requiring a mathematical approach. We will try to simplify it to make it clearly understandable and in later articles we will pick up two or three problem statements which can be easily solved using Dynamic Programming.

Purpose of the article

The purpose of this article is to enable the reader to analyze the complex programming problems, try to understand and employ the appropriate solution and help in learning dynamic programming. This may stretch to 2-3 articles to understand all the aspects of dynamic programming.

The Longest Common Subsequence Problem

This is a famous problem in the field of Biology to find the matching trends between two genes and many more application and it is also asked in most of the good interviews. Before describing this problem in detail, I would like to explicitly point down the difference between a substring and a subsequence.

Difference between Substring and Subsequence

A substring is a continuous sequence of characters from a string where as a subsequence is not necessarily a continuous sequence but maintains an order or characters.

Let us understand this with an example, where we have the below string

String 1: A L P H A N U M E R I C
We will list down few substrings and subsequences as below:
Substrings

  1. A L P H A
  2. N U M E R I C
  3. P H A N U
  4. M E R I

Subsequences

  1. A L P H A
  2. N M E R C
  3. A P H A U R I C

Above is a small list of substrings and subsequences. There can be many more, pointing it again, in a subsequence the order is important. We cannot say that A L N U C R is a subsequence.

Now with a better understanding of the differences, the problem statement becomes clearer.

Problem Statement

Given two strings of characters X[1...m] and Y[1...n], our job is to find a longest subsequence common to both. Point worth noting is that there may be more than one longest common subsequence in the given two strings.

For e.g. let us consider the below strings:
X : A B C B D A B
Y : B D C A B A

Let us try to list down the possible longest common subsequences in both the strings above.

  1. B D A B
  2. B C A B
  3. B C A B
  4. B C B A

First Approach

What exactly we did there? To simplify it, we chose one of the strings ( mostly we choose the string smaller in size) and started from the beginning of that string and kept comparing the character which comes in order to the characters in the second string. If they do not match and we reach the end of the second string, we skip the character and if it matches we move to the next character in both the strings.

This is called the Brute Force technique, it checks every subsequence of X[1...N] to see if it is also a subsequence of Y[1...M].. Hence, it involves finding all the possible subsequences of the first string. Then evaluates if any of the subsequence is also a subsequence of the second string.

Analyzing the Brute Force method

This is a very crude way of doing it, and also a time consuming one.
Part 1: Can we find out what is the total number of subsequences possible for a string of length N. The answer is very simple, its 2N. This is because every character in the string has a equal chance of appearing or not appearing in the subsequence. Hence at each character we have 2 options.

Now if there are only two characters in a string, then total number of subsequences would be four as below:

  1. An empty subsequence
  2. A subsequence containing the first character.
  3. A subsequence containing the second character.
  4. A subsequence containing both the characters.

Part 2: What is the time taken to find out if a given subsequence is also a subsequence of another string? It will take time proportional to the length of the second string in which we want to check. Let us say that the length of the string is M.

From part 1 and part 2 it is clear that the time taken to find out the longest common subsequence would be of the order M * 2N We can also say that the running time T = O(M * 2N)

Which is an exponential running time, assuming a string contains 1000 characters, the order of total number of subsequences would be 21000. Now this is tough to calculate or analyze. Hence, we need to find some other better performing approach.

We would try to simplify the problem so that we can solve it in polynomial time and not exponential time.

Second Approach

  1. Let the length of the longest common subsequence and let us say it is LS
  2. Extend the algorithm such that it finds the longest common subsequence using the LS itself.

What strategy should we employ?

Let me break down the problem to a smaller problem, consider the two strings in the below diagram.
sequenecs
One of the longest common subsequence of the above two strings is B C A B as listed above.

Now look at the strings below which merely are prefixes of the above strings:
sequences2
If I list down a longest common subsequence for the below strings I will get B C.

Point worth noting is that the longest common subsequence of the prefix strings, is a prefix of the longest common subsequence of the original strings. If this is a confusing line then I will put it in a simpler way.

The LCS(Longest common subsequence) of the strings in image 2 is B C which is a prefix of the LCS of the strings in image 1 i.e. B C A B

I am just trying to chop the problem down into smaller pieces.

Let us define few terms here.
X[1..i] is a prefix of X[1..M], and Y[1..j] is a prefix of Y[1..N]

Let us also define the length of longest common subsequence as Li,j which is equal to LCS of X[1..i] and Y[1..j].
Then by definition Lm,n would be equal to LCS of X[1..M] and Y[1..N].

Now assume that we are in the middle of the process where we have travelled to index i-1 in X and j-1 in Y and the length of the longest common subsequence according to our definition would be Li-1,j-1.

Now we compare the next characters X[i] and Y[j], there can be two cases:

1) If both of them match then Li,j would be Li-1,j-1 + 1, which is the one greater than the length of the last longest common subsequence.
2) If both of them don’t match then Li,j would be maximum of (Li,j-1 and Li-1,j), which is maximum of the length of the two longest common subsequences till now.

The statements above might not be easy to digest, so we will try and prove the statement so that we have a clear understanding.

Consider the strings below:
Case 1: if X[i] = Y[j]

Let us assume that Z[1..k] = LCS(X[1..i],Y[1..j]) , which means that the longest common subsequence of a prefix of X (upto index i) and prefix of Y (up to index j) is of length k and is represented by the sequence Z[1..k].

We can also say that Li,j is k, where Li,j is the longest sub sequence of X[1..i] and Y[1..j]. This also produces the result that Z[k] = X[i] = Y[j] because the Case 1 demands the X[i] to be equal to Y[j] and we defined Z[1..k] in a way that it is the longest common subsequence, so the last character of Z must be equal to X[i] or Y[j].

Case 2: if X[i] is not equal to Y[j] then it means that either X[i] or Y[j] is not a part of the longest common subsequence. So we cannot say much about Z[1..k] but we can definitely say that Z[1..k-1] is the longest common sub sequence of X[1..i-1], Y[1..j-1].

This is true because we are just considering the Z string to contain k-1 characters and hence there is no way by which there can be more than one common character between X[i-1], Y[j-1] and X[i], Y[j]

If this is still confusing, please write a comment and I will explain it more in detail.

The whole thing we discussed in last few 10-15 lines is just to come to an agreement on the statement that I made previously, which is “the longest common subsequence of the prefix strings is a prefix of the longest common subsequence of the original strings.” We can re-frame it to say that if Z is a LCS of X and Y then any prefix of Z is an LCS of a prefix of X and a prefix of Y.

Using this we can define a behavior of such kind of problems which says that, “An optimal solution to a problem contains and optimal solution to a subproblem.”

Now let us use the result of the above discussion to write a pseudo code to calculate the length of the longest common subsequence.
LCS
What will be the worst case scenario for this code?
It is pretty obvious that the worst case would be when X[i] != Y[j] for all value of i and j. I would like to elaborate this using a recursion tree.

Let us consider solving this when X is of length 4 and Y is of length 3, then we start at the root with (4,3) and the tree looks like the below in each subsequent level. every time we evaluate with i-1, j and i,j-1. The tree will grow till at least one argument in each of the leaves is not zero.
p3
What is the height of this tree?
It seems that it has to be M+N to accommodate all the cases and also, this is a binary tree hence at each level total number of nodes will increase by a power of 2, i.e. exponentially.

This seems to be almost the same as the Brute Force algorithm which we discussed in the beginning. Similar to the Brute force this also has a exponential running time.

I can’t stop noticing that in the tree most of the nodes contain same values. For e.g there are multiple nodes with values (2,2) and (3,1) etc. So we can apply some optimization here, also we can state that “the recursive solution contains few distinct sub problems repeated more than once.”

So, how many distinct sub problems are there, if most of them are repeated?

The answer would be M*N, because each unique sub problem in the tree is a function of i and j, and there are M distinct i and N distinct j, so total unique sub problems would be a product of M * N.

Hence, if we can employ some technique to remember what we already calculated and have a mechanism to fetch and return the previously calculate value, we might be saving lots of repetitive work.

We can actually maintain a map of i,j and the value Li,j, fetch and return Li,j if i and j are passed in to the above code. if there is nothing in the map then only we have to calculate it.

If we achieve it then the running time becomes proportional to the total number of unique problems. Hence, it will improve significantly because if a repeated sub problem occurs we simply return the already calculated value from the map.

So we can say that the running time to find the length of the Longest Common subsequence would be T(M,N) = O(M*N). Also, lets see the total space required, it would be the space required to save the already calculated solutions for the unique problems which is again proportional to M*N. Hence Space = O(M*N).

It is fairly visible that we started from the top of the tree where the values of i and j were maximum i.e. M and N respectively. So this can be termed as a top down approach, as we started from the max numbers and reduced it down to i = 0 and j = 0.

We can do the same thing in a better way if we can do it in a bottom up approach, because there might be many similar problems where we do not like to use recursion and we have the base case values handy. For e.g. in most of the cases while solving an equation in i and j we already know the result for i = 0 and j = 0, or i = 1 and j = 1, and it is fairly complicated to predict the result of i = m and j = n at the start of the calculation, so we must strive to use a bottom up approach to solve such complicated problems.

Introducing Dynamic Programming

And the bottom up approach to solve problems which exihibit traits as described above is called Dynamic Programming. Two re-iterate, the two similarities in all the problems which can be solved using the Dynamic Programming are:

  1. “An optimal solution to a problem contains and optimal solution to a sub problem.”
  2. “The recursive solution contains few distinct sub problems repeated more than once.”

Programming is a very old term and it was not always a synonym for ‘writing computer programs’. It meant to be a tabular way of calculation with rows and columns of instruction or data, and so are the programming languages. If you are familiar with the Assembly language, it also has the same tabular format of programming where each row contains a set of 2 or more columns and a row is an instruction which performs a task.

We use the tabular format to explain the solution to finding the length of the longest common subsequence using dynamic programming. We consider the same strings for this demonstration. Create a table with each character of first sequence as columns and each character of the second sequence as rows like the below table:
table
Now we definitely need to know how did we populate this table. Here is the rule,

We are evaluating each cell one at a time, suppose we are evaluating cell (i, j ) i.e. ith row and jth column, then we check if the character of row i matches the character of column j. Below two cases arise:
1) If the two characters are equal, we add one to the longest length calculated in the the previous row and column and store it in cell(i,j), it is the same case as the one we discussed where X[i] = Y[j], the length of the longest subsequence evaluates to be 1 + Li-1,j-1.
That means we take the value from cell(i-1,j-1), add 1 to it and store it in cell(i,j).

2) If the two characters are not equal, we find the maximum value among the two cell(i, j-1) and cell(i-1, j). This is again in accordance of the statements we proved previously in the discussion where X[i] != X[j], Li,j = max(L<sub<i-1,j)< sub="">, Li,j-1).

The first row and column will always have the value zero because they don’t contain any character.

Lets calculate the value of cell(2,2) assuming the index starts at 0 ,0. The second row contains character D and the second column contains character B. Both the characters don’t match so the value of cell(1,2) will have the value which is maximum of [cell(2,1) and cell(1,2)] i.e.e maximum of (0,1) which is 1. Hence, cell(2,2) will contain the value 1. This also means that the length of the longest common subsequence of the two strings A B and B D is one.

Now we calculate the value of cell(2,5), the second row contains the character D and the fifth row contains D as well. Both the characters match so the value of cell (2,5) will be 1 + the value of cell (2-1, 5-1) as per the discussions we had above. Li,j = Li-1, j-1 + 1.

In a similar way we can populate the whole table. The value in the cell (M,N) is the length of the longest common subsequence. In this case it is 4. So what is the running time of the algorithm?

Clearly it is the time taken to populate the whole table, there is a constant amount of work involved in looking at the two values, comparing them and figuring out the value of a given cell in the table, it actually depends on the previous calculations. And this constant work has to be done for M * N cells, so the total running time is proportional to M * N.
We can say that T = O(M * N) . We can also define it as a stricter bound by using the theta notation, but I don’t want to dwell into that at this moment of time, we can just understand that it is proportional to M * N.

Same with the space complexity, which looks to be proportional to M * N because we have used a table with M rows and N columns. Well that can be reduced to Max(M, N) as you can see to calculate the value of a given cell, we need the current row and 2 cells from the previous row.

Conclusion

So this is how we calculate the length of the longest subsequence, we also need to find the longest subsequences and write code for the same, which we plan to do in the next article. There we will calculate the actual time and space complexity of the whole algorithm along with part to find the LCS.

Following is C++ implementation of the above solution.

/* Dynamic Programming solution to find length of the longest common substring */
#include<iostream>
#include<string.h>
using namespace std;
// A utility function to find maximum of two integers
int max(int a, int b)
{   return (a > b)? a : b; }
/* Returns length of longest common substring of X[0..m-1] and Y[0..n-1] */
int LCSubStr(char *X, char *Y, int m, int n)
{
    // Create a table to store lengths of longest common suffixes of
    // substrings.   Notethat LCSuff[i][j] contains length of longest
    // common suffix of X[0..i-1] and Y[0..j-1]. The first row and
    // first column entries have no logical meaning, they are used only
    // for simplicity of program
    int LCSuff[m+1][n+1];
    int result = 0;  // To store length of the longest common substring
    /* Following steps build LCSuff[m+1][n+1] in bottom up fashion. */
    for (int i=0; i<=m; i++)
    {
        for (int j=0; j<=n; j++)
        {
            if (i == 0 || j == 0)
                LCSuff[i][j] = 0;
            else if (X[i-1] == Y[j-1])
            {
                LCSuff[i][j] = LCSuff[i-1][j-1] + 1;
                result = max(result, LCSuff[i][j]);
            }
            else LCSuff[i][j] = 0;
        }
    }
    return result;
}
/* Driver program to test above function */
int main()
{
    char X[] = "OldSite:GeeksforGeeks.org";
    char Y[] = "NewSite:GeeksQuiz.com";
    int m = strlen(X);
    int n = strlen(Y);
    cout << "Length of Longest Common Substring is " << LCSubStr(X, Y, m, n);
    return 0;
}

Output:

Length of Longest Common Substring is 10

Time Complexity: O(m*n)
Auxiliary Space: O(m*n)

END