Longest Common Subsequence: Python, C++ Example

What is Longest Common Subsequence?

Longest Common Subsequence (LCS) means you will be given two strings/patterns/sequences of objects. Among these two sequences/strings, you need to find the longest subsequence of elements in the same order present in both strings or patterns.

Example

For example, there are two strings provided.

Let’s assume that,

Pattern_1 = “RGBGARGA”
Pattern_2 = “BGRARG”

  • From pattern_1, sequences can be produced like “RGB”, “RGGA”, ”RGAR”. For creating a sequence, you need to maintain the relative position of each character in the string.
  • From pattern_2, we can produce sequences like “BGR”, ”BRAG”, ”RARG”. Sequences can be produced as long as they maintain the original string’s relative position.

The term relative position means order.

For example, “BRG” is a valid sequence because “B” appeared first, then “R,” and then “G” in the original string pattern_2. However, if a sequence is “RBRG”, it’s not valid. Because in the original string (pattern_2), “B” comes first.

Longest Common Subsequence

We have two options to find the Longest Common Subsequence from the given two sequences or arrays.

  • Naive method
  • Dynamic Programming Solution: Longest Common Subsequence is also known as LCS.

A naive Solution has larger time complexity and is not the optimal solution. Using Dynamic Programming Solution (DP), we overcome the complexity problem.

Naive Method

Naive method is a simple approach to the problem regardless of the time complexity and other optimization factors.

The Naive method consists of “Brute Force”, multiple loops, recursive methods in most cases.

The term brute force means going through all possible patterns for a given problem.

Example

From the above example of pattern1 and pattern2, let us assume pattern1 has a length of m and pattern2 has a length of n. To check every possible case, we need to evaluate every possible subsequence of pattern1 with pattern2.

Here’s a simple 4 letter string “ABCD”. For example, we need to create a sequence from “ABCD.” Either we can take a character or not. That means that, for each character, we have two choices.

They are:

  • The character will be added to the subsequence.
  • The character will not be added to the subsequence.

Here, the images show all the sequences that we can make from the string “ABCD”.

Naive Method

Sequence with 1 character:

Naive Method

Sequences with 2 characters:

Naive Method

Sequences with 3 characters:

Naive Method

From the above diagram, there’re 14 sequences. If we do not take letters, basically an empty string, the total sequences will be 15. Moreover, the string “ABCD” itself is a sequence. So, the total sequences are 16.

So, It is possible to generate 24 or 16 subsequences from “ABCD”. Then, a string with a length of m will have to total subsequence of 2m.

For each subsequence, we need to check it for the whole pattern2. It will take 0(n) time. 0(n) means the complexity function that calculates the time taken for execution.

So, total time complexity becomes O(n*2m). For the example, we have seen above the value of m=8 and n=5.

Here are the steps of the Naive Method:

Step 1) Take a sequence from the pattern1.
Step 2) Match the sequence from step1 with pattern2.
Step 3) If it matches, then save the subsequence.
Step 4) If more sequence is left in the pattern1, then go to step 1 again.
Step 5) Print the longest subsequence.

Optimal Substructure

The term optimal substructure means an optimal solution (simple) can be found by solving the subproblems. For example, in the above example, we have pattern1 and pattern2.

Step 1) Take the first two characters from each pattern

Step 2) Take the third to fifth characters from each pattern.

Step 3) Continue similarly with the remaining characters.

Optimal Substructure
Recursive Structure of LCS problem

We find the LCS on the sub-string (string generated from an original string). Then we keep the record of the length of the LCS of the substrings.

Now, here is another interesting property that is overlapping sub-problems. A problem is said to have overlapping sub-problems if the problem statement can be broken into small subproblems and used several times in the program.

The diagram below shows that the recursive algorithm called the function with the same parameter several times.

Optimal Substructure

For example, let’s see the recursion tree.

In the dark-colored box, you can notice overlapping sub-problems. (“RG”, “RA”), (“RG”,”R”) and others are called several times.

For optimizing this, we have the approach of Dynamic Programming (DP).

Recursive Method of Longest Comm Sequence

The graph is shown above is the recursive method. Each recursive function has a base case to break the recursion or start returning from its stack.

For this implementation, we’ll use a base case. So, the algorithm is like the following:

  • If all elements before the last element have a match, then increase the length by one and return
  • Pass two patterns to the function, and take the max value of the return
  • If one pattern has zero length, then we have no subsequence to compare. Return 0 in this case. This is the base case of the recursion.

Pseudo Code:

def lcs:
    input: pattern_1, pattern_2, len_1, len_2
if len_1 or len_2 is zero:
    then
return 0
if pattern_1[len_1 - 1] equals pattern_2[len_2 - 1]
then
return 1 + lcs(pattern_1, pattern_2,
    len_1 - 1, len_2 - 1)
else
    max of lcs(pattern_1, pattern_2, len_1 - 1, len_2),
    lcs(pattern_1, pattern_2, len_1, len_2 - 1)

Implementation in C++:

#include<iostream>
#include<bits/stdc++.h>
using namespace std;
int lcs(string pattern_1, string pattern_2, int len_1, int len_2) {
  if (len_1 == 0 || len_2 == 0)
    return 0;
  if (pattern_1[len_1 - 1] == pattern_2[len_2 - 1]) {
    return 1 + lcs(pattern_1, pattern_2, len_1 - 1, len_2 - 1);
  } else {
    return max(lcs(pattern_1, pattern_2, len_1 - 1, len_2), lcs(pattern_1, pattern_2, len_1, len_2 - 1));
  }
}
int main() {
  string pattern_1, pattern_2;
  pattern_1 = "RGBGARGA";
  pattern_2 = "BGRARG";
  cout<<"Length of LCS is: "<<lcs(pattern_1, pattern_2, pattern_1.size(), pattern_2.size())<<endl;
}

Output:

Length of LCS is: 5

Implementation in python:

def lcs(pattern_1, pattern_2, len_1, len_2):
    if len_1 == 0 or len_2 == 0:
    return 0
if pattern_1[len_1 - 1] == pattern_2[len_2 - 1]:
    return 1 + lcs(pattern_1, pattern_2, len_1 - 1, len_2 - 1)
else :
    return max(lcs(pattern_1, pattern_2, len_1 - 1, len_2), lcs(pattern_1, pattern_2, len_1, len_2 - 1))
pattern_1 = "RGBGARGA"
pattern_2 = "BGRARG"
print("Lenght of LCS is: ", lcs(pattern_1, pattern_2, len(pattern_1), len(pattern_2)))

Output:

Lenght of LCS is:  5

Dynamic Programming method of Longest Common Subsequence (LCS)

Dynamic programming means optimizing the plain recursive method. For example, if we see the recursive or naive approach graph, we can see that there are several same function calls. The Dynamic Programing method records all the calculations in an array and uses it when needed.

We’ll use a 2D array with dimensions of m x n, where m and n are lengths of pattern1 and pattern2.

For 2D array, we can use List data structures in python or vector/array data structures in C++.

Pseudo Code for LCS using DP:

LCS(pattern_1, pattern_2):
    m = length of pattern_1 + 1
n = length of pattern_2 + 1
dp[n][m]
for i in range 0 to n + 1:
    for j in range 0 to m + 1:
    if i or j equals to 0:
    dp[i][j] = 0
else if pattern_1[i] == pattern_2[j]:
    dp[i]][j] = dp[i - 1][j - 1] + 1
else :
    dp[i][j] = max(dp[i - 1][j], dp[i][j - 1])
return dp[n][m]


Here’s the table LCS that is used as a 2D array data structure for the dynamic programming approach.

Dynamic Programming method of lcs

Let’s discuss the logic we used here. Steps are:

Step 1) If i or j is zero, we are taking an empty string from the given two strings and trying to find the common subsequences. However, as the substring we’re taking is empty, the subsequence length is 0.

Step 2) If two character matches, we’ll assign the value to the (i,j) index by incrementing the previously calculated LCS, which is present in the (i-1,j-1) index (from the previous row).

Step 3) If it doesn’t match, then we will take the max LCS of the adjacent two indexes. And this way, we need to fill all the values in the 2D array.

Step 4) Finally, we will return the value of the last cell of the 2D array.

Basically, all the value in the 2D array contains the length of common subsequences. Among these, the last cell contains the length of the longest common subsequence.

Implementation in C++:


#include<iostream>
using namespace std;
int lcs(string pattern_1, string pattern_2) {
  int m = pattern_1.size();
  int n = pattern_2.size();
  // dp will store solutions as the iteration goes on
  int dp[n + 1][m + 1];
  for (int i = 0; i & lt; n + 1; i++) {
    for (int j = 0; j & lt; m + 1; j++) {
      if (i == 0 || j == 0) {
        dp[i][j] = 0;
      } else if (pattern_2[i - 1] == pattern_1[j - 1]) {
        dp[i][j] = dp[i - 1][j - 1] + 1;
      } else {
        dp[i][j] = max(dp[i - 1][j], dp[i][j - 1]);
      }
    }
  }
  return dp[n][m];
}
int main() {
  string pattern_1 = "RGBGARGA";
  string pattern_2 = "BGRARG";
  cout<<"Length of LCS: "<<lcs(pattern_1, pattern_2)<<endl;
}

Output:

Length of LCS: 5

Implementation in Python:

def lcs(pattern_1, pattern_2):
    m = len(pattern_1)
n = len(pattern_2)
# dp will store solutions as the iteration goes on
dp = [
    [None] * (n + 1) for item in range(m + 1)
]
for i in range(m + 1):
    for j in range(n + 1):
    if i == 0 or j == 0:
    dp[i][j] = 0
elif pattern_1[i - 1] == pattern_2[j - 1]:
    dp[i][j] = dp[i - 1][j - 1] + 1
else :
    dp[i][j] = max(dp[i - 1][j], dp[i][j - 1])
return dp[m][n]
pattern_1 = "RGBGARGA"
pattern_2 = "BGRARG"
print("Length of LCS: ", lcs(pattern_1, pattern_2))

Output:

Length of LCS: 5

So, both the strings have the longest common subsequence of length 5.

In a nutshell, we are simply calculating each task once in the DP method in the DP method. In the recursive method, we might have overlapping subproblems.

In this Dynamic Programming Algorithm, we are using a 2D matrix. There will be two strings given (assume both have length n). Then the space needed in the array is n x n. If the strings are large enough, we will need a memory-optimized version of the DP solution.

The simplified logic that was taken in the code is:

  • Declare a 2D Array DP[m][n].
  • Fill the first row and first column of the DP array with 0.
  • Take i and j for the iteration.
  • If pattern1[i] equals to pattern2[j], then update DP[i][j] = DP[i-1][j-1] + 1
  • If pattern1[i] not equals to pattern2[j], then DP[i][j] will be the maximum value between DP[i-1][j] and DP[i][j-1].
  • Continue until i and j reach m and n.
  • The last element or DP[m-1][n-1], will contain the length.

Here, it’s addressed as DP[m-1][n-1], because array index begins from 0.

Summary:

  • The DP method has lower time complexity; it is O(mn), where m and n are the length of the input string or array.
  • DP is a faster approach than the recursive one, with the time complexity of O(n*2m).
  • Dynamic programming (DP) is not memory optimized. We used a 2D array that has the length of m*n. So, the space complexity is (m*n).
  • Recursive method, in a worst-case scenario, the highest memory it will occupy will be m+n, basically the total length of the inputted string.
  • Today’s modern computer is sufficient to handle this amount of memory.