Find the Longest Repeated Substring with T-SQL


Find the longest repeated substring with T-SQL

Step 1: Create a number table based on the stirng length but it would be better to use an auxiliary numbers table;
Step 2: Cross join Numbers table to create positions and steps for all substring combinations;
Step 3: Calculate the occurrence of substrings in various lengths, the length for each substring and a sequence within a substring based on the step. The WHERE condition is to eliminate double count for any substrings after the last full step increaments;
Step 4: Get all substrings with at least two occurrrences;
Step 5: Remove repeating substring with overlapping and get the ranking order for the longest substrings;
Step 6: Retrieve the longest repeating substring(s) with two options. The option1 is to get longest string with tie breaker (first occurrence) by using rnLen=1 and the second option is to get longest strings without tie breaker by using rnkLen=1;

In order to change the default recursion limit from 100 to no limit for CTEs, the query hints MAXRECURSION is set to 0.

/************************************************************************************/
Declare @MyString varchar(max)
–Plain DNA Sequence Sample
http://www.genomatix.de/online_help/help/sequence_formats.html

Set @MyString =’ACAAGATGCCATTGTCCCCCGGCCTCCTGCTGCTGCTGCTCTCCGGGGCCACGGCCACCGCTGCCCTGCC
CCTGGAGGGTGGCCCCACCGGCCGAGACAGCGAGCATATGCAGGAAGCGGCAGGAATAAGGAAAAGCAGC
CTCCTGACTTTCCTCGCTTGGTGGTTTGAGTGGACCTCCCAGGCCAGTGCCGGGCCCCTCATAGGAGAGG
AAGCTCGGGAGGTGGCCAGGCGGCAGGAAGGCGCACCCCCCCAGCAATCCGCGCGCCGGGACAGAATGCC
CTGCAGGAACTTCTTCTGGAAGACCTTCTCCTCCTGCAAATAAAACCTCACCCATGAATGCTCACGCAAG’

— Set @MyString = ‘3.141592653589793238462643383279502884197169399375105820974944592307816406286208998628034825342117067982148086513282306647093844609550582231725359408128481117450284102701938521105559644622948954930381964428810975665933446128475648233’

;WITH Numbers AS
(
SELECT 1 AS i
UNION ALL
SELECT i + 1 AS i FROM Numbers WHERE i = Step
)

,mycte3 as
(
SELECT Step,StartPos, SubSeq, maxLen, rnSubSeq
FROM mycte2
WHERE cntSubSeq>1
)

,mycte4 as
(
SELECT a.SubSeq, a.StartPos, a.maxLen
,RANK() OVER (order by a.maxLen DESC) AS rnkLen
,Row_Number() OVER (order by a.maxLen DESC,a.Step) AS rnLen
FROM mycte3 a
LEFT JOIN mycte3 b ON a.SubSeq=b.SubSeq and (a.rnSubSeq=b.rnSubSeq-1 OR b.rnSubSeq-1 = 0)
WHERE (a.StartPos+a.Step-1) < b.StartPos

)

SELECT SubSeq as [LognestRepeatingSubstring] FROM mycte4
WHERE rnkLen=1
Order by maxLen DESC, StartPos

OPTION (maxrecursion 0)

Advertisements


Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s