Solving the Longest Common Subsequence Problem Concerning Non-Uniform Distributions of Letters in Input Strings

被引：2

作者：

Nikolic, Bojan ^{[1
]}

Kartelj, Aleksandar ^{[2
]}

Djukanovic, Marko ^{[1
,3
]}

Grbic, Milana ^{[1
]}

Blum, Christian ^{[4
]}

Raidl, Gunther ^{[3
]}

机构：

[1] Univ Banja Luka, Fac Nat Sci & Math, Banja Luka 78000, Bosnia & Herceg

[2] Univ Belgrade, Fac Math, Belgrade 105104, Serbia

[3] TU Wien, Fac Informat, Inst Log & Computat, A-1040 Vienna, Austria

[4] Artificial Intelligence Res Inst IIIA CSIC, Campus UAB, Bellaterra 08193, Spain

来源：

MATHEMATICS | 2021年 / 9卷 / 13期

关键词：

longest common subsequence problem; multi-nomial distribution; probability-based search guidance; BEAM SEARCH; ALGORITHMS; SEQUENCES;

D O I：

10.3390/math9131515

中图分类号：

O1 [数学];

学科分类号：

0701 ; 070101 ;

摘要：

The longest common subsequence (LCS) problem is a prominent NP-hard optimization problem where, given an arbitrary set of input strings, the aim is to find a longest subsequence, which is common to all input strings. This problem has a variety of applications in bioinformatics, molecular biology and file plagiarism checking, among others. All previous approaches from the literature are dedicated to solving LCS instances sampled from uniform or near-to-uniform probability distributions of letters in the input strings. In this paper, we introduce an approach that is able to effectively deal with more general cases, where the occurrence of letters in the input strings follows a non-uniform distribution such as a multinomial distribution. The proposed approach makes use of a time-restricted beam search, guided by a novel heuristic named Gmpsum. This heuristic combines two complementary scoring functions in the form of a convex combination. Furthermore, apart from the close-to-uniform benchmark sets from the related literature, we introduce three new benchmark sets that differ in terms of their statistical properties. One of these sets concerns a case study in the context of text analysis. We provide a comprehensive empirical evaluation in two distinctive settings: (1) short-time execution with fixed beam size in order to evaluate the guidance abilities of the compared search heuristics; and (2) long-time executions with fixed target duration times in order to obtain high-quality solutions. In both settings, the newly proposed approach performs comparably to state-of-the-art techniques in the context of close-to-uniform instances and outperforms state-of-the-art approaches for non-uniform instances.

引用

页数：25

共 47 条

[1] Repetition-free longest common subsequence
Adi, Said S.
Braga, Marilia D. V.
Fernandes, Cristina G.
Ferreira, Carlos E.
Martinez, Fabio Viduani
Sagot, Marie-France
Stefanes, Marco A.
Tjandraatmadja, Christian
Wakabayashi, Yoshiko
[J]. DISCRETE APPLIED MATHEMATICS, 2010, 158 (12) : 1315 - 1324
[2] [Anonymous], 1997, ACM SIGACT NEWS
[3] A beam search approach to the container loading problem
Araya, I.
Riff, M. -C.
[J]. COMPUTERS & OPERATIONS RESEARCH, 2014, 43 : 100 - 107
[4] Impact of familiarity on information complexity in human-computer interfaces
Bakaev, Maxim
[J]. 2016 INTERNATIONAL CONFERENCE ON MEASUREMENT INSTRUMENTATION AND ELECTRONICS (ICMIE 2016), 2016, 75
[5] A new algorithm for "the LCS problem" with application in compressing genome resequencing data
Beal, Richard
Afrin, Tazin
Farheen, Aliya
Adjeroh, Donald
[J]. BMC GENOMICS, 2016, 17
[6] A survey of longest common subsequence algorithms
Bergroth, L
Hakonen, H
Raita, T
[J]. SPIRE 2000: SEVENTH INTERNATIONAL SYMPOSIUM ON STRING PROCESSING AND INFORMATION RETRIEVAL - PROCEEDINGS, 2000, : 39 - 48
[7] Beutelspacher A., 1996, KRYPTOLOGIE, V7
[8] Blum C., 2016, Metaheuristics for String Problems in Bioinformatics, P45
[9] Blum C, 2007, LECT NOTES COMPUT SC, V4638, P150
[10] Beam search for the longest common subsequence problem
Blum, Christian
Blesa, Maria J.
Lopez-Ibanez, Manuel
[J]. COMPUTERS & OPERATIONS RESEARCH, 2009, 36 (12) : 3178 - 3186

← 1 2 3 4 5 →