README.md 14 KB

dtl

dtl is the diff template library written in C++. The name of template is derived C++'s Template.

Table of contents

Features

dtl provides the functions for comparing two sequences have arbitrary type. But sequences must support random access_iterator.

Getting started

To start using this library, all you need to do is include dtl.hpp.

#include "dtl/dtl.hpp"

Compare two strings

First of all, calculate the difference between two strings.

typedef char elem;
typedef std::string sequence;
sequence A("abc");
sequence B("abd");
dtl::Diff< elem, sequence > d(A, B);
d.compose();

When the above code is run, dtl calculates the difference between A and B as Edit Distance and LCS and SES.

The meaning of these three terms is below.

Edit Distance Edit Distance is numerical value for declaring a difference between two sequences.
LCS LCS stands for Longest Common Subsequence.
SES SES stands for Shortest Edit Script. I mean SES is the shortest course of action for tranlating one sequence into another sequence.

If one sequence is "abc" and another sequence is "abd", Edit Distance and LCS and SES is below.

Edit Distance 2
LCS ab
SES C a C b D c A d
  • 「C」:Common
  • 「D」:Delete
  • 「A」:ADD

If you want to know in more detail, please see examples/strdiff.cpp.

This calculates Edit Distance and LCS and SES of two strings received as command line arguments and prints each.

When one string is "abc" and another string "abd", the output of strdiff is below.

$ ./strdiff abc abd
editDistance:2
LCS:ab
SES
 a
 b
-c
+d
$

Compare two data has arbitrary type

dtl can compare data has aribtrary type because of the C++'s template.

But the compared data type must support the random access_iterator.

In the previous example, the string data compared,

dtl can also compare two int vectors like the example below.

int a[10] = {1, 2, 3, 4, 5, 6, 7, 8, 9, 10};
int b[10] = {3, 5, 1, 4, 5, 1, 7, 9, 6, 10};
std::vector<int> A(&a[0], &a[10]);
std::vector<int> B(&b[0], &b[10]);
dtl::Diff< int > d(A, B);
d.compose();

If you want to know in more detail, please see examples/intdiff.cpp.

Merge three sequences

dtl has the diff3 function.

This function is that dtl merges three sequences.

Additionally dtl detects the confliction.

typedef char elem;
typedef std::string sequence;
sequence A("qqqabc");
sequence B("abc");
sequence C("abcdef");
dtl::Diff3<elem, sequence> diff3(A, B, C);
diff3.compose();
if (!diff3.merge()) {
  std::cerr << "conflict." << std::endl;
  return -1;
}
std::cout << "result:" << diff3.getMergedSequence() << std::endl;

When the above code is run, the output is below.

result:qqqabcdef

If you want to know in more detail, please see examples/strdiff3.cpp.

Patch function

dtl can also translates one sequence to another sequence with SES.

typedef char elem;
typedef std::string sequence;
sequence A("abc");
sequence B("abd");
dtl::Diff<elem, sequence> d(A, B);
d.compose();
string s1(A);
string s2 = d.patch(s1);

When the above code is run, s2 becomes "abd". The SES of A("abc") and B("abd") is below.

Common a
Common b
Delete c
Add    d

The patch function translates a sequence as argument with SES. For this example, "abc" is translated to "abd" with above SES.

Please see dtl's header files about the data structure of SES.

Difference as Unified Format

dtl can also treat difference as Unified Format. See the example below.

typedef char elem;
typedef std::string sequence;
sequence A("acbdeaqqqqqqqcbed");
sequence B("acebdabbqqqqqqqabed");
dtl::Diff<elem, sequence > d(A, B);
d.compose();             // construct an edit distance and LCS and SES
d.composeUnifiedHunks(); // construct a difference as Unified Format with SES.
d.printUnifiedFormat();  // print a difference as Unified Format.

The difference as Unified Format of "acbdeaqqqqqqqcbed" and "acebdabbqqqqqqqabed" is below.

@@ -1,9 +1,11 @@
 a
 c
+e
 b
 d
-e
 a
+b
+b
 q
 q
 q
@@ -11,7 +13,7 @@
 q
 q
 q
-c
+a
 b
 e
 d

The data structure Unified Format is below.

/**
 * Structure of Unified Format Hunk
 */
template <typename sesElem>
struct uniHunk {
  int a, b, c, d;                   // @@ -a,b +c,d @@
  std::vector<sesElem> common[2];   // anteroposterior commons on changes
  std::vector<sesElem> change;      // changes
  int inc_dec_count;                // count of increace and decrease
};

The actual blocks of Unified Format is this structure's vector.

If you want to know in more detail, please see examples/unistrdiff.cpp and examples/unidiff.cpp and dtl's header files.

In addtion, dtl has the function translates one sequence to another sequence with Unified Format.

typedef char elem;
typedef std::string sequence;
sequence A("abc");
sequence B("abd");
dtl::Diff<elem, sequence> d(A, B);
d.compose();
d.composeUnifiedHunks()
string s1(A);
string s2 = d.uniPatch(s1);

When the above code is run, s2 becomes "abd". The uniPatch function translates a sequence as argument with Unified Format blocks.

For this example, "abc" is translated to "abd" with the Unified Format block below.

@@ -1,3 +1,3 @@
 a
 b
-c
+d

Compare large sequences

When compare two large sequences, dtl can optimizes the calculation of difference with the onHuge function.

This function is available when the compared data type is std::vector.

When you use this function, you may call this function before calling compose function.

typedef char elem;
typedef  std::vector<elem> sequence;
sequence A;
sequence B;
/* ・・・ */
dtl::Diff< elem, sequence > d(A, B);
d.onHuge();
d.compose();

Unserious difference

The calculation of difference is very heavy. dtl uses An O(NP) Sequence Comparison Algorithm.

Though this Algorithm is sufficiently fast, when difference between two sequences is very large,

the calculation of LCS and SES needs massive amounts of memory.

dtl avoids above-described problem by dividing each sequence into plural subsequences and joining the difference of each subsequence finally.

As this way repeats allocating massive amounts of memory, dtl provides other way. It is the way of calculating unserious difference.

For example, The normal SES of "abc" and "abd" is below.

Common a
Common b
Delete c
Add    d

The unserious SES of "abc" and "abd" is below.

Delete a
Delete b
Delete c
Add    a
Add    b
Add    d

Of course, when "abc" and "abd" are compared with dtl, above difference is not derived.

dtl calculates the unserious difference when dtl judges the calculation of LCS and SES needs massive amounts of memory and unserious difference function is ON.

dtl joins the calculated difference before dtl judges it and unserious difference finally.

As a result, all difference is not unserious difference when unserious difference function is ON.

When you use this function, you may call this function before calling compose function.

typedef char elem;
typedef std::string sequence;
sequence A("abc");
sequence B("abd");
dtl::Diff< elem, sequence > d(A, B);
d.onUnserious();
d.compose();

Calculate only Edit Distance

As using onOnlyEditDistance, dtl calculates the only edit distance.

If you need only edit distance, you may use this function, because the calculation of edit distance is lighter than the calculation of LCS and SES.

When you use this function, you may call this function before calling compose function.

typedef char elem;
typedef std::string sequence;
sequence A("abc");
sequence B("abd");
dtl::Diff< elem, sequence > d(A, B);
d.onOnlyEditDistance();
d.compose();

Algorithm

The algorithm dtl uses is based on "An O(NP) Sequence Comparison Algorithm" by described by Sun Wu, Udi Manber and Gene Myers.

An O(NP) Sequence Comparison Algorithm(following, Wu's O(NP) Algorithm) is the efficient algorithm for comparing two sequences.

Computational complexity

The computational complexity of Wu's O(NP) Algorithm is averagely O(N+PD), in the worst case, is O(NP).

Comparison when difference between two sequences is very large

Calculating LCS and SES efficiently at any time is a little difficult.

Because that the calculation of LCS and SES needs massive amounts of memory when a difference between two sequences is very large.

The program uses that algorithm don't consider that will burst in the worst case.

dtl avoids above-described problem by dividing each sequence into plural subsequences and joining the difference of each subsequence finally. (This feature is supported after version 0.04)

Implementations with various programming languages

There are the Wu's O(NP) Algorithm implementations with various programming languages below.

https://github.com/cubicdaiya/onp

Examples

There are examples in dtl/examples. dtl uses SCons for building examples and tests. If you build and run examples and tests, install SCons.

strdiff

strdiff calculates a difference between two string sequences, but multi byte is not supported.

$ cd dtl/examples
$ scons strdiff
$ ./strdiff acbdeacbed acebdabbabed
editDistance:6
LCS:acbdabed
SES
  a
  c
+ e
  b
  d
- e
  a
- c
  b
+ b
+ a
+ b
  e
  d
$

intdiff

intdiff calculates a diffrence between two int arrays sequences.

$ cd dtl/examples
$ scons intdiff
$ ./intdiff # There are data in intdiff.cpp
1 2 3 4 5 6 7 8 9 10 
3 5 1 4 5 1 7 9 6 10 
editDistance:8
LCS: 3 4 5 7 9 10 
SES
- 1
- 2
  3
+ 5
+ 1
  4
  5
- 6
+ 1
  7
- 8
  9
+ 6
  10
$

unidiff

unidiff calculates a diffrence between two text file sequences, and output the difference between files with unified format.

$ cd dtl/examples
$ scons unidiff
$ cat a.txt
a
e
c
z
z
d
e
f
a
b
c
d
e
f
g
h
i
$ cat b.txt
a
d
e
c
f
e
a
b
c
d
e
f
g
h
i
$ ./unidiff a.txt b.txt
--- a.txt       2008-08-26 07:03:28 +0900
+++ b.txt       2008-08-26 03:02:42 +0900
@@ -1,11 +1,9 @@
 a
-e
-c
-z
-z
 d
 e
+c
 f
+e
 a
 b
 c
$

unistrdiff

unistrdiff calculates a diffrence between two string sequences. and output the difference between strings with unified format.

$ cd dtl/examples
$ scons unistrdiff
$ ./unistrdiff acbdeacbed acebdabbabed
editDistance:6
LCS:acbdabed
@@ -1,10 +1,12 @@
 a
 c
+e
 b
 d
-e
 a
-c
 b
+b
+a
+b
 e
 d
$

strdiff3

strdiff3 merges three string sequence and output the merged sequence. When the confliction has occured, output the string "conflict.".

$ cd dtl/examples
$ scons strdiff3
$ ./strdiff3 qabc abc abcdef
result:qabcdef
$

There is a output below when conflict occured.

$ ./strdiff3 adc abc aec
conflict.
$

intdiff3

intdiff3 merges three integer sequence(vector) and output the merged sequence.

$ cd dtl/examples
$ scons intdiff3
$ ./intdiff3
a:1 2 3 4 5 6 7 3 9 10
b:1 2 3 4 5 6 7 8 9 10
c:1 2 3 9 5 6 7 8 9 10
s:1 2 3 9 5 6 7 3 9 10
intdiff3 OK
$

patch

patch is the test program. Supposing that there are two strings is called by A and B, patch translates A to B with Shortest Edit Script or unified format difference.

$ cd dtl/examples
$ scons patch
$ ./patch abc abd
before:abc
after :abd
patch successed
before:abc
after :abd
unipatch successed
$

fpatch

fpatch is the test program. Supposing that there are two files is called by A and B, fpatch translates A to B with Shortest Edit Script or unified format difference.

$ cd dtl/examples
$ scons fpatch
$ cat a.txt
a
e
c
z
z
d
e
f
a
b
c
d
e
f
g
h
i
$ cat b.txt
$ cat b.txt
a
d
e
c
f
e
a
b
c
d
e
f
g
h
i
$ ./fpatch a.txt b.txt
fpatch successed
unipatch successed
$

Running tests

dtl uses googletest and SCons with testing dtl-self.

Building test programs

If you build test programs for dtl, run scons in test direcotry.

$ scons

Running test programs

If you run all tests for dtl, run 'scons check' in test direcotry. (it is necessary that gtest is compiled)

$ scons check

If you run sectional tests, you may exeucte dtl_test directly after you run scons. Following command is the example for testing only Strdifftest.

$ ./dtl_test --gtest_filter='Strdifftest.*'

--gtest-filters is the function of googletest. googletest has many useful functions for testing software flexibly. If you want to know other functions of googletest, run ./dtl_test --help.

Old commit histories

Please see cubicdaiya/dtl-legacy.

License

Please read the file COPYING.