Boyer-Moore string search algorithm explanation and formal verification using CBMC
Previously I did this with Knuth-Morris-Pratt algorithm: part I, part II, part III.
So again, let's try to reinvent string-search algorithm.
Version 1
If a substring would be compared in reverse order, things may be different. For example, we search for a 'CAT' substring in 'HORSE TURTLE CAT' string.
With naive algorithm, we will first compare 'H' and 'C', we see they are unequal characters and we will advance substring one character ahead:
HORSE TURTLE CAT CAT ^
Right, but what if we will start at the end of 'CAT' substring proceeding back? We first compare 'R' with 'T' (last character of 'CAT') and we see that they are not equal. We could advance substring one character ahead, but... Can't we see that 'R' character in the middle of 'HORSE' part is absent in the 'CAT' substring? Can't this tell us something useful? Yes -- since 'R' is not in 'CAT', we can advance 'CAT' substring more, because we know that this will fail for sure:
HORSE TURTLE CAT CAT ^
This too:
HORSE TURTLE CAT CAT ^
So if we see a character not in 'CAT', we can advance 3 characters ahead:
HORSE TURTLE CAT CAT ^
But this will fail too, also, the space (' ') character is absent in 'CAT' too. OK, we advance 3 characters ahead:
HORSE TURTLE CAT CAT ^
No, 'R' is not in 'CAT', advance again 3 characters ahead:
HORSE TURTLE CAT CAT ^
Failing again... Adding 3 characters:
HORSE TURTLE CAT CAT ^
This is important. 'A' != 'T', but 'A' is in the 'CAT' substring, so this time we advance only 1 character ahead, as in naive search algorithm:
HORSE TURTLE CAT CAT ^
Here we will compare 3 characters and return success.
All this is way faster than naive string search algorith, and is a strong competitor with Knuth-Morris-Pratt algorithm.
This is the source code of a function that is hardcoded to search for 'cat' substring, nothing else:
unsigned search_cat_BM_v1 (char *s, unsigned len) { if (len<3) return len; // not found unsigned skip; for (unsigned i=0; i<len-2; ) { skip=1; // default skip char char3=s[i+2]; // read operation if (char3=='t') { char char2=s[i+1]; // read operation if (char2=='a') { char char1=s[i]; // read operation if (char1=='c') return i; // found else { // 1st character isn't 'c' if ((char1!='c') && (char1!='a') && (char1!='t')) skip=3; } } else { // 2nd character isn't 'a' if ((char2!='c') && (char2!='a') && (char2!='t')) skip=3; } } else { // 3rd character isn't 't' if ((char3!='c') && (char3!='a') && (char3!='t')) skip=3; } i=i+skip; } return len; // not found };
As compared with a naitve brute-force string algorithm, it shows a significant advantage:
** cacat search_cat_brute read_ops=7 search_cat_BM_v1 read_ops=5 ** cac cat search_cat_brute read_ops=10 search_cat_BM_v1 read_ops=5 ** cac cac cat search_cat_brute read_ops=17 search_cat_BM_v1 read_ops=7 ** horse turtle cat search_cat_brute read_ops=16 search_cat_BM_v1 read_ops=8 ** horse cat turtle search_cat_brute read_ops=9 search_cat_BM_v1 read_ops=5
Version 2
Now the next observation. If we search for 'CAT' and 'CACAT', we see that 'C' != 'T', but 'C' is in the 'CAT' substring, so we would advance by one character.
CACAT CAT ^
But... we can also align substring so that the first 'C' character in 'CAT' would match the next 'C' in 'CACAT', by advancing for 2 characters ahead:
CACAT CAT ^
We add a rule: "if a mismatched character in string exist in substring, align substring so that they would match" or, in other words, "if 'C' in string isn't equal to the last character in 'CAT' substring, advance by 2 characters":
if (char3=='t') { ... } else { // 3rd character isn't 't' if ((char3!='c') && (char3!='a') && (char3!='t')) skip=3; if (char3=='c') // added skip=2; // added } i=i+skip;
It may perform even better on some strings:
** cacat ** cacat search_cat_brute read_ops=7 search_cat_brute read_ops=7 search_cat_BM_v1 read_ops=5 search_cat_BM_v2 read_ops=4
As of aligning 'A' character in the middle of 'CAT' substring -- we wouldn't bother, because advancing substring by one character it's the same as is done by default.
Again, our algorithm is hardcoded only for 'CAT'.
And this is the universal algorithm from the R.Sedgewick's book, the first function construct a table of 'skips', the second does the actual search (Java code):
public BoyerMoore(String pat) { this.R = 256; this.pat = pat; // position of rightmost occurrence of c in the pattern right = new int[R]; for (int c = 0; c < R; c++) right[c] = -1; for (int j = 0; j < pat.length(); j++) right[pat.charAt(j)] = j; } ... public int search(String txt) { int m = pat.length(); int n = txt.length(); int skip; for (int i = 0; i <= n - m; i += skip) { skip = 0; for (int j = m-1; j >= 0; j--) { if (pat.charAt(j) != txt.charAt(i+j)) { skip = Math.max(1, j - right[txt.charAt(i+j)]); break; } } if (skip == 0) return i; // found } return n; // not found
( src )
And what if we search for 'COCOS' substring? What if characters are repeating within search pattern? Well, we will count only rightmost characters, ignoring two first 'CO'.
Formal verification using CBMC
I did that already with my homebrew Knith-Morris-Pratt algorithm.
But this time, CMBC have a hard time. It can't finish unwinding:
... Unwinding loop search_cat_BM_v1.0 iteration 42 file BM_cat_v1.c line 36 function search_cat_BM_v1 thread 0 Unwinding loop search_cat_BM_v1.0 iteration 43 file BM_cat_v1.c line 36 function search_cat_BM_v1 thread 0 Unwinding loop search_cat_BM_v1.0 iteration 44 file BM_cat_v1.c line 36 function search_cat_BM_v1 thread 0 Unwinding loop search_cat_BM_v1.0 iteration 45 file BM_cat_v1.c line 36 function search_cat_BM_v1 thread 0 Unwinding loop search_cat_BM_v1.0 iteration 46 file BM_cat_v1.c line 36 function search_cat_BM_v1 thread 0 Unwinding loop search_cat_BM_v1.0 iteration 47 file BM_cat_v1.c line 36 function search_cat_BM_v1 thread 0 ...
I think (not sure) CBMC have a problem with understanding loops like these:
for (unsigned i=0; i<len-2; ) { skip=1; ... { if ((char3!='c') && (char3!='a') && (char3!='t')) skip=3; } i=i+skip; }
In other words, where a loop's iterator is modified inside a loop body.
I had to unroll the loop:
unsigned search_cat_brute (char *s, unsigned len) { if (len<3) return len; // not found for (unsigned i=0; i<len-2; i++) { if (s[i]=='c' && s[i+1]=='a' && s[i+2]=='t') return i; }; return len; // not found }; unsigned loop_body (char *s, unsigned *i_ptr) { unsigned i=*i_ptr; // get i unsigned skip=1; // default skip char char3=s[i+2]; // read operation if (char3=='t') { char char2=s[i+1]; // read operation if (char2=='a') { char char1=s[i]; // read operation if (char1=='c') { return 1; // found } else { // 1st character isn't 'c' if ((char1!='c') && (char1!='a') && (char1!='t')) skip=3; } } else { // 2nd character isn't 'a' if ((char2!='c') && (char2!='a') && (char2!='t')) skip=3; } } else { // 3rd character isn't 't' if ((char3!='c') && (char3!='a') && (char3!='t')) skip=3; if (char3=='c') skip=2; } i=i+skip; *i_ptr=i; // store back return 0; }; unsigned search_cat_unrolled_v2 (char *s, unsigned len) { if (len<3) return len; // not found unsigned i=0; // for len=[3..15] if (loop_body(s, &i)) return i; if (i>=(len-2)) return len; if (loop_body(s, &i)) return i; if (i>=(len-2)) return len; if (loop_body(s, &i)) return i; if (i>=(len-2)) return len; if (loop_body(s, &i)) return i; if (i>=(len-2)) return len; if (loop_body(s, &i)) return i; if (i>=(len-2)) return len; if (loop_body(s, &i)) return i; if (i>=(len-2)) return len; if (loop_body(s, &i)) return i; if (i>=(len-2)) return len; if (loop_body(s, &i)) return i; if (i>=(len-2)) return len; if (loop_body(s, &i)) return i; if (i>=(len-2)) return len; if (loop_body(s, &i)) return i; if (i>=(len-2)) return len; if (loop_body(s, &i)) return i; if (i>=(len-2)) return len; if (loop_body(s, &i)) return i; if (i>=(len-2)) return len; if (loop_body(s, &i)) return i; if (i>=(len-2)) return len; return len; // not found }; #ifdef CBMC void check() { unsigned len=LEN; char s[len]; __CPROVER_assert (search_cat_brute(s, len)==search_cat_unrolled_v2(s, len), "assert"); }; #endif
So crude, but worked!
% time cbmc --trace --function check -DLEN=14 -DCBMC BM_cat_1_v2.c CBMC version 5.10 (cbmc-5.10) 64-bit x86_64 linux Parsing BM_cat_1_v2.c Converting Type-checking BM_cat_1_v2 Generating GOTO Program Adding CPROVER library (x86_64) Removal of function pointers and virtual functions Generic Property Instrumentation Running with 8 object bits, 56 offset bits (default) Starting Bounded Model Checking Unwinding loop search_cat_brute.0 iteration 1 file BM_cat_1_v2.c line 11 function search_cat_brute thread 0 Unwinding loop search_cat_brute.0 iteration 2 file BM_cat_1_v2.c line 11 function search_cat_brute thread 0 Unwinding loop search_cat_brute.0 iteration 3 file BM_cat_1_v2.c line 11 function search_cat_brute thread 0 Unwinding loop search_cat_brute.0 iteration 4 file BM_cat_1_v2.c line 11 function search_cat_brute thread 0 Unwinding loop search_cat_brute.0 iteration 5 file BM_cat_1_v2.c line 11 function search_cat_brute thread 0 Unwinding loop search_cat_brute.0 iteration 6 file BM_cat_1_v2.c line 11 function search_cat_brute thread 0 Unwinding loop search_cat_brute.0 iteration 7 file BM_cat_1_v2.c line 11 function search_cat_brute thread 0 Unwinding loop search_cat_brute.0 iteration 8 file BM_cat_1_v2.c line 11 function search_cat_brute thread 0 Unwinding loop search_cat_brute.0 iteration 9 file BM_cat_1_v2.c line 11 function search_cat_brute thread 0 Unwinding loop search_cat_brute.0 iteration 10 file BM_cat_1_v2.c line 11 function search_cat_brute thread 0 Unwinding loop search_cat_brute.0 iteration 11 file BM_cat_1_v2.c line 11 function search_cat_brute thread 0 Unwinding loop search_cat_brute.0 iteration 12 file BM_cat_1_v2.c line 11 function search_cat_brute thread 0 size of program expression: 2118 steps simple slicing removed 2 assignments Generated 1 VCC(s), 1 remaining after simplification Passing problem to propositional reduction converting SSA Running propositional reduction Post-processing Solving with MiniSAT 2.2.1 with simplifier 29262 variables, 75920 clauses SAT checker: instance is UNSATISFIABLE Runtime decision procedure: 0.36729s ** Results: [check.assertion.1] assert: SUCCESS ** 0 of 1 failed (1 iteration) VERIFICATION SUCCESSFUL
Now important note.
If you modify these lines so that this conditional expression would compare with other characters...
{ // 1st character isn't 'c' if ((char1!='c') && (char1!='a') && (char1!='t')) skip=3; }
CBMC will still verify this algorithm and prove its correctness, because a bug here wouldn't make the algorithm incorrect. It will degrade its performance to a level of naive string search algorithm, as if 'skip=3' here would be replaced with 'skip=1'. It will still work correctly, but slower.
But this algorithm is not finished.
I reimplemented it from R.Sedgewick book.
And he states:
It coincides with 'delta1' rule from the original Boyer-Moore paper. But there is also a 'delta2' rule, which is just another story for another blog post.
Files
Please drop me email about bug(s) and/or suggestion(s): blog@yurichev.com. List of other blog posts. BTW, I'm teaching. Follow me in social networks: Twitter, Telegram, GitHub, Discord, Facebook.
from Hacker News https://ift.tt/3v84VBW
No comments:
Post a Comment
Note: Only a member of this blog may post a comment.