Fuzzy Wuzzy
Spent the last week or so investigating and experimenting with fuzzy matching via recursion. There are two applications for it in the D-Lev librarian: command hints when an unknown command is entered, and knob command page name resolution. Previously I was relying on prefix matching plus the minimum Levenshtein distance between the pattern and the candidates. Prefix works well, but despite the hype Levenshtein is a really poor fit here - it's more of a spell-check / diff sort of tool, geared toward whole-string edit distance rather than matching a short abbreviation against a longer name.
I ran across this excellent explanation of fuzzy matching, with a functional example that's really fun to play with:
https://www.forrestthewoods.com/blog/reverse_engineering_sublime_texts_fuzzy_match/#.d05n81yjy
The method entails comparing each letter of the pattern (in sequence) to each letter of the candidate (in sequence) and scoring/weighting certain special conditions associated with the individual character matches. A first-letter match is obviously important, as are consecutive matches, camelCase matches, and matches that occur right after a separator. Doing it well (rather than greedily) requires an exhaustive exploration of the solution space and picking the highest score.
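Roughly, the per-character weighting might look like this in Go - the conditions are the ones listed above, but the names and weights here are made up for illustration, not the librarian's actual values:

```go
package main

import "fmt"

// charBonus scores a single character match based on its context in the
// candidate. Illustrative only: weights and conditions are placeholders.
func charBonus(candidate string, idx int, consecutive bool) int {
	bonus := 0
	if idx == 0 {
		bonus += 15 // first character of the candidate
	} else {
		prev := candidate[idx-1]
		cur := candidate[idx]
		if prev == '_' || prev == '-' || prev == ' ' {
			bonus += 10 // match right after a separator
		}
		if prev >= 'a' && prev <= 'z' && cur >= 'A' && cur <= 'Z' {
			bonus += 10 // camelCase boundary
		}
	}
	if consecutive {
		bonus += 5 // directly follows the previous match
	}
	return bonus
}

func main() {
	fmt.Println(charBonus("knobPage", 0, false)) // first char: 15
	fmt.Println(charBonus("knobPage", 4, false)) // camelCase 'P': 10
	fmt.Println(charBonus("knob_page", 5, true)) // after separator, consecutive: 15
}
```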
Fuzzy matching isn't built into the Go standard library, and the examples I found out there were kinda messy and hard to follow (and therefore hard to trust). It may be possible to implement an exhaustive approach with two or more nested loops (?) but I decided to walk the solution space via recursion, another thing I have no real experience with. A loop compares a given pattern char to each char of the remaining candidate string, and each match recurses on the next char of both, returning the score and a flag indicating whether the entire pattern was traversed or not. Once that was working I added the scoring weights.
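Here's a minimal self-contained sketch of that recursive walk, with toy weights (just a skipped-char penalty and a consecutive-match bonus) - illustrative only, not the actual librarian code:

```go
package main

import "fmt"

// fuzzyMatch recursively explores every way the pattern can be laid over the
// candidate, returning the best score found and a flag for whether the whole
// pattern was matched. Toy weights: -1 per skipped candidate char and +5 for
// a consecutive match; the real scoring weighs more conditions than this.
func fuzzyMatch(pattern, candidate string, prevMatched bool) (int, bool) {
	if len(pattern) == 0 {
		return -len(candidate), true // leftover candidate chars cost -1 each
	}
	best, matched := 0, false
	for i := 0; i < len(candidate); i++ {
		if pattern[0] != candidate[i] {
			continue
		}
		score := -i // candidate chars skipped to reach this match
		if prevMatched && i == 0 {
			score += 5 // this match directly follows the previous one
		}
		if sub, ok := fuzzyMatch(pattern[1:], candidate[i+1:], true); ok {
			if total := sub + score; !matched || total > best {
				best, matched = total, true
			}
		}
	}
	return best, matched
}

func main() {
	fmt.Println(fuzzyMatch("flt", "filter_freq", false)) // -3 true
}
```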
Another function, which deals with a list of candidates, calls this on each one and sorts the results by score and match flag. There's a >100k word list on most Linux installs (/usr/share/dict/words) that I used for experimenting with the various weights. While it works absolutely fine for the purposes of the librarian, I'm not 1000% happy with the general results, but that's probably inevitable with almost any heuristic approach.
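And a rough sketch of the candidate-list side - score everything, then sort with full matches first and higher scores ahead of lower ones. The fuzzyMatch here is just a greedy stand-in so the snippet compiles on its own:

```go
package main

import (
	"bufio"
	"fmt"
	"log"
	"os"
	"sort"
)

// result holds the fuzzy outcome for one candidate.
type result struct {
	name    string
	score   int
	matched bool
}

// fuzzyMatch is a greedy stand-in for the recursive scorer sketched above:
// in-order subsequence check, -1 per skipped candidate char.
func fuzzyMatch(pattern, candidate string) (int, bool) {
	score, p := 0, 0
	for i := 0; i < len(candidate) && p < len(pattern); i++ {
		if candidate[i] == pattern[p] {
			p++
		} else {
			score--
		}
	}
	return score, p == len(pattern)
}

// rankCandidates scores every candidate and sorts the results: full matches
// first, then by descending score.
func rankCandidates(pattern string, candidates []string) []result {
	results := make([]result, 0, len(candidates))
	for _, c := range candidates {
		score, ok := fuzzyMatch(pattern, c)
		results = append(results, result{c, score, ok})
	}
	sort.Slice(results, func(i, j int) bool {
		if results[i].matched != results[j].matched {
			return results[i].matched
		}
		return results[i].score > results[j].score
	})
	return results
}

func main() {
	// The big word list found on most Linux installs, handy for tuning.
	f, err := os.Open("/usr/share/dict/words")
	if err != nil {
		log.Fatal(err)
	}
	defer f.Close()
	var words []string
	s := bufio.NewScanner(f)
	for s.Scan() {
		words = append(words, s.Text())
	}
	ranked := rankCandidates("fzy", words)
	for i := 0; i < 10 && i < len(ranked); i++ {
		fmt.Println(ranked[i].name, ranked[i].score, ranked[i].matched)
	}
}
```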
These sorts of exercises can be criticized as "reinventing the wheel" but they're fun and give me a deeper understanding of the concepts - and of the programming language itself. Designing my Hive processor, assembly language, math libraries, and simulator gave me an understanding of computing that I could get nowhere else. To me it's somewhat akin to a musician practicing a piece - is that wasted time? Much of school learning is working problems. Also, 90% of the code out there is crap (which for all I know includes mine) so it's hard to want to pull in anything that isn't from an anointed, polished, and heavily banged-on standard library.
[EDIT] A few small repairs to the fuzzy mechanics and scoring:
1. Making the best score update only at full pattern match eliminates any chance that a higher unmatched score gets reported as the final matched score.
2. Feeding the score forward rather than having it returned makes the best score updating and recursion somewhat cleaner.
3. Best score and match flag now live at the outer function level (effectively globals within the function), with the recursion happening via a nested sub-function (see the sketch below).
4. Unmatched leading chars penalty is now -1 per char.
5. So the minimum unmatched score is then -len(target).
6. No need for match char bonus.
#1 precludes non-fully matched scoring, but whatever. I'm finally very happy with the results!
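For reference, the revised structure looks something like this - names, weights, and details are illustrative, not the actual librarian code:

```go
package main

import "fmt"

// fuzzyScore returns the best score over all full matches of pattern within
// target, plus a flag for whether any full match exists. The recursion lives
// in a nested function; bestScore and matched act as "globals" within the
// outer function, and the running score is fed forward as a parameter.
func fuzzyScore(pattern, target string) (int, bool) {
	bestScore := 0
	matched := false

	var walk func(p, t, score int)
	walk = func(p, t, score int) {
		if p == len(pattern) {
			// Only a full pattern match may update the best score.
			if !matched || score > bestScore {
				bestScore, matched = score, true
			}
			return
		}
		for i := t; i < len(target); i++ {
			if target[i] == pattern[p] {
				// Unmatched chars skipped to reach this match cost -1 each;
				// no per-match bonus in this version.
				walk(p+1, i+1, score-(i-t))
			}
		}
	}
	walk(0, 0, 0)

	if !matched {
		// Minimum unmatched score: -1 per target char.
		return -len(target), false
	}
	return bestScore, true
}

func main() {
	fmt.Println(fuzzyScore("osc", "oscillator")) // 0 true
	fmt.Println(fuzzyScore("xyz", "oscillator")) // -10 false
}
```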
[EDIT2] Even more tinkering with the scoring:
7. Matched chars get a penalty of the distance from the start (-t_idx).
8. Matched chars get a bonus of 60 / (t_idx + 1).
9. A penalty of the length of the target is applied at the end of scoring (-t_len).
10. So the minimum unmatched score is then -2 * len(target).
#7 lightly biases everything towards the beginning chars of the target.
#8 is very non-linear and heavily biases the first chars of the target, and this bias continues inward for some distance in a milder fashion.
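To see the shape of #7 and #8 in integer math, here's the per-match delta, 60 / (t_idx + 1) - t_idx, at a few indices:

```go
package main

import "fmt"

func main() {
	// Per-match score delta under the EDIT2 weights: a front-loaded bonus of
	// 60/(t_idx+1) minus a light linear penalty of t_idx, all in integer math.
	// The -t_len penalty is applied once at the end of scoring, not here.
	for _, tIdx := range []int{0, 1, 2, 3, 5, 10, 20, 59} {
		delta := 60/(tIdx+1) - tIdx
		fmt.Printf("t_idx=%2d  bonus=%2d  penalty=%3d  delta=%3d\n",
			tIdx, 60/(tIdx+1), -tIdx, delta)
	}
}
```

The bonus bottoms out at zero for t_idx >= 60, after which only the light linear penalty is left.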
Integers are used for scoring weights, and integer resolution is kind of a problem, though the weights could be scaled up to mitigate this somewhat. I assume ints are used to maximize the speed.
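For example, scaling the bonus numerator up by 10x (and the other weights along with it, to keep their relative sizes) spreads out values that integer division would otherwise collapse:

```go
package main

import "fmt"

func main() {
	// With a bonus of 60/(t_idx+1) in integer math, nearby indices collapse
	// to the same value; scaling the weight up (here by 10x) restores some
	// resolution while staying in integers.
	for tIdx := 10; tIdx <= 14; tIdx++ {
		fmt.Printf("t_idx=%2d  60/(t_idx+1)=%d  600/(t_idx+1)=%d\n",
			tIdx, 60/(tIdx+1), 600/(tIdx+1))
	}
}
```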