Hacker News | kmschaal's comments

Thank you!


Hi, thank you for your interest! I believe that most fuzzy search implementations lack accuracy in one aspect or another. The primary goals of my library are accuracy and query performance. However, I haven't looked into Fuse yet. I'm highly interested in hearing feedback from people who have tried both libraries with their datasets.


What is your definition of "accuracy" within the context of fuzzy search?


It's subjective, I have to admit. I would say a search is accurate if most people find what they are looking for in their dataset on the first try.

Distance definitions such as the Levenshtein and Damerau-Levenshtein distances provide a solid basis for discussions on accuracy. However, they are costly to compute and hence not widely adopted in fuzzy search libraries.

I started by using the known filter equation for the Levenshtein distance and computed a quality score with a lightweight formula. Then I realized that the filter equation can be extended to the Damerau-Levenshtein distance by sorting the characters of the 3-grams.
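This is not the library's actual code, but the trigram count filter being described can be sketched roughly like this (the function names and the unpadded trigram handling are my own assumptions):

```python
from collections import Counter

def trigrams(s, sort_chars=False):
    """Multiset of 3-grams of s; strings shorter than 3 yield none."""
    grams = [s[i:i + 3] for i in range(len(s) - 2)]
    if sort_chars:
        # Sorting the characters inside each gram makes the filter
        # tolerant of adjacent transpositions (Damerau-Levenshtein).
        grams = ["".join(sorted(g)) for g in grams]
    return Counter(grams)

def passes_filter(s, t, k, sort_chars=False):
    # Count filter: if edit_distance(s, t) <= k, the strings must share
    # at least max(|s|, |t|) - 2 - 3*k trigrams. Pairs sharing fewer can
    # be rejected cheaply, without computing the distance at all.
    common = sum((trigrams(s, sort_chars) & trigrams(t, sort_chars)).values())
    return common >= max(len(s), len(t)) - 2 - 3 * k
```

For example, "algorithm" vs. "algroithm" (one transposition) fails the plain Levenshtein filter at k = 1, but passes once the characters within each gram are sorted, which is the extension described above.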

In my tests, this implementation worked well. Please let me know how it works for you if you test it.


> It's subjective, I have to admit. I would say a search is accurate if most people find what they are looking for in their dataset in the first try.

I'd say a search is accurate if it finds what most closely matches the query, for some definition of "matches". A search is useful if most people can find what they are looking for on the first try.

That is, a search being accurate doesn't necessarily translate to usefulness, if people don't (or can't) know how to write those accurate queries.

I'd imagine this is why fuzzy searches exist. Fuzzier queries allow for a larger spectrum of possible matches, which means a larger set of queries can turn up those results someone is looking for. Queries do not have to be as precise, and writing useful queries is easier.

But to me it seems diametrically opposed to accuracy. Usefulness is a much more intuitive measure, because the query does not have to be perfectly accurate in order to find the right result.

Alternatively, you could focus on the quality of ranking of the returned matches: how often the correct result is near the top (and how near) when the user finds what they are looking for. Ideally you want this as high as possible.
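One standard way to operationalize that ranking measure is mean reciprocal rank (MRR): score each query by 1/rank of the correct result, then average. A minimal sketch, not tied to any particular search library:

```python
def mean_reciprocal_rank(results_per_query, relevant_per_query):
    """results_per_query: one ranked result list per query.
    relevant_per_query: the item the user was actually looking for.
    A query whose relevant item never appears contributes 0."""
    total = 0.0
    for results, relevant in zip(results_per_query, relevant_per_query):
        if relevant in results:
            total += 1.0 / (results.index(relevant) + 1)
    return total / len(results_per_query)
```

An MRR of 1.0 means the correct result was always ranked first; values near 0 mean it rarely showed up at all.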


Thank you for your explanation; I get your point. Regarding your last suggestion: I think it would be great to measure how often the correct result is near the top. However, don't we face the same issue as before? What is the correct result? Is it the term the user has in mind when writing the query? But what if they make a typo, and the term with the typo also exists in the dataset? Or what if they type only half of the term they are thinking of, but many terms in the dataset share that prefix?

So, in the end, I believe it's worthwhile to try different implementations and share our subjective experiences.


Yes, you are right.


I think What3Words is a nice idea in principle, but it could have been implemented much better. I am thinking of the following:

- have a fixed pattern, e.g. adjective.adjective.noun.

- create groups of words and put them in hierarchies. E.g. noun->animal->predator

- cover the world with a one-dimensional Hilbert curve

- increment the noun along the curve. When all nouns are exhausted, start again with the first noun and increment the adjective in the middle, and so on (analogous to incrementing a multi-digit number).

With this approach, the location purple.flying.tiger would be next to purple.flying.lion.
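The incrementing scheme described above behaves like a mixed-radix number with the noun as the least significant digit. A toy sketch with made-up word lists (a real scheme would need thousands of entries per slot):

```python
# Toy word lists (hypothetical; only for illustration).
ADJ1 = ["purple", "green", "silent"]
ADJ2 = ["flying", "sleeping", "hungry"]
NOUNS = ["tiger", "lion", "wolf", "bear"]  # noun varies fastest

def cell_to_words(index):
    # Treat the Hilbert-curve cell index as a mixed-radix number:
    # the noun is the least significant "digit", so neighbouring
    # cells usually differ only in the noun.
    index, noun = divmod(index, len(NOUNS))
    index, adj2 = divmod(index, len(ADJ2))
    adj1 = index % len(ADJ1)
    return f"{ADJ1[adj1]}.{ADJ2[adj2]}.{NOUNS[noun]}"
```

With these lists, cell 0 is purple.flying.tiger and cell 1 is purple.flying.lion, exactly the adjacency described above.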


I spent a few minutes exploring a similar approach. Using the EFF wordlist (7776 words designed to be unambiguous)[1], you should be able to get down to blocks of about 100 m^2 (good enough for SAR). The BIP39 word list (2048 words) is a bit too low-resolution, but it is a lot easier to pronounce; maybe worth some encoding shenanigans to get it to work.

For my POC, I also used a simple 1D Hilbert curve. Running a simulation and plotting the words, you actually get pretty decent resolution, and it's even alphabetical by distance. Output from a simple demo with BIP39 (math might not be correct): https://ibb.co/MnmcRFk but you can see that the word order actually makes sense (although "winner" and "winter" are too phonetically similar, but still, it gets the point across).

[1] https://www.eff.org/deeplinks/2016/07/new-wordlists-random-p...
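As a rough sanity check of the resolution arithmetic (my assumption: dividing Earth's total surface area, about 510 million km², by the number of three-word combinations, not accounting for land-only coverage):

```python
EARTH_SURFACE_M2 = 510e12      # approx. total surface area of Earth in m^2

eff_cells = 7776 ** 3          # three words from the EFF wordlist
bip39_cells = 2048 ** 3        # three words from the BIP39 list

eff_area = EARTH_SURFACE_M2 / eff_cells      # ~1,085 m^2 per cell (~33 m squares)
bip39_area = EARTH_SURFACE_M2 / bip39_cells  # ~59,000 m^2 per cell (~244 m squares)
```

So the EFF list gives cells roughly 33 m on a side over the whole globe (closer to 100 m^2 if restricted to land), while BIP39 lands around 244 m squares, which matches the "too low-resolution" assessment.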


That's exactly where it gets problematic: what was their location again? Uhm, it was purple flying something... some kind of big cat... lion or tiger...

Large conceptual distance between close physical locations is a feature not a bug.


It is not like that by design. Remember, this is backed by VCs who have already put in a lot of money and will at some point want to make a lot of money.

The three-words idea is patented, and the basic idea is to have a DRM-protected database that is needed to find the actual location. They play the long game: ultimately, selling access to the database will make them money.



Thank you for your kind feedback. As a backend developer, I wouldn't have been able to build this site from scratch. Instead, I opted for an HTML5 UP! template (https://html5up.net/), which turned out to be one of the best 20 bucks I've ever spent.


Thank you for sharing. The example, where the same sentence is presented with different labels, effectively illustrates the essence of the idea.


Thank you for your comment. Your experience leads me to believe that the concept might be more useful for non-native speakers. The labels have helped us shorten lengthy PR discussions and focus on the factual aspects. For context: I work at a German company with an international team, and we use English as our business language.


The blog post explores the idea that code review comments should be more than just polite and respectful – they should convey their intent clearly and effectively. As developers, we're well aware of the complexity of code, and communication around it should be streamlined and efficient.

Semantic comments come with labels that convey their purpose, whether it's a simple remark, a question seeking an answer, a hint for future consideration, a suggestion open for discussion, an important point requiring change, or a crucial issue that must be addressed.
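For illustration, labeled review comments might look like this (the wording is invented; the labels follow the categories listed above):

```text
question: Why do we retry three times here instead of reusing the existing backoff helper?
suggestion: We could extract this validation into a shared function; open to discussion.
important: This query runs once per item and will not scale; please batch it before merging.
```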


I can second this; in my experience, reading text backwards greatly helps with finding typos. Presumably this is because it makes it easier to force our brain to look at every single word in isolation, whereas in normal (forward) reading we process more than one word at a time.

