Fuzzy Matching, Now With Forgiveness

Even when you add fudge to a matching system, occasionally that odd bit of data — the outlier — can wreck an otherwise close match. The question is, how many of those mismatches does your code allow?

As an example, consider Figure 1.

Figure 1. A data set with really close values, plus two exceptions.

Figure 1. A data set with really close values, plus two exceptions.

You see two data sets, both of which graph a fairly similar line, listing values hovering around 60.0. Yet, in the Sample data set, two values are way off. They could be misreads or just anomalies. For 10 samples, that would be enough to throw the results, but for a greater number of samples, you could throw out those two items when doing a fuzzy comparison.

To account for mismatches in my code, I create a mismatch variable. It’s triggered when the two values compare outside the variation, or the fudge.

In this Lesson’s example, the values are all real numbers (float). To get the variation, the two array elements are compared:

variation = fabs(target[x]-sample[x]);

The fabs() function works like abs(), but it compares floating-point values. (Technically, double values.) It’s defined in the math.h header. The next line checks to see whether the variation is within the tolerance:

if ( variation > target[x]*tolerance )

The variable tolerance is set to a percentage, such as 0.10 for 10 percent. If the value is outside of that range, then the fuzzy comparison fails — unless you allow for a given number of mismatches. In my code (below), I increment the mismatch variable for each outlier. Then a comparison ensures that mismatch total remains below 3. When the code hits the third mismatch, I just assume that the arrays are dissimilar.

Here’s the full code:

#include <stdio.h>
#include <math.h>

#define COUNT 10
#define TRUE 1
#define FALSE 0

int main()
{
    float target[COUNT] = {
        60.7, 60.4, 59.8, 61.2, 58.8,
        61.1, 60.3, 59.7, 60.6, 60.0 };
    float sample[COUNT] = {
        58.7, 60.1, 61.0, 19.4, 61.8,
        59.5, 75.0, 60.4, 59.2, 60.1 };
    int x,match,mismatch;
    float variation,tolerance;

    match = TRUE;       /* initialize match */
    tolerance = 0.10;   /* percentage tolerance */
    mismatch = 0;       /* number of mismatches */

    /* compare arrays */
    for(x=0; x<COUNT; x++)
    {
        variation = fabs(target[x]-sample[x]);
        if( variation > target[x]*tolerance)
        {
            mismatch++;
            if(mismatch > 2)
            {
                match = FALSE;
                break;
            }
        }
    }

    /* display results */
    if(match)
        printf("The arrays are similar\n");
    else
        printf("The arrays are not similar\n");
    printf("With %.f%% tolerance and %d mismatches\n",
            tolerance*100,
            mismatch);
            mismatch);

    return(0);
}

Here’s sample output:

The arrays are similar
With 10% tolerance and 2 mismatches

You can adjust various items in the code to account for more or less fudge. The 10 percent value can be altered at Line 20. The number of mismatches is set at Line 30.

You could craft the mismatch value based on the size of the arrays, which is a good idea. So for a 10-element array, a single mismatch might be enough &mdash or not enough. Setting the number of allowable mismatches to a percentage of the array size is a good idea.

The bottom line is that you can craft code to perform a fuzzy match, given a certain amount of wiggle room for the data plus a few mismatches to ensure that the arrays are similar.

Leave a Reply