Fuzzy Matching with Percentage Variation

A fuzzy match that uses a discrete amount of fudge might not yield a match, especially when the values cover a wide range. To improve the odds of a match, set a percentage variation instead.

For example, suppose the variation is fixed at 2.0. For the values 1.2 and 1.7, the fuzzy match works. But if the values are 209.7 and 206.2, the match would fail. If both sets of numbers are in the same series, then you might want to try a percentage variation instead of a fixed, or discrete, value.

In my code, I base the percentage on the value fetched from the first array. For example, if element target[0] is 24, the tolerance is 10 percent of that value, or 2.4. As long as the comparison value, sample[0], is within the range of 24 ±2.4, the values match.

When the values get larger, the percentage stays the same but the tolerance it produces grows. If target[9] is set to 95, then its 10 percent tolerance is 95 ±9.5. The value of sample[9] can be anywhere from 85.5 to 104.5 (for integers, 86 through 104) and the fuzzy match passes.

Here is sample code, based on the code from last week’s Lesson:

#include <stdio.h>
#include <stdlib.h>

#define COUNT 10
#define TRUE 1
#define FALSE 0

int main()
{
    int target[COUNT] = { 24, 28, 32, 45, 50,
                          66, 67, 70, 80, 95 };
    int sample[COUNT] = { 26, 26, 30, 42, 50,
                          61, 67, 75, 85, 99 };
    int x,match,variation;
    float tolerance;

    match = TRUE;       /* initialize match */
    tolerance = 0.10;   /* percentage tolerance */

    /* compare arrays */
    for(x=0; x<COUNT; x++)
    {
        variation = abs(target[x]-sample[x]);
        if( (float)variation > (float)target[x]*tolerance)
        {
            match = FALSE;
            break;
        }
    }

    /* display results */
    if(match)
        printf("The arrays are similar, within %.f%% tolerance\n",
                tolerance*100);
    else
        printf("The arrays are not similar, within %.f%% tolerance\n",
                tolerance*100);

    return(0);
}

The float variable tolerance is set to the fuzz factor, the percentage by which a compared value is allowed to differ. It’s set to 0.10 in Line 18, which is 10 percent.

Remember that percentages are decimal values: 50 percent is 0.50, not 50.0.

The for loop at Line 21 plows through the two arrays, one pair of elements at a time. At Line 23 in the loop, the difference between the paired elements’ values is calculated. The abs() function ensures that the difference is a positive value.

The if comparison at Line 24 multiplies the original array element’s value by the tolerance percentage. When the value of variation is greater than that result, the match fails. The values of variation and target[x] are typecast to float so that the comparison is made entirely with floating-point values.

Lines 33 and 36 display the results, along with the percentage value used for the comparison.

Here is a sample run:

The arrays are similar, within 10% tolerance

You can set the value of variable tolerance to another percentage in Line 18. So if you want to use 5 percent, set the value to 0.05.

In this sample code, however, a 5 percent tolerance fails the match, as does anything below about 8.3 percent. That’s because the smaller values in the samples require a higher percentage than the larger values: the first pair, 24 and 26, differs by 2, which is about 8.3 percent of 24. Discrete values may make more sense for this type of comparison. Still, it’s puzzling when you look at Figure 1 and see how similar the lines are, yet a tolerance much tighter than 10 percent doesn’t result in a positive fuzzy match.

Figure 1. The two data sets graphed as lines.

The solution is to add another level of forgiveness. I cover that fuzzy feature in next week’s Lesson.
