{"id":2133,"date":"2016-09-24T00:01:31","date_gmt":"2016-09-24T07:01:31","guid":{"rendered":"http:\/\/c-for-dummies.com\/blog\/?p=2133"},"modified":"2016-09-17T13:12:24","modified_gmt":"2016-09-17T20:12:24","slug":"fuzzy-matching-now-with-forgiveness","status":"publish","type":"post","link":"https:\/\/c-for-dummies.com\/blog\/?p=2133","title":{"rendered":"Fuzzy Matching, Now With Forgiveness"},"content":{"rendered":"<p>Even when you add fudge to a matching system, occasionally that odd bit of data &mdash; the outlier &mdash; can wreck an otherwise close match. The question is, how many of those mismatches does your code allow?<br \/>\n<!--more--><br \/>\nAs an example, consider Figure 1.<\/p>\n<div id=\"attachment_2134\" style=\"width: 560px\" class=\"wp-caption aligncenter\"><img loading=\"lazy\" decoding=\"async\" aria-describedby=\"caption-attachment-2134\" src=\"http:\/\/c-for-dummies.com\/blog\/wp-content\/uploads\/2016\/09\/0924-figure1.png\" alt=\"Figure 1. A data set with really close values, plus two exceptions.\" width=\"550\" height=\"367\" class=\"size-full wp-image-2134\" srcset=\"https:\/\/c-for-dummies.com\/blog\/wp-content\/uploads\/2016\/09\/0924-figure1.png 550w, https:\/\/c-for-dummies.com\/blog\/wp-content\/uploads\/2016\/09\/0924-figure1-300x200.png 300w, https:\/\/c-for-dummies.com\/blog\/wp-content\/uploads\/2016\/09\/0924-figure1-450x300.png 450w\" sizes=\"auto, (max-width: 550px) 100vw, 550px\" \/><p id=\"caption-attachment-2134\" class=\"wp-caption-text\">Figure 1. A data set with really close values, plus two exceptions.<\/p><\/div>\n<p>You see two data sets, both of which graph a fairly similar line, listing values hovering around 60.0. Yet, in the <em>Sample<\/em> data set, two values are way off. They could be misreads or just anomalies. 
For 10 samples, that would be enough to throw the results, but for a greater number of samples, you could throw out those two items when doing a fuzzy comparison.<\/p>\n<p>To account for mismatches in my code, I create a <code>mismatch<\/code> variable. It&#8217;s incremented when two values differ by more than the allowed variation, or the <em>fudge<\/em>.<\/p>\n<p>In this Lesson&#8217;s example, the values are all real numbers (<em>float<\/em>). To get the variation, one array element is subtracted from the other:<\/p>\n<p><code>variation = fabs(target[x]-sample[x]);<\/code><\/p>\n<p>The <em>fabs()<\/em> function works like <em>abs()<\/em>, but it returns the absolute value of a floating-point value. (Technically, a <em>double<\/em> value.) It&#8217;s declared in the <code>math.h<\/code> header. The next line checks to see whether the variation exceeds the tolerance:<\/p>\n<p><code>if ( variation &gt; target[x]*tolerance )<\/code><\/p>\n<p>The variable <code>tolerance<\/code> is set to a percentage, such as <code>0.10<\/code> for 10 percent. If the variation is outside of that range, then the fuzzy comparison fails &mdash; unless you allow for a given number of mismatches. In my code (below), I increment the <code>mismatch<\/code> variable for each outlier. Then a comparison ensures that the mismatch total remains below 3. 
When the code hits the third mismatch, I just assume that the arrays are dissimilar.<\/p>\n<p>Here&#8217;s the full code:<\/p>\n<pre class=\"screen\">\r\n#include &lt;stdio.h&gt;\r\n#include &lt;math.h&gt;\r\n\r\n#define COUNT 10\r\n#define TRUE 1\r\n#define FALSE 0\r\n\r\nint main()\r\n{\r\n    float target[COUNT] = {\r\n        60.7, 60.4, 59.8, 61.2, 58.8,\r\n        61.1, 60.3, 59.7, 60.6, 60.0 };\r\n    float sample[COUNT] = {\r\n        58.7, 60.1, 61.0, 19.4, 61.8,\r\n        59.5, 75.0, 60.4, 59.2, 60.1 };\r\n    int x,match,mismatch;\r\n    float variation,tolerance;\r\n\r\n    match = TRUE;       <span class=\"comments\">\/* initialize match *\/<\/span>\r\n    tolerance = 0.10;   <span class=\"comments\">\/* percentage tolerance *\/<\/span>\r\n    mismatch = 0;       <span class=\"comments\">\/* number of mismatches *\/<\/span>\r\n\r\n    <span class=\"comments\">\/* compare arrays *\/<\/span>\r\n    for(x=0; x&lt;COUNT; x++)\r\n    {\r\n        variation = fabs(target[x]-sample[x]);\r\n        if( variation &gt; target[x]*tolerance)\r\n        {\r\n            mismatch++;\r\n            if(mismatch &gt; 2)\r\n            {\r\n                match = FALSE;\r\n                break;\r\n            }\r\n        }\r\n    }\r\n\r\n    <span class=\"comments\">\/* display results *\/<\/span>\r\n    if(match)\r\n        printf(\"The arrays are similar\\n\");\r\n    else\r\n        printf(\"The arrays are not similar\\n\");\r\n    printf(\"With %.f%% tolerance and %d mismatches\\n\",\r\n            tolerance*100,\r\n            mismatch);\r\n\r\n    return(0);\r\n}<\/pre>\n<p>Here&#8217;s sample output:<\/p>\n<pre><code>The arrays are similar\r\nWith 10% tolerance and 2 mismatches<\/code><\/pre>\n<p>You can adjust various items in the code to account for more or less fudge. The 10 percent value can be altered at Line 20. 
The number of mismatches is set at Line 30.<\/p>\n<p>You could also base the mismatch limit on the size of the arrays. For a 10-element array, a single mismatch might be enough &mdash; or not enough. Setting the number of allowable mismatches to a percentage of the array size lets the limit scale with the data.<\/p>\n<p>The bottom line is that you can craft code to perform a fuzzy match, given a certain amount of wiggle room for the data plus a few allowed mismatches, to determine whether the arrays are similar.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>Add a rule to allow for exceptions along with the fuzz. <a href=\"https:\/\/c-for-dummies.com\/blog\/?p=2133\">Continue reading <span class=\"meta-nav\">&rarr;<\/span><\/a><\/p>\n","protected":false},"author":1,"featured_media":0,"comment_status":"open","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[2],"tags":[],"class_list":["post-2133","post","type-post","status-publish","format-standard","hentry","category-main"],"_links":{"self":[{"href":"https:\/\/c-for-dummies.com\/blog\/index.php?rest_route=\/wp\/v2\/posts\/2133","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/c-for-dummies.com\/blog\/index.php?rest_route=\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/c-for-dummies.com\/blog\/index.php?rest_route=\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/c-for-dummies.com\/blog\/index.php?rest_route=\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/c-for-dummies.com\/blog\/index.php?rest_route=%2Fwp%2Fv2%2Fcomments&post=2133"}],"version-history":[{"count":3,"href":"https:\/\/c-for-dummies.com\/blog\/index.php?rest_route=\/wp\/v2\/posts\/2133\/revisions"}],"predecessor-version":[{"id":2146,"href":"https:\/\/c-for-dummies.com\/blog\/index.php?rest_route=\/wp\/v2\/posts\/2133\/revisions\/2146"}],"wp:attachment":[{"href":"https:\/\/c-for-dummies.com\/blog\/index.php?rest_route=%2Fwp%2Fv2%2Fmedi
a&parent=2133"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/c-for-dummies.com\/blog\/index.php?rest_route=%2Fwp%2Fv2%2Fcategories&post=2133"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/c-for-dummies.com\/blog\/index.php?rest_route=%2Fwp%2Fv2%2Ftags&post=2133"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}