Fuzzy Matching To The Rescue: Aligning Survey Design Across Time

Surveys are a valuable tool for any market research company. As a leading global information and measurement company, Nielsen has developed complex models and methodologies that hinge on the accuracy of survey data we use in our products. Survey data not only provides insights about what people watch, listen to and buy, but it also helps media companies define and reach their target audiences.

Obtaining these insights is not without challenges. Over time, surveys are typically modified to collect new data, or to improve the quality of the information collected from respondents. It’s not just that new questions get introduced, but old questions might receive a new treatment, often with new answer choices added to the mix. While this can greatly improve the value of a survey, those changes can introduce inconsistencies each time the survey is administered.

For instance, take this question: How frequently do you purchase dental floss in your household? Respondents have two predefined answer choices: ‘(1) 0-2 times in the past month’; and ‘(2) 3+ times in the past month’. To help tabulate the data and retain some meaning in the metadata, analysts decide to create two variables: ‘Dental Floss: Light Users: 0-2 Times/Last Month: Total Category’ and ‘Dental Floss: Users: 3+ Times/Last Month: Total Category’. Why Total Category? Because there might be many variants in the market: waxed, multifilament, mint-flavored, etc.

Now suppose that six months later, the same survey is administered to a new group of respondents, with the same exact question, but the variable names have been changed to ‘Dental Floss: Times/Last Month: Light (0-2)’ and ‘Dental Floss: Times/Last Month: Heavy (3+)’ because those names are shorter, or we don’t care about different varieties after all, or they make more sense according to a new survey-wide naming convention. Wait another six months, and we might add a medium tier: ‘Dental Floss: Times/Last Month: Light (0-2)’, ‘Dental Floss: Times/Last Month: Medium (3-4)’ and ‘Dental Floss: Times/Last Month: Heavy (5+)’.

In real life, naming conventions change all the time, either on purpose or by accident. How then do we match that data over time? With the right domain expertise, the solution might be simple enough for one or two variables, but some surveys have thousands of variables. For example, at Nielsen, we’re working with one survey that covers attitudes, usage, and purchasing information for over 6,000 products and contains 20,000 variables across 26 categories. Every time it gets refreshed (twice a year), approximately 80% of the questions remain the same, and 20% involve new questions and modified answer choices. That means that roughly 4,000 of those 20,000 variables need to be examined and lined up against previous data.

Specifically, matching responses requires recognizing changes in formatting, choices, questions and categories, as well as identifying new additions and deletions. The manual effort takes two weeks, just for that one survey, and is prone to tabulation mistakes and errors of interpretation. That’s where machine learning can help, in particular a type of algorithm that involves fuzzy string matching.

In string matching problems, the Levenshtein algorithm is a natural place to start. It’s a simple and efficient dynamic programming solution used to calculate the minimum number of character substitutions, insertions and deletions that are necessary to convert one word into another; that minimum is the “distance” between the two words. In our case, those words are the names of the survey labels (data fields) that may have changed from one survey iteration to the next and need to be harmonized to allow analysts to compute trends. Taking our solution one step further, we developed a model that breaks down each label into separate sections, or cells, according to certain structural characteristics and computes the Levenshtein distance within each of those cells. And because we’re dealing with problems where thousands of such calculations need to take place in short order, we parallelized the code to apply it more efficiently to large problem sets.
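
To make the idea concrete, here is a minimal Python sketch of the Levenshtein distance and of one possible cell-based comparison, applied to the dental floss labels from earlier. It is an illustration only, not our production model. Splitting labels on colons, lowercasing, normalizing each distance by the longer cell's length, and averaging each cell's best match in both directions are all assumptions made for this example, and the function names (levenshtein, split_cells, cell_distance, best_match) are hypothetical.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions and
    substitutions needed to turn string a into string b."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop
    previous = list(range(len(b) + 1))  # distances from "" to prefixes of b
    for i, ca in enumerate(a, start=1):
        current = [i]  # distance from a[:i] to ""
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(
                previous[j] + 1,         # delete ca
                current[j - 1] + 1,      # insert cb
                previous[j - 1] + cost,  # substitute (or keep) the character
            ))
        previous = current
    return previous[-1]


def split_cells(label: str) -> list:
    """Assumption: cells are the colon-separated parts of a label."""
    return [cell.strip().lower() for cell in label.split(":") if cell.strip()]


def normalized(a: str, b: str) -> float:
    """Levenshtein distance scaled to [0, 1] by the longer string's length."""
    longest = max(len(a), len(b))
    return levenshtein(a, b) / longest if longest else 0.0


def cell_distance(old: str, new: str) -> float:
    """Average, over every cell in both labels, of the distance to the
    closest cell in the other label (0.0 means identical cells)."""
    old_cells, new_cells = split_cells(old), split_cells(new)
    if not old_cells or not new_cells:
        return 0.0 if old_cells == new_cells else 1.0
    forward = [min(normalized(o, n) for n in new_cells) for o in old_cells]
    backward = [min(normalized(n, o) for o in old_cells) for n in new_cells]
    return (sum(forward) + sum(backward)) / (len(forward) + len(backward))


def best_match(old: str, candidates: list) -> str:
    """Return the candidate label with the smallest cell-based distance."""
    return min(candidates, key=lambda c: cell_distance(old, c))


old_label = "Dental Floss: Light Users: 0-2 Times/Last Month: Total Category"
new_labels = [
    "Dental Floss: Times/Last Month: Light (0-2)",
    "Dental Floss: Times/Last Month: Medium (3-4)",
    "Dental Floss: Times/Last Month: Heavy (5+)",
]
print(best_match(old_label, new_labels))
```

In this sketch, the old ‘Light Users’ label lines up with the new ‘Light (0-2)’ label because the reordered cells are compared with each other rather than position by position across the whole string. Each old-versus-new comparison is independent of the others, so a large batch could be spread across processes, for example with Python's concurrent.futures.ProcessPoolExecutor.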

Our innovative cell-based comparison model outperforms the existing word-based comparison models by a substantial margin, and we’re looking forward to sharing the details of our approach in an upcoming issue of the journal.