Data Cleaning: How It Works and Why You Should Do It

Jan 21, 2021 | Marketplace

If you’re looking for powerful insights, you want to ensure that your survey is getting the right responses.  Of course, the first step to fielding an effective study is building a well-designed survey – but it doesn’t end there. What’s next? Data cleaning! 

To get the most actionable and reliable responses from a survey, data cleaning is just as important as survey design. Data cleaning can help you determine the relevance, reliability, and accuracy of survey responses. It’s an imperative step to take before making insights-based decisions – which is why we recommend that all Lucid buyers conduct data cleaning as they receive survey responses. 

What exactly is data cleaning?

Data cleaning is the process of reviewing the data you’ve collected to ensure respondent attentiveness and response validity. In general, we give survey respondents the benefit of the doubt – since they’ve opted in to provide answers and receive an incentive for completing your survey. Data cleaning simply ensures the data collected is high quality and reliable so that it can be used to make important business decisions.

As we mentioned, Lucid expects our customers to perform data checks and data cleaning on the survey responses they collect. Following data cleaning, buyers can reconcile any unusable completes so that they are not held financially accountable for them.

When should you conduct data cleaning?

There are a few different phases of a study when you should conduct data cleaning. The Lucid team has first-hand experience with data cleaning on projects that we program and host. We’ll share our process for cleaning data and making recommendations for removals, as we recommend the same steps for buyers programming their own surveys. 

Step 1: Pre-Launch Data Checks 

In addition to quality assurance in survey programming, we run simulated data through the survey platform to perform data checks on all survey elements before launching the project. We check the data to ensure the following elements are working as intended:

  • Survey logic (including screening conditions, skip conditions, etc.)
  • Quota qualifications, logic, and limits
  • Data capture for all required questions
  • Required questions cannot be skipped
  • Response labels match the data map

Step 2: Soft Launch Data Checks

As a standard, we recommend that customers soft launch for about 10% of the total required sample or 100 completes, whichever is less. We use the data collected during the soft launch to perform the aforementioned data checks on live respondent data before proceeding to collect the entire sample.

Step 3: Full Launch Data Cleaning

After the soft launch, we perform data cleaning twice over the course of survey fielding:

  1. 60% Data Collection
  2. 90% Data Collection

How do you conduct data cleaning?

Typically, we clean the data looking at the following survey elements and question types, though each may not be applicable to every survey:

Length of Interview

Reviewing the amount of time a respondent spent on a particular question, or on the survey as a whole, is important. It can indicate areas where the respondent may have selected responses without thoroughly reading the question or carefully thinking about their response.

As a standard, Lucid treats the median length of interview (LOI) as the expected time it takes to complete the survey. The industry definition of a “speeder” is any respondent who has completed a survey in less than ⅓ of the median LOI.

By default, we remove speeders from the survey results – and we add survey validation to automatically terminate respondents who complete within the designated speeder threshold time or less. We use the quality term redirect to communicate to our Marketplace partners why the respondent was terminated. 

Please note that survey validation must be implemented after a soft launch (not before), as the only way to accurately gauge LOI is with surveys that are in-field. Setting a speeder term prior to launch might – and often does – term valid respondents. 
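The speeder rule described above – flag anyone whose LOI is under one third of the median – can be sketched in a few lines. This is an illustrative implementation, not Lucid’s platform code; the function name and the use of seconds are assumptions.

```python
from statistics import median

def flag_speeders(lois_seconds, threshold_fraction=1 / 3):
    """Flag respondents whose length of interview (LOI) falls below a
    fraction of the median LOI. The default 1/3-of-median cutoff follows
    the industry definition of a "speeder" described above."""
    cutoff = median(lois_seconds) * threshold_fraction
    return [loi < cutoff for loi in lois_seconds]

# Median LOI here is 600s, so the speeder cutoff is 200s:
lois = [580, 600, 620, 150, 610]
print(flag_speeders(lois))  # [False, False, False, True, False]
```

Note how this mirrors the caveat about soft launches: the cutoff depends on the median of real fielded data, so it cannot be set meaningfully before respondents have completed the survey.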

Straightlining / Patterned Responses

Another area of data cleaning is to look at the responses on grid questions in the survey. If a respondent selects the same answer option (“C”) over and over, their engagement in the survey may be suspect. As a default, we flag respondents who select the same response for at least five rows in a grid.

Respondents also create patterns on grid questions, though these are less obvious in the data and thus harder to identify.

At Lucid, we think about data cleaning during the programming process, and we often program in validation to flag respondents who straightline specific questions in a survey.

Respondents who create patterns or pictures with their responses must be manually identified, though visualizing the data can help to identify these respondents. 
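The five-identical-rows default described above amounts to a run-length check on a respondent’s grid answers. Here is a minimal sketch, assuming each grid is represented as an ordered list of selected options (the function name and data shape are hypothetical):

```python
def flag_straightliner(grid_responses, run_length=5):
    """Flag a respondent who gives the same answer on `run_length` or
    more consecutive rows of a grid question. The default of five rows
    mirrors the threshold described above."""
    run = 1
    for prev, cur in zip(grid_responses, grid_responses[1:]):
        run = run + 1 if cur == prev else 1
        if run >= run_length:
            return True
    return False

print(flag_straightliner(["C", "C", "C", "C", "C", "B"]))  # True
print(flag_straightliner(["A", "B", "C", "A", "B", "C"]))  # False
```

A simple run check like this catches straightlining but not zigzag or picture patterns, which is why those still call for the manual, visual review mentioned above.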

Text Open End Questions

Lucid recommends asking only one open-ended question for every five minutes of respondent time to yield the best results. So, for a 15-minute survey, only three open-ended questions are recommended.

Too many open-ended questions can lead to respondent fatigue. They can also be a good indicator that you should do qualitative research before creating a quantitative online survey. See our blog on quantitative and qualitative research to help determine if your study should be approached differently.

To clean open-end responses, it is helpful to sort them in alphabetical order to quickly and easily spot low-effort or nonsensical text such as “good” or “dfksjfdkj.”
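The sort-and-scan approach above can be partly automated. The sketch below sorts responses alphabetically and flags two rough signals of junk text – very short answers and vowel-free strings (typical of keyboard mashing like “dfksjfdkj”). The heuristic and thresholds are illustrative assumptions, not a Lucid standard, and a human should still review anything flagged:

```python
def suspicious_open_ends(responses, min_length=3):
    """Sort open-end responses alphabetically and flag likely junk:
    very short answers, or strings with no vowels (a rough heuristic
    for keyboard-mash text)."""
    flagged = []
    for text in sorted(responses, key=str.lower):
        stripped = text.strip()
        no_vowels = not any(ch in "aeiou" for ch in stripped.lower())
        if len(stripped) < min_length or no_vowels:
            flagged.append(text)
    return flagged

print(suspicious_open_ends(["I liked the taste", "dfksjfdkj", "ok"]))
# ['dfksjfdkj', 'ok']
```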

Inconsistent or Unrealistic Responses

Inconsistent or unrealistic responses can take place on a number of different question types, including Numeric Open End and Single Select.

We advise researchers to think about unrealistic responses to questions when designing and programming the survey, as validation can be used to curb impractical responses. Here are some examples of spotting impractical responses. 

How many times have you gone for a run in the past 12 months? 

If a respondent answers 600, that answer is likely unrealistic because there are only 365 days in a year.

How many hours a week do you watch TV? 

If a respondent answers 75 hours a week, that answer is likely unrealistic because there are 168 hours in a week and typically 40 are spent working, 56 spent sleeping, etc. 
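Checks like the two examples above boil down to a plausibility range per numeric question. A minimal sketch, with hypothetical question keys and bounds derived from the reasoning above (at most one run per day; TV hours capped by waking, non-working time):

```python
# Hypothetical plausibility bounds per numeric question; tune per study.
BOUNDS = {
    "runs_past_12_months": (0, 365),  # at most one run per day
    "tv_hours_per_week": (0, 72),     # 168 hours minus ~40 working, ~56 sleeping
}

def out_of_range(question, value):
    """Return True when a numeric answer falls outside the plausible
    range configured for its question."""
    low, high = BOUNDS[question]
    return not (low <= value <= high)

print(out_of_range("runs_past_12_months", 600))  # True: more runs than days
print(out_of_range("tv_hours_per_week", 75))     # True: exceeds plausible hours
```

As the article notes, the same bounds can be programmed as survey validation up front, so impractical answers are caught while the respondent is still in the survey.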

What is your birth year and age?

If a respondent is asked for their birth year as a single-select question and selects “1988,” and then at the end of the survey is asked their age and selects “35-44,” their data is inconsistent.

What is your relationship status? Number of children in your home? Total number of people in your household?

If a respondent indicates that they are married and have three children, but then says they have three people in their household, their data is inconsistent – a married respondent with three children implies a household of at least five people.
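Both cross-question checks above can be expressed as simple rules over a respondent’s answers. In this sketch the field names and record shape are hypothetical, and the birth-year check assumes the survey fielded in 2021 (the article’s publication year):

```python
def consistency_flags(resp, field_year=2021):
    """Check a respondent's answers against each other, using the two
    example checks described above. `resp` is a dict of answers."""
    flags = []
    # Birth year vs. self-reported age bracket.
    age = field_year - resp["birth_year"]
    low, high = resp["age_bracket"]  # e.g. (35, 44) for "35-44"
    if not (low <= age <= high):
        flags.append("age mismatch")
    # Married with children implies a household of children + 2 or more.
    if resp["marital_status"] == "married":
        if resp["household_size"] < resp["num_children"] + 2:
            flags.append("household size too small")
    return flags

resp = {"birth_year": 1988, "age_bracket": (35, 44),
        "marital_status": "married", "num_children": 3, "household_size": 3}
print(consistency_flags(resp))  # ['age mismatch', 'household size too small']
```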

When should you reconcile survey responses?

Overall, it is up to the researcher or subject matter expert to determine which responses are unrealistic and which respondents to remove from the dataset. We recommend reconciling any respondent who fails these checks. However, if your study has a very low incidence rate, it may be worthwhile to remove respondents who fail two or more checks, but stringently review the data of those who fail only one.

Lucid allows respondents to be reconciled from the date of the complete to the last day of the following month. However, if you do plan to reconcile, we suggest doing so as quickly as possible, as reconciling poor quality completes is advantageous to both you and our supply ecosystem.

Now that you have a more thorough understanding of data cleaning, we hope you’ll implement everything you’ve learned! If you have questions about how to clean your data, please talk to your Lucid representative or contact us for more information. 
