In any kind of data analysis, most of the time is spent on choosing the data set, cleaning it and validating the data.
The three stages of data preparation are as follows:
Choosing the data set – Identifying the variables and the time period on which the model will be trained, tested and validated.
Cleaning the data set – Once the variables and the period are decided, you have to clean the data set of any anomalies it may contain. Removing outliers and treating missing values are some of the common tasks performed at this stage.
Adding meaningful variables – Adding more meaningful variables can add considerable value to your analysis.
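As a rough illustration of the cleaning stage, the sketch below removes outliers and then fills missing values in one numeric column. The interquartile-range rule, the median fill, the function name `clean_numeric_column` and the sample data are all assumptions chosen for the example; the right treatment depends on the data and the business context.

```python
import pandas as pd


def clean_numeric_column(df: pd.DataFrame, column: str, k: float = 1.5) -> pd.DataFrame:
    """Drop IQR outliers in `column`, then fill the remaining gaps with the median."""
    q1 = df[column].quantile(0.25)
    q3 = df[column].quantile(0.75)
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr

    # Keep rows inside the fences; keep missing rows too so they can be imputed.
    keep = df[column].between(lower, upper) | df[column].isna()
    cleaned = df.loc[keep].copy()

    # Assumption: the median is an acceptable stand-in for a missing value.
    cleaned[column] = cleaned[column].fillna(cleaned[column].median())
    return cleaned


# Hypothetical data: one missing income and one extreme value.
raw = pd.DataFrame({"income": [30.0, 32.0, 35.0, None, 31.0, 900.0]})
cleaned = clean_numeric_column(raw, "income")
print(cleaned)
```

Dropping the outlier before imputing keeps the extreme value from distorting the median used to fill the gap.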
Mistakes made while choosing the data set
Past data not captured accurately – Since the analysis is performed on past data and not on the data as it stands today, the variables should also reflect the past. For example, if a person was living in a village at the time and only recently moved to the city, the variable should say village, not city.
Data collection – Data is often collected only for positive outcomes. For example, suppose you receive 100 applications for a visiting card and reject 60 of them. You enter the information for the remaining 40 and issue them cards. When you later need to build a better model for applications, you will need the variables for all 100 applications, not just the 40 that were accepted.
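One way to keep variables pointing at the past is to store a dated history and look up the value that was valid on the date of each observation. The sketch below does this with `pandas.merge_asof`; the tables and the column names (`person_id`, `valid_from`, `residence`, `event_date`) are hypothetical and only mirror the village and city example.

```python
import pandas as pd

# Hypothetical address history: one row per move.
address_history = pd.DataFrame({
    "person_id": [1, 1],
    "valid_from": pd.to_datetime(["2015-01-01", "2023-06-01"]),
    "residence": ["village", "city"],
})

# Hypothetical observations: the date on which each modeled event took place.
observations = pd.DataFrame({
    "person_id": [1],
    "event_date": pd.to_datetime(["2020-03-15"]),
})

# merge_asof requires both frames to be sorted by their time keys.
as_of = pd.merge_asof(
    observations.sort_values("event_date"),
    address_history.sort_values("valid_from"),
    left_on="event_date",
    right_on="valid_from",
    by="person_id",
    direction="backward",  # latest history row at or before the event date
)
print(as_of[["person_id", "event_date", "residence"]])
```

Joining on the current address instead would label this 2020 observation as city, which is exactly the mistake described above.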
Mistakes made while cleaning the data set
Not removing duplicates – If your data set contains duplicate records, they should be removed before you perform any analysis on them. If you leave them in, you end up giving those records extra weight. The example mentioned at the beginning of this article is a special case of this mistake.
Not treating zero, null and special values carefully – How you treat these values can make a huge difference to your model. For example, a spreadsheet or source system may fill in a missing value with zero, or with a placeholder date such as 01/01/1700.
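Both cleaning mistakes can be guarded against with a few lines of pandas. This is a minimal sketch under stated assumptions: the column names are hypothetical, the date 1700-01-01 is assumed to be the placeholder for a missing date, and a zero in `monthly_spend` is assumed to mean "not recorded" rather than a genuine zero. Those assumptions need to be confirmed with whoever owns the data.

```python
import pandas as pd

# Hypothetical export: one duplicated row, a placeholder date and a
# zero that really means "not recorded".
raw = pd.DataFrame({
    "customer_id": [101, 102, 102, 103],
    "signup_date": ["2021-05-03", "1700-01-01", "1700-01-01", "2022-11-20"],
    "monthly_spend": [250.0, 0.0, 0.0, 410.0],
})

# Remove exact duplicate rows so no record is counted twice.
deduped = raw.drop_duplicates().copy()

# Assumption: 1700-01-01 stands in for a missing date, so turn it into NaT.
dates = pd.to_datetime(deduped["signup_date"])
deduped["signup_date"] = dates.mask(dates == pd.Timestamp("1700-01-01"))

# Assumption: a spend of zero means the value was never recorded.
deduped["monthly_spend"] = deduped["monthly_spend"].mask(deduped["monthly_spend"] == 0)

print(deduped)
```

Marking such values as missing makes the gap explicit, so it can be imputed or excluded deliberately instead of being treated as a real observation.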
Mistakes made while transforming the data
Including ID as a variable – Sometimes people use numeric IDs as inputs to their models. Every organization and system has its own way of generating IDs, so using them blindly in modeling will end up giving strange results.
Not being hypothesis driven when creating derived or transformed variables – You need to create meaningful variables based on business understanding and hypotheses. Simply trying out new variables without a hypothesis in mind will end up consuming a lot of time without giving any meaningful gains.
Not spending enough time thinking about transformations – Since data cleaning takes a lot of time, analysts are usually exhausted by the time they reach this stage. As a result, they tend to move on without spending enough time thinking about possible new variables. As mentioned, this may end up hurting the quality of your model significantly. One of the best ways to mitigate this is to start with a fresh mind. So, once you have finished all the data cleaning, take a break, give yourself a pat on the back and start thinking about transformations and derived variables the next day. You will see some good variables emerge if you spend time at this stage.
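The first two points can be illustrated with a short sketch: keep identifier columns out of the model inputs, and add a derived variable only when there is a stated hypothesis behind it. The column names, the `ID_COLUMNS` list and the hypothesis itself are invented for this example.

```python
import pandas as pd

# Hypothetical customer table.
customers = pd.DataFrame({
    "customer_id": [90001, 90002, 90003],
    "total_spend": [400.0, 150.0, 600.0],
    "order_count": [4, 1, 12],
})

# Identifier columns carry no meaning of their own; keep them out of the inputs.
ID_COLUMNS = ["customer_id"]

# Hypothesis (assumed for illustration): average order value says more about
# a customer than total spend or order count taken separately.
customers["avg_order_value"] = customers["total_spend"] / customers["order_count"]

features = customers.drop(columns=ID_COLUMNS)
print(features)
```

Writing the hypothesis next to the derived variable makes it easy to review later and to drop the variable if the hypothesis does not hold.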