Data science projects aim to solve business problems such as sales forecasting, price optimization, energy saving, customer retention, customer segmentation, anomaly detection, and more. The first step in any project is to define the problem as clearly as possible.
We can ask several questions in problem definition:
Unless the problem definition is clear, we cannot tell whether machine learning methods belong in our solution. This is our starting point.
Next, we focus on the features we will use in our solution. We can use a feature in the model only if data is available for it. Some of the questions we usually ask about the data are:
These are basic questions of data quality and availability. If you are using machine learning software equipped with data preprocessing tools, you can answer them quickly. However, perhaps the most important question we need to ask ourselves is: Do we know our data?
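As a minimal sketch of one such availability check, the snippet below computes the share of missing values per column. It assumes rows arrive as dictionaries and that a missing value is either `None` or an absent key; the column names and sample values are hypothetical.

```python
def missing_share(rows, columns):
    """Return the fraction of rows in which each column is missing (None or absent)."""
    if not rows:
        return {column: 0.0 for column in columns}
    return {
        column: sum(1 for row in rows if row.get(column) is None) / len(rows)
        for column in columns
    }


# Hypothetical sample; None or a missing key marks an unavailable value.
rows = [
    {"month": "2023-01", "premium": 100.0, "channel": "agency"},
    {"month": "2023-01", "premium": None, "channel": None},
    {"month": "2023-02", "premium": 80.0, "channel": "online"},
    {"month": "2023-02", "premium": 40.0},
]

shares = missing_share(rows, ["month", "premium", "channel"])
for column, share in shares.items():
    print(f"{column}: {share:.0%} missing")
```

A check like this only tells us whether values are present. It says nothing about whether they mean what we think they mean.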
A data scientist is handed a problem and the data to solve it. She is usually not the data provider and has to rely on the information given to her about the data. To understand the data better, she will ask the business unit and the data provider various questions. Several concerns arise here.
Do the data provider and the business unit agree on the definition of the data? Suppose that the data scientist is working on a problem for the marketing team of a life insurance company, and that one of the features selected for the solution is “monthly premium production”. The data will be provided by the business intelligence (BI) team, which periodically sends the corporate sales team a data table that includes a column named “monthly premium production”. The BI specialist may assume that the required data is the same as the data they report all the time. However, the two teams, marketing and corporate sales, may include different production components in their definitions of “monthly premium production”. One may count only new sales, while the other counts only sales made through certain channels. Such differences usually reflect the specific business needs of each team. If the difference is not obvious, or if the non-matching components happen to be zero for some observations, the BI specialist may not notice the discrepancy and may provide the wrong data set by mistake.
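The following sketch illustrates how two definitions of the same column can agree in one month and diverge in the next. Everything in it is assumed for illustration: the record fields, the sample values, and which team uses which rule (new sales only for marketing, agency-channel sales only for corporate sales).

```python
def monthly_production(rows, include):
    """Sum premiums per month, counting only the rows accepted by `include`."""
    totals = {}
    for row in rows:
        totals.setdefault(row["month"], 0.0)  # keep months with no accepted rows
        if include(row):
            totals[row["month"]] += row["premium"]
    return totals


def marketing_definition(row):
    # Assumed rule: marketing counts only new sales.
    return row["is_new_sale"]


def corporate_sales_definition(row):
    # Assumed rule: corporate sales counts only the agency channel.
    return row["channel"] == "agency"


def find_mismatches(rows, definition_a, definition_b, tolerance=1e-9):
    """Return {month: (total_a, total_b)} for months where the totals differ."""
    a = monthly_production(rows, definition_a)
    b = monthly_production(rows, definition_b)
    return {m: (a[m], b[m]) for m in sorted(a) if abs(a[m] - b[m]) > tolerance}


# Hypothetical policy-level records.
records = [
    {"month": "2023-01", "channel": "agency", "is_new_sale": True, "premium": 100.0},
    {"month": "2023-01", "channel": "agency", "is_new_sale": True, "premium": 50.0},
    {"month": "2023-02", "channel": "agency", "is_new_sale": True, "premium": 80.0},
    {"month": "2023-02", "channel": "online", "is_new_sale": True, "premium": 40.0},
    {"month": "2023-02", "channel": "agency", "is_new_sale": False, "premium": 30.0},
]

mismatches = find_mismatches(records, marketing_definition, corporate_sales_definition)
print(mismatches)
```

In this sample the two definitions give identical totals for the first month, so a spot check on that month alone would not reveal that the columns are defined differently.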
How likely is such a mistake? In the absence of a corporate data dictionary, it is quite likely. If the software that produces, stores, and reports data is in transition, organizations will rely on employees with strong data familiarity and business knowledge, both in the business units and in the BI teams. This is risky when turnover is high. Full agreement between the business and reporting units on data definitions would reduce that risk.
Has the data been used before? Is it what we think it is? It helps to see a previous project that used the provided data. Earlier usage can give us some insight even when we are new to the data. If the dataset, or some of its components, has never been used and is only stored for possible future use, we must be cautious about being the first to use it. Unused data is, to some extent, untested data.
Suppose that, in an energy optimization problem, the data scientist is given data from gate sensors. During the initial setup, electrical technicians mount the sensors at some height above the ground to capture gate activity, opening and closing the gate as they see fit. The BI specialist checks that the sensors produce a signal for the open and closed states at the correct times. Everything seems to work, and the sensor data is written to the database. The business unit assumes that the data shows the total time the gates spent in the open and closed states. However, the sensor positions may not capture every possible state. Users may frequently leave the gate half-open, which is recorded as two consecutive “closed” signals with no “open” in between, because the open-state sensor is mounted too high to detect a half-open gate. Since nobody has used this data before, the data scientist would be the first to discover, if she can, that it captures only part of the information about the gate state. If there is no workaround, the data is useless. What if this feature was expected to be one of the most important in the model, and the data scientist failed to detect the problem in data collection?
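One way to surface this pattern is to scan the event log for a “closed” signal that directly follows another “closed” signal. The sketch below assumes a hypothetical log of `(timestamp, state)` pairs sorted by time; the timestamps and state labels are invented for illustration.

```python
def find_repeated_closed(events):
    """Return (first, second) timestamp pairs where 'closed' was signaled twice in a row."""
    suspicious = []
    previous = None
    for timestamp, state in events:
        if previous is not None and state == "closed" and previous[1] == "closed":
            suspicious.append((previous[0], timestamp))
        previous = (timestamp, state)
    return suspicious


# Hypothetical sensor log, already sorted by time.
events = [
    ("08:00", "open"),
    ("08:05", "closed"),
    ("09:10", "closed"),  # no 'open' since 08:05: the gate may have been half-open
    ("09:30", "open"),
    ("09:45", "closed"),
    ("10:20", "closed"),  # same pattern again
]

repeated = find_repeated_closed(events)
print(repeated)
```

A large number of such pairs would suggest that the open and closed durations derived from this log are incomplete, which is worth raising with the data provider before the feature goes into a model.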
We cannot build a correct model with incorrect data. The best way to make sure we use correct data is to know our data. A data dictionary that both the business units and the data providers agree on would be most useful. If it also records earlier uses of the data, the dictionary would link an organization's past and current projects. It would save us time in the data gathering phase of our machine learning solution.
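As a rough sketch of what such a dictionary might hold, the structure below keeps an agreed definition, the responsible parties, and the projects that have already used each item. The field names, entries, and project names are hypothetical and not a standard schema.

```python
from dataclasses import dataclass, field


@dataclass
class DictionaryEntry:
    name: str
    definition: str   # wording agreed by the business unit and the data provider
    business_owner: str
    data_provider: str
    used_in: list = field(default_factory=list)  # earlier projects that used this item

    def is_untested(self):
        """An item with no recorded earlier use has not been tested in practice."""
        return not self.used_in


# Hypothetical entries.
entries = [
    DictionaryEntry(
        name="monthly_premium_production",
        definition="Sum of premiums from new sales, all channels, per calendar month",
        business_owner="marketing",
        data_provider="BI team",
        used_in=["sales forecasting 2022"],
    ),
    DictionaryEntry(
        name="gate_state",
        definition="Open or closed signal per gate, with timestamp",
        business_owner="facilities",
        data_provider="BI team",
    ),
]

data_dictionary = {entry.name: entry for entry in entries}
untested = [name for name, entry in data_dictionary.items() if entry.is_untested()]
print(untested)
```

Listing the untested items at the start of a project points to the data that deserves the closest inspection.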