How to store Data for your Data Science Process
Learn how to develop an effective data storing strategy…
Data — the new oil :
We have all heard that ‘data is the new oil’. Data science is also one of the most overhyped fields, and many people want to jump straight into it because of a few popular buzzwords. The real challenge is finding a mature approach to following the data science process.
In this article, I am going to cover the first and most fundamental step of a data science process, i.e. data storage.
Data storage in a data science process means keeping the useful data that you will later mine for actionable insights. It is an orderly process in its own right, with several things to consider before you jump to more advanced or fancier work.
Below we are going to discuss those steps in detail. Please follow along with me :
1. Identify your goals :
First of all, if you want to store data for data science, your foremost task is to have a clear strategy for what you save.
Since identifying goals is the first step in the data storage process, you should prioritize it as a Data Scientist, because all the following steps depend on it.
There is a goal-setting framework in the software industry known as OKR (Objectives and Key Results). It was developed by Andy Grove at Intel and introduced to Google’s founding team in 1999 by the well-known venture capitalist John Doerr, and the rest is history. Tech companies such as Amazon, Zalando and Intel still use OKRs for goal setting today.
OKRs give us a clear strategy: measure only the actionable things that matter most. How you prioritize may depend on revenue, scope and your resources. According to the OKR approach, objectives should be clear, concise, concrete, action-oriented and ideally inspirational. The same goes for your goals in the data science process: always choose goals that are actionable and practical.
Measuring KPIs (Key Performance Indicators) may be a good place to start. KPIs have been used for years to make important business decisions. For example, if you run a job application management website, you may be interested in the number of candidates who applied last month and how many were interviewed, hired or rejected. You can then allocate appropriate resources to handle this user base. In short, KPIs are measurable values that are decisive in achieving your business objectives.
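To make the job application example concrete, here is a minimal sketch of how such KPIs could be counted. The record layout, the status names and the `monthly_kpis` helper are all assumptions made up for illustration; a real site would read these records from its own database.

```python
from collections import Counter
from datetime import date

# Hypothetical application records: (candidate_id, date_applied, status).
applications = [
    ("c1", date(2024, 5, 3), "hired"),
    ("c2", date(2024, 5, 10), "rejected"),
    ("c3", date(2024, 5, 21), "interviewed"),
    ("c4", date(2024, 4, 28), "rejected"),  # previous month, filtered out below
    ("c5", date(2024, 5, 30), "applied"),
]


def monthly_kpis(records, year, month):
    """Count applications per status for one calendar month."""
    in_month = [r for r in records if r[1].year == year and r[1].month == month]
    by_status = Counter(status for _, _, status in in_month)
    # The total is the number of candidates who applied in that month.
    return {"applied_total": len(in_month), **by_status}


kpis = monthly_kpis(applications, 2024, 5)
print(kpis)
```

The point of the sketch is that only three fields per application are needed to answer these KPI questions, which already tells you what has to be stored and what can be left out.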
2. Big data or small data :
Once you have a clear-cut goal, the next thing is to decide which type of data you need. This decision depends entirely on your goal and your resources.
Big data is normally data that arrives from multiple sources and has to be stored across several servers. It may come from sources that continuously generate huge volumes. It is usually noisy and unstructured.
Small data is traditional data: structured, usually stored in databases that you maintain yourself, and fully under your control.
For instance, suppose you work in a large organization that does targeted digital marketing. If you want to segment users better, you will most likely need big data. That may involve storing massive social media statistics, machine data, or customer transactions every day.
On the other hand, if you are working on a small-scale project, or on a sample of big data to see how your data behaves overall, then you will be dealing with small or traditional data.
Resource allocation is important in this step, because you may need additional servers to store big data. In the later steps of the data science process you will also need special tools to deal with it. Keep all of this in mind before moving on.
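A rough back-of-the-envelope calculation can help with this resource planning. The sketch below is only an illustration: the function name `estimate_storage_gb` and all the input figures (events per day, bytes per event, retention, number of copies) are invented assumptions, not measurements.

```python
def estimate_storage_gb(events_per_day, bytes_per_event, retention_days,
                        replication_factor=1):
    """Estimate raw storage in gigabytes (1 GB = 10**9 bytes)."""
    total_bytes = (events_per_day * bytes_per_event
                   * retention_days * replication_factor)
    return total_bytes / 1e9


# Hypothetical big data workload: 50 million events a day, kept for a year,
# stored as 3 copies across servers.
big = estimate_storage_gb(50_000_000, 500, 365, replication_factor=3)

# Hypothetical small data workload: 10 thousand events a day, a single copy.
small = estimate_storage_gb(10_000, 500, 365)

print(f"big: {big:,.0f} GB, small: {small:.3f} GB")
```

Even with made-up numbers, the gap between the two results shows why the big data case needs servers and tooling planned in advance, while the small data case fits comfortably in one ordinary database.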
3. Avoid data fatigue :
Data fatigue refers to storing or measuring too much data that is useless to you or doesn’t align with your data collection goals. The most common problem for today’s Data Scientists is noisy or incorrect data.
To address this properly, focus on the needs of the specific problem you are solving and collect data accordingly. A Data Scientist can also use a few tricks to reduce the size of the stored data.
For instance, if you need the latitude and longitude of a place, you can store them in the form of geocodes (a geohash, for example). Geocodes can be decoded using basic packages in Python or R. Depending on the precision you need, this can reduce the size of your data.
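One way to read the geocode suggestion is as a geohash, which packs a latitude and longitude pair into one short string. The sketch below implements the standard geohash scheme in plain Python so that it runs without extra packages; the function names are my own, and in practice you would more likely use an existing geohash package.

```python
_BASE32 = "0123456789bcdefghjkmnpqrstuvwxyz"  # geohash alphabet (no a, i, l, o)


def geohash_encode(lat, lon, precision=9):
    """Encode a latitude/longitude pair as a geohash string."""
    lat_range, lon_range = [-90.0, 90.0], [-180.0, 180.0]
    chars, bits, bit_count = [], 0, 0
    use_lon = True  # bits alternate between longitude and latitude
    while len(chars) < precision:
        rng, value = (lon_range, lon) if use_lon else (lat_range, lat)
        mid = (rng[0] + rng[1]) / 2
        if value >= mid:
            bits = (bits << 1) | 1
            rng[0] = mid  # keep the upper half of the interval
        else:
            bits <<= 1
            rng[1] = mid  # keep the lower half of the interval
        use_lon = not use_lon
        bit_count += 1
        if bit_count == 5:  # every 5 bits become one base-32 character
            chars.append(_BASE32[bits])
            bits, bit_count = 0, 0
    return "".join(chars)


def geohash_decode(code):
    """Return the (lat, lon) centre of the cell described by a geohash."""
    lat_range, lon_range = [-90.0, 90.0], [-180.0, 180.0]
    use_lon = True
    for ch in code:
        idx = _BASE32.index(ch)
        for shift in range(4, -1, -1):
            rng = lon_range if use_lon else lat_range
            mid = (rng[0] + rng[1]) / 2
            if (idx >> shift) & 1:
                rng[0] = mid
            else:
                rng[1] = mid
            use_lon = not use_lon
    return (sum(lat_range) / 2, sum(lon_range) / 2)


code = geohash_encode(57.64911, 10.40744)
lat, lon = geohash_decode(code)
print(code, round(lat, 4), round(lon, 4))
```

The trade-off is precision: a shorter code is smaller to store but describes a larger cell, so the decoded point is only the centre of that cell. Whether this actually saves space depends on how your storage format represents numbers and strings.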
The common steps for avoiding data fatigue are :
I. Don’t forget to ask around for existing processes :
Every company that works with data has some process for managing it. So it is always good to look out for the existing process followed in your company. Starting from scratch is difficult. Study the existing processes and find ways to improve them.
II. Stop thinking about objectives which are not actionable :
If you are sure that a goal is not needed, don’t waste your time on it. As a Data Scientist, it is your responsibility to find out ahead of time what your objectives should be in terms of strategy, innovation and execution.
III. Don’t expect your data storage mechanism to be perfect :
There is always room for improvement in any process, and data storage is no exception. Never blindly expect your process to do everything for you. Keep a human in the loop at times so that they can exercise their intuition, which can lead to further improvements.
IV. Don’t work in isolation :
Never work in silos. Keep your Database Administrators on board. They can help you with architecture-related matters, and you can help them by letting them know about your needs as a Data Scientist.
V. Learn the difference between intelligent filtering and correlation :
Statisticians say that correlation doesn’t imply causation, and they are not wrong. If you have heard that a competitor’s revenue improved because of certain things, it does not necessarily follow that the same process will work for you. In the end you will need to use your own judgment to work out what your needs are and what data you need to meet them.
4. Data management :
The final thing to deal with in the data storage process is whether to use SQL-based databases or NoSQL ones. Both have their own advantages and disadvantages, and each is built for particular kinds of applications.
To make the decision easier, let us limit our discussion to MongoDB and MySQL, which are leading examples of the two types of DBMS. First of all, be clear that neither MongoDB nor MySQL suits every type of data science application.
If we are collecting semi-structured or unstructured data, then MongoDB is usually the better fit. Such data does not map cleanly onto fixed tables, and the complex queries (joins, for example) needed to reassemble it in MySQL tend to be slower. This situation is most common with big data, where processing speed is also a primary concern.
However, if your data is highly structured, or you already know the ins and outs of your data, then MySQL may be the best option. With SQL you can manipulate structured data precisely and make changes relatively quickly, compared with a NoSQL platform like MongoDB.
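The contrast between the two styles can be sketched in a few lines. To keep the example runnable without a database server, Python's built-in `sqlite3` module stands in for MySQL and plain dictionaries stand in for MongoDB documents; both substitutions, along with the table names, field names and sample values, are assumptions made for illustration only.

```python
import json
import sqlite3

# Structured data: a fixed schema, queried with SQL (sqlite3 stands in for MySQL).
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE candidates (id INTEGER PRIMARY KEY, name TEXT NOT NULL);
    CREATE TABLE applications (
        id INTEGER PRIMARY KEY,
        candidate_id INTEGER NOT NULL REFERENCES candidates(id),
        status TEXT NOT NULL
    );
""")
conn.executemany("INSERT INTO candidates VALUES (?, ?)",
                 [(1, "Ada"), (2, "Linus")])
conn.executemany("INSERT INTO applications VALUES (?, ?, ?)",
                 [(1, 1, "hired"), (2, 2, "rejected"), (3, 2, "interviewed")])

# A join reassembles the related rows into one answer per candidate.
rows = conn.execute("""
    SELECT c.name, COUNT(*) AS n_applications
    FROM candidates AS c
    JOIN applications AS a ON a.candidate_id = c.id
    GROUP BY c.name
    ORDER BY c.name
""").fetchall()
print(rows)

# Semi-structured data: each record carries its own fields, as a document
# store would hold it (plain dicts stand in for MongoDB documents).
documents = [
    {"candidate": "Ada", "status": "hired", "skills": ["python", "sql"]},
    {"candidate": "Linus", "status": "rejected", "portfolio": {"projects": 3}},
]
serialized = [json.dumps(doc) for doc in documents]

# Fields may be missing, so queries have to check for them explicitly.
with_skills = [doc["candidate"] for doc in documents if "skills" in doc]
print(with_skills)
```

In the relational half the schema is decided up front and the database enforces it. In the document half nothing stops two records from having different shapes, which is convenient for messy sources but moves the consistency checks into your own code.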
There are also other SQL and NoSQL alternatives that may be more suitable for the applications you are working on, so it is good to check which ones cater to your needs.
For instance, MariaDB offers a wider choice of storage engines than many other relational databases. Through its Cassandra storage engine, NoSQL support has also been available in MariaDB, enabling you to run SQL and NoSQL in a single database system. MariaDB has also supported TokuDB, which can handle big data for large organizations and corporate users. Engine availability varies between MariaDB releases, so check the documentation for the version you use.
My name is Saeed Ahmad and I am a Data Scientist, Machine Learning Engineer and Full Stack Developer. I work on JavaScript, Python and all the related stuff. I love to write about new technologies and trends in the industry.
So, if you have something to discuss with me regarding Data Science, Machine Learning or Full Stack Development, then connect with me. I would be very happy to help you.