One of the purposes of the Data Science group that I run is to work through a real Data Science project in practice, from framing the problem to delivering a working, production-ready solution. The tricky part of such projects is that they draw on skills from different areas. Even if you focus on the data engineering role, you need presentation skills and domain knowledge along with programming and applied statistics. That is a lot of information to deal with and, ultimately, too much for an easy start. Many people give up on a career change because of this. But if you still feel up to trying something new, I would appreciate it if you joined our group, where I have prepared a plan that covers every important aspect and is especially useful for future Data Engineers.
A new session of the group starts soon, and you can choose a track from the list mentioned here. The group's activity is built around learning the necessary topics on your own and meeting up once a month. During the month we can also get together to work on topics, discuss obstacles, and sort out issues. You can think of it as a project you take part in, but in the background. Along the way we will decide on the actual steps for going through the plan, which you can find below.
Whichever track you choose, we should move together as a group. This is why the main stages are important for synchronization. I'll call them “modules”. Within each module we'll have sections, which are more likely to be changed or omitted if we decide so.
Module Structure
Each module represents one problem that could be part of a real project, and its sections represent the steps toward a solution for that problem. At the beginning of each section we will have an introductory overview of the steps and the goals we're going to achieve. At the end of each module we'll have one or more colloquiums to share our findings, discuss obstacles, and train presentation skills.
Data Preparation Module
There is a rule in Data Science: “bad data in, bad data out”. It means that however advanced your statistical model may be, it will produce bad results if you train it on and feed it bad data. Good data is a key factor of success in Data Science projects, and usually up to 80% of all project time is spent on obtaining a correct data set, cleaning up errors, and so on. At this stage programming skills and domain knowledge play the key role. We'll touch on the basics of Python, pandas, and other packages for working with data, visualizing correlations between features, and so on.
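To give a feeling for what this stage looks like in code, here is a minimal sketch using pandas. The data, the column names, the `prepare` helper, and the rule that ages outside 0 to 120 are errors are all made up for illustration; a real project would take its cleaning rules from domain knowledge.

```python
import pandas as pd

# Hypothetical raw data: a duplicated row, a missing value, and an impossible age.
raw = pd.DataFrame({
    "age": [25, 32, 47, 47, None, 51, 230],
    "income": [30000, 42000, 58000, 58000, 39000, 61000, 45000],
})


def prepare(df: pd.DataFrame) -> pd.DataFrame:
    """Drop exact duplicates, rows with missing values, and implausible ages."""
    cleaned = df.drop_duplicates().dropna()
    # Assumption for this sketch: ages outside 0..120 are data-entry errors.
    cleaned = cleaned[cleaned["age"].between(0, 120)]
    return cleaned.reset_index(drop=True)


clean = prepare(raw)
# Pairwise Pearson correlation between the remaining numeric features.
correlation = clean.corr()

print(clean)
print(correlation)
```

The correlation table is the kind of result we would then visualize, for example as a heat map, to see which features move together.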
Model Selection Module
There is no silver bullet in Data Science: no single advanced model gives good results all the time. In some cases you need to use several models in conjunction to produce quality results. At this stage we will take an overview of different models and of how to validate and choose among them. This stage relies mostly on your knowledge of applied statistics and your presentation skills. We'll try different models and the Python packages for working with them, and prepare a presentation whose purpose is to show why a specific model is a good one for the problem.
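One common way to validate and choose between models is k-fold cross-validation: each model is trained on part of the data and scored on the part it has not seen. The sketch below uses only the Python standard library to keep the idea visible. The data set, the two candidate models, and the helper names (`fit_mean`, `fit_line`, `cross_val_mse`) are assumptions made for illustration; in the group we would more likely rely on an existing package for this.

```python
import random
import statistics

# Hypothetical data set: y is roughly 2*x + 1 with a little noise.
random.seed(0)
xs = [float(i) for i in range(20)]
ys = [2.0 * x + 1.0 + random.uniform(-1.0, 1.0) for x in xs]


def fit_mean(train_x, train_y):
    """Baseline model: always predict the mean of the training targets."""
    mean_y = statistics.fmean(train_y)
    return lambda x: mean_y


def fit_line(train_x, train_y):
    """Simple linear regression fitted by ordinary least squares."""
    mean_x = statistics.fmean(train_x)
    mean_y = statistics.fmean(train_y)
    sxx = sum((x - mean_x) ** 2 for x in train_x)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(train_x, train_y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return lambda x: slope * x + intercept


def cross_val_mse(fit, xs, ys, k=5):
    """Average mean squared error over k held-out folds."""
    indices = list(range(len(xs)))
    random.Random(42).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]
    fold_errors = []
    for fold in folds:
        held_out = set(fold)
        train = [i for i in indices if i not in held_out]
        model = fit([xs[i] for i in train], [ys[i] for i in train])
        errors = [(model(xs[i]) - ys[i]) ** 2 for i in fold]
        fold_errors.append(statistics.fmean(errors))
    return statistics.fmean(fold_errors)


scores = {
    "mean baseline": cross_val_mse(fit_mean, xs, ys),
    "linear regression": cross_val_mse(fit_line, xs, ys),
}
# The model with the lowest held-out error is the one we would argue for.
best_model = min(scores, key=scores.get)

print(scores)
print(best_model)
```

Keeping a trivial baseline in the comparison is a deliberate choice: it gives the presentation a reference point for how much a more advanced model actually improves the result.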
Machine Learning Pipeline Module
No real project produces a solution that stays the same forever, and Data Science projects are no exception. The model we found at the previous stage and the data preparation approach we found at the first stage should now be connected in a pipeline. This pipeline will use a constantly updated data set to update the model and upload it somewhere it can be used by others. This stage relies heavily on your skills as an engineer and programmer. We'll try Python packages for building ETL pipelines, such as Apache Airflow, and automate all phases of producing a new model.
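The shape of such a pipeline can be sketched in plain Python before any orchestration tool is involved. Everything below is hypothetical: the `extract`, `prepare`, `train`, `publish`, and `run_pipeline` functions, the tiny data set, and the JSON file used as the "somewhere" the model is uploaded to. It does not use the Airflow API; with a tool like Airflow, each function would roughly correspond to one scheduled step.

```python
import json
import statistics
import tempfile
from pathlib import Path


def extract():
    """Stand-in for loading the latest data; returns hypothetical (x, y) rows."""
    return [(1.0, 3.1), (2.0, 4.9), (None, 7.0), (3.0, 7.2), (4.0, 8.8)]


def prepare(rows):
    """Data preparation step: drop rows with missing values."""
    return [(x, y) for x, y in rows if x is not None and y is not None]


def train(rows):
    """Model step: fit y = slope * x + intercept by least squares."""
    xs = [x for x, _ in rows]
    ys = [y for _, y in rows]
    mean_x, mean_y = statistics.fmean(xs), statistics.fmean(ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in rows)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    return {"slope": slope, "intercept": mean_y - slope * mean_x}


def publish(model, target: Path) -> Path:
    """Publishing step: write the model parameters where others can read them."""
    target.write_text(json.dumps(model))
    return target


def run_pipeline(target: Path) -> Path:
    """Run every phase in order; a scheduler would call this on each data update."""
    return publish(train(prepare(extract())), target)


model_path = run_pipeline(Path(tempfile.mkdtemp()) / "model.json")
published = json.loads(model_path.read_text())
print(published)
```

Keeping each phase as a separate function with plain inputs and outputs is what makes it straightforward to hand the steps over to an orchestration tool later.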
What's Next?
We'll move from simple tasks to more complicated ones. Each stage will increase in complexity. For example, we will touch on the basics of neural networks and use existing tools, and we'll build pipelines with more advanced tools. Resources are limited, though, and we as a group will decide on the next steps. There are no dates for finishing each of the modules.