Introduction

Data Quality Podcast
===

[00:00:00] Hello and welcome to the very first episode of the Data Quality Podcast. My name is Denis Gontcharov and I will be hosting this show in future episodes as well. I wanted to create a first episode that explains what the show will be about and what you as a listener can expect from future episodes. In that sense, this will really be more of a short introduction.

Right then. I started this podcast to talk about my experience working as a data analyst, and also as a data engineer, with data-intensive applications in the industrial sphere. In this podcast, I will mostly talk about transferring SCADA data to the cloud. SCADA is a core concept in manufacturing systems.

It stands for supervisory control [00:01:00] and data acquisition, and there's currently a very big movement going on, especially among enterprise-scale industrials, of moving this data to the cloud, and in my experience very often to Microsoft Azure. I will also talk a lot about Databricks, a common tool for big data analytics on distributed computing.

And finally, I wanted to mention that unlike in this episode, most of the speaking will not be done just by me: I will invite guests from my network with whom I have worked, people I know from the industry, to share their experience on these topics. In addition to talking about common technologies, I will also focus on core themes adjacent to data science, data analysis, and AI, such as data [00:02:00] infrastructure.

For example, the Unified Namespace is currently making a big splash in the manufacturing industry, but there are also topics such as data quality, where I can dive into newer tools such as Soda Core or Great Expectations. I realize that I probably can't capture an exhaustive list at this point in time, so it's quite likely that the subjects will fluctuate over the coming months as I do this podcast.
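To give a flavor of what tools like Soda Core or Great Expectations do, here is a minimal sketch in plain Python of the kind of declarative checks they let you express over a dataset. The sensor readings, column names, and rules below are invented for illustration; this is not the API of either tool.

```python
# Minimal sketch of declarative data quality checks, in the spirit of
# tools like Soda Core or Great Expectations (plain Python, no library).
# The records and rules below are illustrative, not from a real plant.

def check_not_null(rows, column):
    """Pass only if no row has a missing value in the given column."""
    return all(row.get(column) is not None for row in rows)

# A few sensor readings, as a SCADA historian might export them.
readings = [
    {"timestamp": "2024-01-01T00:00:00", "temp_c": 955.0},
    {"timestamp": "2024-01-01T00:01:00", "temp_c": 957.5},
    {"timestamp": "2024-01-01T00:02:00", "temp_c": None},  # bad record
]

checks = {
    "timestamp is never null": check_not_null(readings, "timestamp"),
    "temp_c is never null": check_not_null(readings, "temp_c"),
}

for name, passed in checks.items():
    print(f"{name}: {'PASS' if passed else 'FAIL'}")
```

The real tools add scheduling, reporting, and a rich vocabulary of checks on top, but the core idea is the same: state expectations about the data and fail loudly when they are violated.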

Now, as for the second part of this episode, I think it makes a lot of sense to focus a bit more on why you should even listen to me, and I hope you will find the story of how I got into data compelling. In fact, I did not set out to be a data professional when I began my career, or even when I began my university education.

I studied materials engineering in Belgium at the University of Leuven, where I [00:03:00] was very focused on the production and recycling of non-ferrous metals. I was always very passionate about chemistry and physics and the transformation of materials. But then came my first job, in an aluminum production plant, also known as an aluminum smelter.

Such a plant essentially produces pure liquid aluminum from alumina via the Hall-Héroult process. That's where I first came into contact with the whole field of data and data analytics. In that internship, I had to apply a lot of statistics and data analysis, and for this I used the programming language known as R, like the letter R.

I actually learned R at university in Grenoble, during my exchange in my final year. So this is really the story of how I got involved in data. I was essentially querying databases by writing [00:04:00] SQL queries the whole time, and then making beautiful graphs with R, which R is very good at, and essentially this planted the seeds of what would become my passion for data. That company then hired me, and I worked there for about a year and a half, in Essen, Germany. But very quickly I realized that instead of programming process control systems for the aluminum industry, I was a lot more passionate about analyzing data with Python. At that point, I was also experimenting with machine learning on Kaggle, the data science competition website. I think it's still pretty active, though the community has died down. I also built my own workstation for GPU training of deep learning models, which at that time was in full swing.

I'm talking about around 2018, when you had the vision models, the [00:05:00] ConvNet models and so on, and this all made me eventually abandon my career in the materials world and go for a data science position in Frankfurt, at a data science consultancy firm called Statworx. It was a good experience indeed: I managed to deepen my Python skills and work with interesting colleagues, but very quickly I realized that I wanted to be self-employed.

And at that time I was lucky that one of my former colleagues at TRIMET needed my help with a data analysis app, because he had seen my work in R and he needed something that allowed him to analyze data visually. And because he didn't want to spend a lot of time building it himself, he hired me to build it for him, and that was really my first self-employment assignment, which I took on alongside my job.

And I really enjoyed this experience of building something and getting paid for a result, on my own time and on my own schedule. [00:06:00] The freedom felt very liberating, and I've been trying to be self-employed ever since. In fact, I quit my data science job after six months, during my probation period, and went out to work as a freelancer, or better said, as a contractor.

At that time, I was living in Frankfurt in Germany, so I moved back to Belgium, to my hometown of Antwerpen, and opened an LLC. That was in 2021, and I essentially tried to find clients for R development, which was of course very difficult. I essentially jumped into freelancing head-on without knowing how to do it.

I had no idea about running a business or doing marketing; I thought I would just be writing R code. I was actually quite lucky in the sense that a different tool was coming up at that time: Apache Airflow.

Apache Airflow is a tool for workflow orchestration that's commonly [00:07:00] used to schedule data pipelines for batch processing. I was quite fascinated by it and wrote some blog posts about it, and by accident I found a job that required someone to build data workflows in Apache Airflow, which was something I could do.
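The core idea behind an orchestrator like Airflow can be sketched in a few lines: tasks form a directed acyclic graph, and each task runs only after its upstream dependencies have finished. The toy below illustrates that dependency model in plain Python using the standard library; the task names are invented and this is not Airflow's actual API.

```python
# Toy sketch of the DAG idea behind workflow orchestrators such as
# Apache Airflow: each task runs only after its dependencies finish.
# Task names are invented; real Airflow has its own DAG/Operator API.

from graphlib import TopologicalSorter

# Each task maps to the set of tasks that must complete before it.
dag = {
    "extract_scada": set(),
    "clean_data": {"extract_scada"},
    "load_to_cloud": {"clean_data"},
    "run_quality_checks": {"clean_data"},
}

# A valid execution order: extraction first, cleaning next, then the rest.
order = list(TopologicalSorter(dag).static_order())
print(order)
```

On top of this dependency resolution, Airflow adds scheduling (run the whole graph every hour or every night), retries, and monitoring, which is what makes it useful for batch data pipelines.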

So I applied, got the job as a freelancer, and completed a project for that particular startup, which was, let's say, my second self-employment job. I did that kind of work for two years. Eventually, I landed a pretty big contract at Johnson & Johnson, which was a step away from manufacturing. There I worked with a data science framework for machine learning called Kedro, developed by QuantumBlack, a consultancy in London.

I eventually switched from that project back to full-time employment. More specifically, I was [00:08:00] headhunted by a recruiting agency in Germany that was looking for a data engineer who could help Novelis, a world leader in aluminum rolling and recycling, set up their data pipelines.

This job really caught my attention, even though it was permanent employment instead of a freelance contract. It was in aluminum, my previous field of interest, and it was also in data. To me, it seemed to combine both of my experiences, which prompted me to try bringing them together and see if there was merit in that. So I closed my business, relocated to Berlin, and worked at that company for nearly two years. I quickly realized that this was a mistake, that giving up self-employment was a bad decision, at least for me, [00:09:00] which is why I left that job in the middle of 2023 and started to work for myself again.

I wanted to see if I could combine my experience in industry and manufacturing, with aluminum, with my experience in data, and quickly settled on a trend that had been coming up since, I would say, the beginning of 2020, called the Unified Namespace, popularized by Walker Reynolds. I tried to find clients via the United Manufacturing Hub, a startup in Cologne, Germany, that developed digital infrastructure based on an event-driven architecture for manufacturers.

It involved technologies such as MQTT, HiveMQ, TimescaleDB, Node-RED, and Kafka or Redpanda: in essence, modern IT solutions for an age-old industry like manufacturing. And that's where I'm [00:10:00] currently at. I do have to say that lately I have been more interested in coming back to the traditional enterprise data engineering field, because I see a big movement of large companies trying to move data from their SCADA systems, which essentially contain massive amounts of time series, to the cloud. I think my experience in industry and my knowledge of these technologies can be a big asset there, which is why I'm currently more focused on Databricks and Microsoft Azure.
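For listeners new to the Unified Namespace: the central idea is a single hierarchical topic tree, usually modeled on an ISA-95-style plant hierarchy, into which every system publishes its events over a broker such as MQTT. A minimal sketch of building such a topic path, with an invented plant hierarchy and no real broker involved:

```python
# Minimal sketch of a Unified Namespace topic path. UNS implementations
# commonly use an ISA-95-style hierarchy (enterprise/site/area/line/tag)
# as the MQTT topic structure. The hierarchy below is invented.

def uns_topic(enterprise, site, area, line, tag):
    """Build a hierarchical MQTT-style topic for one data tag."""
    return "/".join([enterprise, site, area, line, tag])

topic = uns_topic("acme-metals", "essen", "potline-1", "cell-042", "temperature")
print(topic)  # acme-metals/essen/potline-1/cell-042/temperature
```

Because every producer publishes into this one shared tree, any consumer (a historian, a dashboard, a cloud pipeline) can subscribe to exactly the slice of the plant it cares about, which is what makes the architecture event-driven.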

So that's me in a nutshell. The first six years of my career were quite turbulent, if I say so myself. I don't think I've ever worked anywhere longer than sixteen months, which was my first job. Nevertheless, I always felt I had a very strong learning curve, and I've essentially stayed in the field of data for those six years.

So even though I was doing a bit of data engineering in one job and some machine [00:11:00] learning model training in a different job, it was always related to processing data. And the main thing I see, and the main reason why I want to focus on data quality, is that the main culprit behind any objective we tried to achieve was bad data.

We talk about machine learning, and today we talk about AI, but no one ever talks about the messy data it has to be built with. So I think data quality is a very big problem, which is why we also have movements like data-centric AI, which I won't get into here. But this is essentially what my focus will be for this podcast and for future episodes.

I think that's it for now. It's a bit of a messy first episode, if I say so myself. The main thing I wanted to convey to you, my dear listener, is the topic of this podcast and the reason why you should listen to me. So without taking more of your time, [00:12:00] let me say goodbye for now. I really hope to welcome you in future episodes, where it won't be me but one of my dear guests doing the speaking.

So that's it for now. Thank you very much for listening to the Data Quality Podcast, and I'll see you in the next episode. Bye-bye.
