Kubeflow Series 4: The Art of Data Engineering: Extract, Analyze, and Prepare!

Previously on Evil Tux

Our last post explored the importance of scoping and defining a problem before diving into any machine-learning project. We discussed how asking the right questions—like “Is the gold worth the effort?”—is essential to ensure that your ML project will deliver meaningful results. We also touched on the model development lifecycle, which takes raw data from the “mines to the market” to solve real-world problems. Kubeflow supports this journey by simplifying the model development process, but teams must first define the problem they want to solve before using it. Now that we’ve laid the groundwork, it’s time to dive into the next stage: preparing data for your ML models. This journey begins with extracting your data from the environments around you, then moves on to analyzing it, and ends with preparing it for training. Let’s dive in!

The Data Extraction Stage

This is where the data wrangling begins. Data extraction is where we start to gather data from the real world. This data could be in accounting spreadsheets or come from high-end “BioData Harvesters” such as flow cytometers. Either way, we need that data! You might have heard the saying, “Data is the new oil.” Well, that rings especially true here. However, like drilling for oil (or mining gold), extracting data comes with challenges and considerations:

  • Access to data: Do we have the right tools to strike rich veins of high-quality information? Securing these crucial resources, whether buried deep in proprietary databases or scattered across public sources, is essential for our model’s success.
  • Data gathering methods: How are we panning for this gold? Are we relying on established sources (existing datasets), negotiating deals with domain experts (data partnerships), or deploying our prospectors (custom data collection) to extract valuable insights firsthand?
  • Volume of data: Do we have enough raw material to fill our bags? Is our dataset extensive and diverse enough to cover a wide range of scenarios, ensuring our model generalizes effectively? Without enough balanced data, we risk ending up with a model that looks shiny but crumbles under pressure. We may need to refine our data using techniques like SMOTE (Synthetic Minority Oversampling Technique) or other methods to balance our haul.
  • Data freshness: How recent is our strike? The value of our insights depends on how current the data is. If we’re working with outdated information, it’s like trying to mine a claim that’s already been picked clean.
  • Data format and accessibility: Is our gold ready for minting or trapped in the rough? Ensuring our data is properly structured and easily accessible is crucial for a smooth journey through the model development stages, from raw nuggets to polished outputs.
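On the volume-and-balance point above: in practice you would reach for SMOTE via the imbalanced-learn library, which synthesizes new minority-class points between neighbors. As a dependency-free sketch of the underlying idea, here is the simpler cousin, random oversampling of the minority class (all names here are illustrative):

```python
import random
from collections import Counter

def oversample_minority(X, y, seed=42):
    """Naively balance a binary dataset by duplicating minority-class rows.
    SMOTE goes further, synthesizing new points between nearest neighbors."""
    rng = random.Random(seed)
    counts = Counter(y)
    minority = min(counts, key=counts.get)
    majority = max(counts, key=counts.get)
    deficit = counts[majority] - counts[minority]
    minority_rows = [(xi, yi) for xi, yi in zip(X, y) if yi == minority]
    extra = [rng.choice(minority_rows) for _ in range(deficit)]
    X_bal = list(X) + [xi for xi, _ in extra]
    y_bal = list(y) + [yi for _, yi in extra]
    return X_bal, y_bal

# Four "not duck" examples and one "duck" — a badly imbalanced haul.
X = [[0.1], [0.2], [0.3], [0.4], [0.9]]
y = [0, 0, 0, 0, 1]
X_bal, y_bal = oversample_minority(X, y)
print(Counter(y_bal))  # each class now has 4 examples
```

Duplicating rows is the bluntest tool in the kit; it can encourage overfitting to the repeated examples, which is exactly why interpolation-based methods like SMOTE exist.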

In the end, we need to gather the data and store it so downstream users can consume it as they expect, which adds a lot of complexity. Extracting data takes a lot of time and effort. So much time and effort, in fact, that we won’t bog down this blog with the details, but, trust me, extracting data is an art, and the data never comes out as clean as we’d like. Luckily for us, we have data engineers, who deserve a hearty handshake and a consenting hug. They’ve earned it! Now, after we’ve extracted the data through the magic of data engineering, it’s time to analyze it.

Data Analysis Stage

This stage is a deep exploration of our dataset’s core, where we dig deep into the heart of our data to uncover the valuable features that will guide our model’s learning process. Just like a seasoned prospector evaluating a potential strike, we must ask critical questions that shape the journey ahead:

  • Data relevance: Does our dataset contain the right mix of “gold nuggets” to train our model effectively? Just as a prospector can’t keep panning the same vein, we need a diverse dataset that reflects real-world variety.
  • Feature importance: Which parts of our dataset hold the most value, like identifying the richest veins in the mine? Determining key features—patterns, textures, or other characteristics—is vital. 
  • Data cleaning: What impurities or debris need to be sifted out? Just as a miner removes dirt and gravel to focus on pure gold, we must eliminate anomalies, duplicates, and inconsistencies. Poor labeling or bad data can be the “fool’s gold” in our pipeline, leading us astray. Labeling guides are like blueprints for the refining process—essential to ensure every nugget counts and doesn’t introduce bias into our final product. To clarify, “bad data” is a complex, context-dependent topic: it isn’t simply wrong or inaccurate data, since “wrong” is relative to the problem you are solving. Data that doesn’t align with the specific problem or model at hand can lead to poor or misleading outcomes, even if that same data would be accurate and high-quality in a different context.
  • Feature engineering: Are there hidden gems that require refining? Sometimes, the most valuable features aren’t apparent at first glance. This step is like extracting gold from ore that doesn’t sparkle on the surface. We may need to develop or extract specific patterns, behaviors, or signals that greatly enhance our model’s learning, ensuring it can separate the gold from the gravel.
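To make the cleaning step above concrete, here is a minimal sketch in plain Python (the record fields and the plausible-value range are illustrative assumptions, not from any particular dataset): deduplicate records and set aside values that fall outside the range the problem domain allows, rather than silently discarding them.

```python
def clean_records(records, lo=0.0, hi=100.0):
    """Drop exact duplicates and set aside out-of-range readings.
    'Out of range' is context-dependent: lo and hi must come from the
    problem domain, not from the data itself."""
    seen, cleaned, rejected = set(), [], []
    for rec in records:
        key = (rec["id"], rec["value"])
        if key in seen:
            continue                 # exact duplicate, skip it
        seen.add(key)
        if lo <= rec["value"] <= hi:
            cleaned.append(rec)
        else:
            rejected.append(rec)     # keep for human review
    return cleaned, rejected

records = [
    {"id": 1, "value": 42.0},
    {"id": 1, "value": 42.0},   # duplicate row
    {"id": 2, "value": -7.0},   # impossible reading for this domain
    {"id": 3, "value": 88.5},
]
cleaned, rejected = clean_records(records)
print(len(cleaned), len(rejected))  # 2 1
```

Note the design choice of returning the rejects instead of dropping them: what looks like fool’s gold under one problem definition may be a nugget under another, so cleaning decisions should stay auditable.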

Recently, I attended a Denver MLOps meetup where we discussed the pains of feature engineering and model training from the perspective of Tortuga AgTech. This Denver-based company designs and operates the world’s largest fleet of autonomous harvesting robots. One problem they come across is finding quality fruit to pick. Solving this problem meant training the model on all sorts of datasets and evaluating the quality of predictions that determine what ripe fruit they will be paid for and what is worth leaving on the vine! It is a complicated task, but it shows the importance of finding varied datasets and analyzing their impact. We may need to gather more data before moving on to the data preparation stage or revert to the extraction step if we fail to get the desired outcomes! Still, Tortuga managed to do it using focused models and fast iteration (part of the seemingly growing “topic for another time” list). Data quality and determining what insights we can derive from data are the key takeaways from this step. After we’ve analyzed the data for quality and gaps, we can begin to prepare our data!

Data Preparation Stage

In the lifecycle of a machine learning project, this stage is where the raw dataset is refined and processed into a structured format ready for model training. Just as gold must be separated from ore and impurities, our data undergoes a crucial transformation. Here’s the process:

  • Dataset division: Our treasure trove of data is meticulously divided into different “claims”: training, validation, and test sets. The training set is our primary mine, rich in examples that the model will sift through to learn patterns. The validation set is our quality checkpoint, helping fine-tune our process and preventing “over-polishing” (overfitting). Finally, the test set is like a new claim—an unseen territory where we assess how well our refined model handles fresh challenges.
  • Feature selection and cleaning: Within the vast expanse of raw data lie the nuggets with real value—specific features crucial for accurate predictions. This stage involves sifting out impurities, such as mislabeled entries or irrelevant features, leaving behind a dataset rich with valuable characteristics. If deep learning is our approach, the model may automatically detect these features, but this process remains vital in many machine learning workflows.
  • Scaling and encoding: Just as gold ore needs to be processed consistently, our data must be standardized. Scaling adjusts varying data points to a standard format, ensuring the model interprets them uniformly. Additionally, encoding transforms categorical data into a form the model can process, like converting raw gold into a usable currency.
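The three steps above can be sketched end to end. In practice you would reach for scikit-learn’s train_test_split, StandardScaler, and OneHotEncoder, but the mechanics look roughly like this stdlib-only version (function names and split ratios are illustrative):

```python
import random
from statistics import mean, stdev

def split(data, seed=0, val=0.15, test=0.15):
    """Shuffle once, then carve off validation and test 'claims'."""
    rows = list(data)
    random.Random(seed).shuffle(rows)
    n_test, n_val = int(len(rows) * test), int(len(rows) * val)
    return rows[n_test + n_val:], rows[n_test:n_test + n_val], rows[:n_test]

def zscore(values):
    """Standardize a numeric column to mean 0, unit variance (scaling)."""
    mu, sigma = mean(values), stdev(values)
    return [(v - mu) / sigma for v in values]

def one_hot(labels):
    """Encode categories as indicator vectors (encoding)."""
    cats = sorted(set(labels))
    return [[1 if lab == c else 0 for c in cats] for lab in labels]

train, val, test = split(range(20))
print(len(train), len(val), len(test))   # 14 3 3
print(zscore([1.0, 2.0, 3.0]))           # [-1.0, 0.0, 1.0]
print(one_hot(["duck", "chicken", "duck"]))
```

One subtlety the sketch glosses over: the scaler’s mean and standard deviation must be computed on the training set only and then applied to the validation and test sets, otherwise statistics from the “unseen territory” leak into training.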

In essence, once we’ve explored data, we need to process it and divide it up for teams to use. This data must also be versioned and stored so we can easily access it, run jobs on it, and ensure reproducible and replicable results. We won’t dive into the concepts of reproducibility and replicability in depth (yet), but below, I’ve got a nice graphic that may help.
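On versioning: dedicated tools such as DVC or lakeFS handle this properly, but the core idea is small enough to sketch. A content fingerprint lets a training run record exactly which data it saw, so a later run can verify it is reproducing against the same dataset (the function and filenames below are hypothetical):

```python
import hashlib

def dataset_fingerprint(rows):
    """Hash the dataset contents so a training run can log exactly which
    data version it used. A lightweight stand-in for tools like DVC."""
    h = hashlib.sha256()
    for row in rows:
        h.update(repr(row).encode("utf-8"))
    return h.hexdigest()[:12]

v1 = dataset_fingerprint([("img_001.png", "duck"), ("img_002.png", "chicken")])
v2 = dataset_fingerprint([("img_001.png", "duck"), ("img_002.png", "duck")])
print(v1 != v2)  # relabeling a single image yields a new fingerprint
```

Logging this fingerprint alongside the model’s metrics is a cheap first step toward the reproducibility the graphic below describes: same data version plus same code should mean same results.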

On one side of the graphic, reproducibility is defined as testing the same dataset of images to achieve the same accuracy and performance metrics. It is represented by a gray chicken and gray ducks funneling through a filter into a duck classifier model, which separates the ducks from the “not duck” (the chicken). The other side defines replicability as testing the model with new, unseen datasets and expecting similar performance levels. It is represented by a purple chicken and colorful ducks traveling through the filter into the classifier, which places the purple chicken in the “not duck” category and the colorful ducks in the “duck” category.
Reproducibility vs Replicability

This stage aims to prepare our data for training and validation so we can create and iterate on models reproducibly.

What’s Next

In this blog, we delved into the critical stages of the machine learning lifecycle, from data extraction and analysis to preparation and beyond. Each step is vital, requiring care and precision to turn raw data into a valuable resource that powers our models. Whether it’s curating the right data, training models to make informed predictions, or ensuring they perform consistently in real-world applications, each stage in the process lays the foundation for success.

Next up, we’ll explore the Model Training Stage, where all our preparation pays off, and we start transforming raw data into something valuable. We’ll discuss how tools like PyTorch guide our models through the learning process and why a data-centric approach to AI helps refine and focus the learning process. This is where we begin to turn data into actionable insights and solutions that can scale.

About the Author

Chase Christensen is a machine learning solutions engineer who lives at the intersection of business value and technical execution. He’s not just here to talk about what’s possible—he’s focused on making it real. With a background in open source and a hands-on approach, Chase works alongside teams to connect real-world problems to the right ML tools and infrastructure. He’s fluent in both the boardroom and the terminal, but knows that business problems are easy to spot—it’s delivering the solution that counts.

You might also like: