A Noble Quest: Creating a Predictive Model with Health Data - Part 1

The world loves machine learning. What’s not to like about a model that can ‘magically’ pick the perfect show for me to watch or predict the next few words I’m about to type? If applications can make decisions based on my data, I figured, why can’t I do the same? From this realization, I began to wonder: could I create a model using my Apple Health data? Time to find out and blog about my journey!

I’ve probably been playing too many video games, but as I went through the data modeling process, I realized that it follows a similar flow to an RPG. I thought I could help demystify the process by framing my data exploration and modeling as a video game. Every good RPG needs some sort of overarching quest to follow. Time to power on the (Python) console and get started.

As with any great RPG, there needs to be a background story. In this case, I had a problem to solve: what trends in fitness and health data could I detect in my own dataset? After some internet searching, I came across an article from the Sleep Foundation. The article claims that working out can lessen the time you are awake during your sleep cycles[i]. Dramatic music starts playing. I’ve found my quest! However, I can’t start my quest without first building my character. In this case, the character is my dataset.

My Main Quest:

Find out whether my data reflects the claim and whether I can predict a good or bad night’s sleep based on my workout split for the day.

My Dataset:
  • Apple Health XML file, imported using ElementTree. I extract the workout and sleep-related data from the file.
  • CSV file of my gym check-ins
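
For readers who want to try the import step, here is a minimal sketch of reading an Apple Health export with ElementTree. It is not the exact code behind this post. It assumes the export uses `Workout` elements and `Record` elements whose `type` is `HKCategoryTypeIdentifierSleepAnalysis`, and the inline sample XML, the `export.xml` filename, and the `extract_health_rows` helper are illustrative.

```python
import xml.etree.ElementTree as ET

# Tiny stand-in for an Apple Health export; a real file is far larger.
# Element and attribute names are assumptions about the export layout.
SAMPLE_XML = """<HealthData>
  <Workout workoutActivityType="HKWorkoutActivityTypeYoga" duration="45.5"
           durationUnit="min" startDate="2023-03-01 18:00:00 -0600"
           endDate="2023-03-01 18:45:30 -0600"/>
  <Record type="HKCategoryTypeIdentifierSleepAnalysis"
          value="HKCategoryValueSleepAnalysisAwake"
          startDate="2023-03-01 23:10:00 -0600"
          endDate="2023-03-01 23:14:00 -0600"/>
  <Record type="HKQuantityTypeIdentifierStepCount" value="120"
          startDate="2023-03-01 09:00:00 -0600"
          endDate="2023-03-01 09:05:00 -0600"/>
</HealthData>"""

SLEEP_TYPE = "HKCategoryTypeIdentifierSleepAnalysis"


def extract_health_rows(root):
    """Return (workouts, sleep_records) as lists of attribute dicts."""
    workouts = [dict(el.attrib) for el in root.iter("Workout")]
    sleep = [
        dict(el.attrib)
        for el in root.iter("Record")
        if el.get("type") == SLEEP_TYPE  # skip steps, heart rate, etc.
    ]
    return workouts, sleep


# For a real file (hypothetical path): root = ET.parse("export.xml").getroot()
root = ET.fromstring(SAMPLE_XML)
workouts, sleep = extract_health_rows(root)
print(len(workouts), "workout(s),", len(sleep), "sleep record(s)")
```

Each row is a plain dictionary of XML attributes, so it can be handed to whatever table library you prefer for the cleaning steps.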

Now that I’ve built the base of my character, I can start to explore the world. I have my main quest, but the world is vast and side quests are already popping up. I should pursue some of them to level up before I go on my main quest. In many open-world RPGs, side quests and exploring take up most of your play time, and your decisions here can impact you later in the game. Similarly, if you talk to a data scientist, they will usually tell you that data cleaning and exploration take about 80% of the entire model creation process. Furthermore, during data exploration and cleaning you must make judgments about how you want your final dataset to look while considering the impact those choices could have on your model.

In the case of my data set, I will level it up by performing data cleaning and exploration. I have some basic frameworks to follow to do this, but as I work through them I may start to notice other patterns and leads I want to follow. But for now, let’s start with some basic data cleaning tasks.

Side Quest 1: Clean Workout Data – Determine Relevant Data

To start, I need to get an understanding of my data. What kind of workouts are in my dataset and what’s the frequency? I will look at my summary workout counts by type.

[Screenshot: summary of workout counts by type]

I see that there are activities that I’ve only recorded a few times or are seasonal. Since these aren’t workouts I do day to day, I want to exclude them from this model, leaving me with FunctionalStrengthTraining, walking, yoga, TraditionalStrengthTraining, and tennis.
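
As a rough illustration of this step, the sketch below counts workouts by type and keeps only the five day-to-day activities named above. The sample rows, the field names, and the exact casing of the activity names are assumptions for the example.

```python
from collections import Counter

# Hypothetical sample rows; in practice these come from the parsed export.
workouts = [
    {"type": "Walking", "minutes": 32.0},
    {"type": "Yoga", "minutes": 45.0},
    {"type": "Hiking", "minutes": 120.0},
    {"type": "Walking", "minutes": 28.0},
    {"type": "Tennis", "minutes": 60.0},
    {"type": "Skiing", "minutes": 180.0},
]

# Summary counts by workout type, most frequent first.
counts = Counter(w["type"] for w in workouts)
for activity, n in counts.most_common():
    print(f"{activity:<30}{n}")

# Activities kept for the model (casing here is an assumption).
KEEP = {
    "FunctionalStrengthTraining",
    "Walking",
    "Yoga",
    "TraditionalStrengthTraining",
    "Tennis",
}
kept = [w for w in workouts if w["type"] in KEEP]
```

The keep list is written out by hand because the choice was a judgment call about rare and seasonal activities, not a fixed count threshold.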

I then want to rename these to give a better understanding of what these activities were. Using my gym check-in file, I can tag whether each FunctionalStrengthTraining session was a gym visit or a workout studio (PSF) visit. I perform these transformations and cue the side quest complete music.
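
One way the gym-versus-studio tagging could work is sketched below. The rule is an assumption: a FunctionalStrengthTraining session on a day with a gym check-in is labeled "Gym", and any other such session is labeled "PSF". The `checkin_date` column name and the `label_workout` helper are hypothetical.

```python
import csv
import io
from datetime import date

# Hypothetical stand-in for the gym check-in CSV file.
GYM_CSV = """checkin_date
2023-03-01
2023-03-03
"""

# Set of days with a gym check-in, for fast membership tests.
gym_days = {
    date.fromisoformat(row["checkin_date"])
    for row in csv.DictReader(io.StringIO(GYM_CSV))
}


def label_workout(activity, day, gym_days):
    """Rename FunctionalStrengthTraining to 'Gym' or 'PSF'; leave others as-is."""
    if activity != "FunctionalStrengthTraining":
        return activity
    # Assumed rule: a check-in that day means the session was at the gym.
    return "Gym" if day in gym_days else "PSF"


print(label_workout("FunctionalStrengthTraining", date(2023, 3, 1), gym_days))
print(label_workout("FunctionalStrengthTraining", date(2023, 3, 2), gym_days))
```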

Side Quest 2: Clean Workout Data – Check for Data Inconsistencies

Next, I will look for any values that seem off in my data, starting by looking at summary statistics.

[Screenshot: summary statistics for workout durations]

I can’t imagine having a great workout for 1 or 3 minutes, so I’ll remove those values. I also notice NAs in my data for days when I did not perform a workout. I want to set these to zero since it means I didn’t perform that workout that day.
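
A small sketch of both fixes follows: dropping very short workouts, then building a daily table where a missing workout becomes zero minutes. The 5-minute cutoff is an assumed value (the post only says the 1 and 3 minute entries were removed), and the sample rows are made up. With pandas, the zero-filling would typically be a `fillna(0)` call.

```python
MIN_MINUTES = 5.0  # assumed cutoff; the post only mentions dropping 1 and 3 minute entries

# Hypothetical sample rows.
workouts = [
    {"day": "2023-03-01", "type": "Yoga", "minutes": 45.0},
    {"day": "2023-03-01", "type": "Walking", "minutes": 1.0},
    {"day": "2023-03-02", "type": "Walking", "minutes": 30.0},
    {"day": "2023-03-02", "type": "Walking", "minutes": 3.0},
]

# Drop workouts too short to be a real session.
real = [w for w in workouts if w["minutes"] >= MIN_MINUTES]

# Build a day-by-type table that starts at zero, so a day without a given
# workout shows 0.0 minutes instead of a missing value.
types = sorted({w["type"] for w in real})
days = sorted({w["day"] for w in workouts})
daily = {day: {t: 0.0 for t in types} for day in days}
for w in real:
    daily[w["day"]][w["type"]] += w["minutes"]

for day, row in daily.items():
    print(day, row)
```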

Next, I check the dates included in my dataset. I didn’t get my Apple Watch until February 2023 but I have dates before then. I will remove those. Since I don’t have a full month of October I will cut off the dates at 9/30/2023.

Out[51]: Timestamp('2020-01-05 11:43:47-0600', tz='pytz.FixedOffset(-360)')

Out[52]: Timestamp('2023-10-12 10:53:19-0600', tz='pytz.FixedOffset(-360)')
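
The date trimming can be sketched as a simple range filter. The start date of 2023-02-01 is an assumption, since the post only says the watch arrived in February 2023. The end date is the 9/30/2023 cutoff from above, and the sample timestamps include the earliest and latest values shown in the output.

```python
from datetime import date, datetime

START = date(2023, 2, 1)   # assumption: the exact day the watch arrived is not given
END = date(2023, 9, 30)    # last full month of data

stamps = [
    "2020-01-05 11:43:47 -0600",   # before the watch: dropped
    "2023-02-14 07:00:00 -0600",
    "2023-09-30 18:30:00 -0600",
    "2023-10-12 10:53:19 -0600",   # partial October: dropped
]

parsed = [datetime.strptime(s, "%Y-%m-%d %H:%M:%S %z") for s in stamps]

# Keep only records whose local calendar date falls inside the window.
in_range = [ts for ts in parsed if START <= ts.date() <= END]

print(min(in_range), max(in_range))
```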

My workout data is looking good. Side quest complete!

Side Quest 4: Clean Sleep Data

I am also using sleep data that was tracked by my watch, and I need to do some work with dates here. My sleep records span overnight, so a single sleep session touches two calendar dates. I want to tag each record with the date of the night I went to sleep. I assume that if a record’s hour is later than 8 pm, it belongs to the current date, and if it’s before 12 pm, it belongs to the previous day.
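
Here is a minimal sketch of that date rule. The `sleep_night` helper is hypothetical. The post does not say what happens to records between 12 pm and 8 pm, so this version keeps them on the current date as an assumption.

```python
from datetime import datetime, timedelta


def sleep_night(start):
    """Return the calendar date of the night a sleep record belongs to."""
    if start.hour < 12:
        # Early-morning records belong to the night that began the day before.
        return start.date() - timedelta(days=1)
    # Evening records (and, by assumption, afternoon ones) keep their own date.
    return start.date()


print(sleep_night(datetime(2023, 3, 1, 22, 30)))  # fell asleep in the evening
print(sleep_night(datetime(2023, 3, 2, 1, 15)))   # same night, after midnight
```

Both example records land on the same night, which is what lets a whole sleep session be grouped under one date.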

I’m choosing to filter to only weekdays as I don’t wear my Apple Watch to bed often on the weekends. For weekdays where I didn’t wear my watch to bed, I am using forward fill to assign data to those days. Side quest complete!
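
The weekday filter and forward fill can be sketched like this. The `weekday_ffill` helper and the sample values are hypothetical. A weekday with no sleep record inherits the most recent weekday value, and weekend records are ignored entirely.

```python
from datetime import date, timedelta

# Hypothetical percent-awake values keyed by night.
pct_awake = {
    date(2023, 3, 6): 8.0,    # Monday
    date(2023, 3, 8): 12.5,   # Wednesday
    date(2023, 3, 11): 9.0,   # Saturday: excluded by the weekday filter
}


def weekday_ffill(values, start, end):
    """Keep Monday to Friday only, carrying the last seen value forward."""
    filled = {}
    last = None
    day = start
    while day <= end:
        if day.weekday() < 5:  # Monday=0 ... Friday=4
            if day in values:
                last = values[day]
            if last is not None:  # nothing to carry before the first record
                filled[day] = last
        day += timedelta(days=1)
    return filled


filled = weekday_ffill(pct_awake, date(2023, 3, 6), date(2023, 3, 12))
for day, value in filled.items():
    print(day, value)
```

Forward fill is a simplification worth keeping in mind later: a filled night repeats the previous night’s value and does not reflect that day’s workout.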

Side Quest 5: Explore Sleep and Workout Data – Check Trends

Now that my data is clean, I need to get a better understanding of the metrics. This will help me with my variable and model selection. I aggregated the data points to see if there are any trends in my data. I can’t see much in the way of patterns in my workout splits. However, I do see workout times and time awake following a similar trend. Side quest complete!


Side Quest 6: Explore Workout Data – Check Distributions

I also want to see the distribution of workout durations. My data has a lot of zeros which are impacting the boxplots so I will also look at the histograms to get a better idea of distributions. Side quest complete!

Side Quest 7: Explore Sleep and Workout Data – Check Correlations

I am curious if other sleep cycles could be impacted by my workouts. I also want to see if certain workouts have a stronger correlation than others. I will create a correlation matrix heatmap to visualize this. Correlations are on the low side, with isAwake having the highest. Side quest complete!
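
To show what sits behind a correlation matrix, here is a minimal sketch that computes Pearson correlations between daily columns. The numbers are invented for illustration and are not my results. The heatmap itself is left out, and the `pearson` helper is written by hand so the example has no dependencies.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return float("nan")  # a constant column has no defined correlation
    return cov / sqrt(var_x * var_y)


# Hypothetical daily minutes per workout type, plus minutes awake that night.
daily = {
    "Walking": [30, 0, 45, 0, 20, 0],
    "Yoga": [0, 45, 0, 0, 45, 0],
    "isAwake": [12, 9, 15, 8, 11, 7],
}

# Full matrix: every column against every other column.
matrix = {a: {b: pearson(daily[a], daily[b]) for b in daily} for a in daily}

for name, row in matrix.items():
    print(name, {k: round(v, 2) for k, v in row.items()})
```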

While there’s endless data exploration to do, we will stop here. I’ve leveled up my data enough to prepare for the main quest. I can always come back later if, along the way, I realize I need more experience. Time to review and ensure I am properly equipped to take on my data model. What has all this exploration taught me?

  1. Besides PSF, which has consistent durations, my workout durations are rather random. It’s possible PSF will not give much variability to the model.
  2. My daily workout duration distributions are bimodal with a high frequency of zeros. Besides PSF, the distributions of the non-zeros are skewed. We will take this into consideration when choosing our model type.
  3. Total workout time and total time awake do follow a similar trend, so there may be some other variable that is impacting them both.
  4. Pct Awake aligns with Total Time Awake, implying that the spikes in my time awake were also spikes in the percent of my sleep cycle that was spent in the awake phase.
  5. isAwake has the highest correlations with workout types, although overall correlations are on the low side.

Putting this all together, I can confirm that ‘isAwake’ is the sleep stage I want to look at, since it shows the strongest correlation with my workouts. The amount of time awake is small, so instead of trying to predict the time awake, I will predict whether I’ll have a good or bad night’s sleep. On average, a person spends 9-11% of their sleep time in the awake phase[ii]. I will classify anything that falls above that average as a bad night’s sleep, and anything at or below it as a good night’s sleep. The next step is selecting a suitable predictive algorithm for my data model. I want to consider three model types: logistic regression, random forest, and boosted trees.
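
The good-or-bad label can be sketched as a simple threshold on the percent of sleep spent awake. The 11% cutoff is an assumption that treats the upper end of the 9-11% range as the "average" line, and the helper names are hypothetical.

```python
AWAKE_CUTOFF_PCT = 11.0  # assumption: upper end of the 9-11% average range


def pct_awake(awake_minutes, total_sleep_minutes):
    """Percent of the sleep session spent in the awake phase."""
    return 100.0 * awake_minutes / total_sleep_minutes


def label_night(pct):
    """'bad' if above the cutoff, otherwise 'good' (at or below counts as good)."""
    return "bad" if pct > AWAKE_CUTOFF_PCT else "good"


# Hypothetical nights: (minutes awake, total minutes of sleep session).
for awake, total in [(48, 480), (60, 480)]:
    pct = pct_awake(awake, total)
    print(f"{pct:.1f}% awake: {label_night(pct)}")
```

Turning the target into two classes is what makes logistic regression, random forest, and boosted trees all reasonable candidates.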

Finally, I am not confident this will be a strong predictive model due to the weak correlations and the lack of trends. We’ve made it this far though, so let’s see how it goes. With my decided character build and experience, I can go on to the main quest: trying to predict if I’ll have a good night’s sleep based on my workout split... in my next blog post. *Cue cutscene*

Coming Up

In Part 2, we're diving deep into the world of model creation and evaluation. We will explore three powerful models, unravel the mysteries of accuracy and precision, and unveil the art of model tuning. Join us as we reflect on our creation and plan for future improvements.


Author Bio

Jen Brousseau is a business analyst at Jahnel Group, Inc., a custom software development firm based in Schenectady, NY. Jahnel Group is a consulting firm that specializes in helping companies leverage technology to improve their business operations. We provide end-to-end strategic consulting and deployment services, helping companies optimize their operations and reduce costs through the use of technology. Jahnel Group is an Advanced Tier AWS Services Partner, with expertise in AWS Lambdas, Amazon API Gateway, Amazon DynamoDB, and other AWS services.


[i] How can exercise affect sleep? Sleep Foundation. (2023, October 11). https://www.sleepfoundation.org/physical-activity/exercise-and-sleep#:~:text=Specifically%2C%20moderate%20to%20vigorous%20exercise,in%20bed%20during%20the%20night.

[ii] Whoop. (2020, August 28). Average time in each of the 4 sleep stages [+chart overview]. WHOOP. https://www.whoop.com/us/en/thelocker/average-sleep-stages-time/
