How does my smartphone know what I am doing? - Using Convolutional Neural Networks for Human Activity Recognition with inertial sensors and PyTorch
Since the iPhone was presented back in 2007, we have somehow gotten used to all the amazing features our smart devices are capable of. Many of those "magical" functionalities are now possible thanks to the explosion of MEMS sensors in consumer electronics. To be fair, the work of thousands of engineers at Apple, Google, Bosch and a handful of other high-tech companies is the reason we still get that "magical" feeling once in a while, when we catch ourselves wondering "how on earth is this possible?"
In recent years, a very hot research topic has been making intelligent use of the MEMS sensors that are widely available in every smartphone, tablet, wearable, AR/VR device, drone and IoT device (Jindong Wang et al., 2017). Applications include recognizing activities of daily living and sports activities, sleep patterns, respiration, and health and disease conditions, among others. This article focuses on recognizing Activities of Daily Living (ADL) from smartphone inertial measurements, using the PyTorch framework and Convolutional Neural Networks (CNN).
HAR Datasets
If you have ever worked with an algorithm developer, you will know that the first thing they say, even before their name, is: "GIVE ME THE DATA". You may try to be friendly and have some informal chit-chat with them, but at some point they will abruptly stop you and insist: "NO DATA, NO ALGO!". This is even more true if that algorithm developer is 2 meters tall and comes from Hungary.
Fortunately for us, there are hundreds of public datasets readily available for any kind of experiment, so we don't always need to invest effort in data acquisition, which is expensive and time consuming. In the Human Activity Recognition (HAR) area alone there are dozens of public datasets.
In this article we use the Human Activity Recognition Using Smartphones Data Set. The data were recorded from volunteers wearing a smartphone (Samsung Galaxy S II) on the waist. The dataset has the following characteristics:
- Number of subjects: 30
- Number of activities: 6 (walking, walking upstairs, walking downstairs, sitting, standing and laying)
- Sensors: 2 (3-axis accelerometer and 3-axis gyroscope)
- Sampling rate: 50Hz
- Total number of samples: 11,864,448 (10,299 windows * 128 samples/window * 9 channels)
- Partition train/test: 70%/30%
The recordings have a total of 9 channels: the accelerometer data (X, Y and Z), the gyroscope data (X, Y and Z) and the estimated body acceleration (X, Y and Z), which is obtained by removing the gravitational component from the accelerometer data with a low-pass filter (0.3 Hz cutoff).
Here is a video recorded during the experiments used to create the dataset:
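As an illustration of that separation step, here is a minimal sketch using SciPy; the 0.3 Hz cutoff comes from the dataset description, while the Butterworth filter type and its order are assumptions made for the example:

```python
import numpy as np
from scipy.signal import butter, filtfilt

def split_gravity(total_acc, fs=50.0, cutoff=0.3, order=3):
    """Separate gravity from body acceleration with a low-pass filter.

    total_acc: array of shape (n_samples, 3) with raw accelerometer readings.
    Returns (gravity, body_acc), both of shape (n_samples, 3).
    """
    # Normalised cutoff frequency (Nyquist = fs / 2)
    b, a = butter(order, cutoff / (fs / 2), btype="low")
    gravity = filtfilt(b, a, total_acc, axis=0)   # slowly varying component
    body_acc = total_acc - gravity                # the "body acceleration" channels
    return gravity, body_acc
```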
Dataset acknowledgements: Davide Anguita, Alessandro Ghio, Luca Oneto, Xavier Parra and Jorge L. Reyes-Ortiz. A Public Domain Dataset for Human Activity Recognition Using Smartphones. 21st European Symposium on Artificial Neural Networks, Computational Intelligence and Machine Learning (ESANN 2013), Bruges, Belgium, 24-26 April 2013.
Convolutional Neural Networks for Human Activity Recognition
Conventional pattern recognition approaches to Human Activity Recognition (HAR) have shown successful results (Bulling et al., 2014); however, they involve a high level of human effort, and their potential for recognizing highly complex activities is limited. Hammerla et al., 2016 showed, through thousands of experiments on public HAR datasets, that Recurrent Neural Networks (RNN) and Long Short-Term Memory (LSTM) networks are the most effective at recognizing short activities with a natural order, while Convolutional Neural Networks (CNN) are better at long-term and repetitive activities. Since our dataset is composed of long and repetitive activities (e.g. walking), we are going to use a CNN.
We will use the 9 channels available natively in the dataset as inputs, with the fixed window length of 128 samples, so each input has 9x128 values. As is usual with CNNs, we will increase the number of channels layer by layer (9 -> 32 -> 64), decrease the number of values per channel (128 -> 64 -> 32) and finally pass the result through several fully connected layers, with an output predicting the 6 activity types. In summary:
Input -> Convolution -> Pooling -> Convolution -> Pooling -> Dense -> Dense -> Dense -> Output
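A minimal PyTorch sketch of a network following this progression could look like the one below; the channel and length progression matches the description above, while the kernel sizes and the widths of the fully connected layers are assumptions made for illustration:

```python
import torch
import torch.nn as nn

class HARCnn(nn.Module):
    """1D CNN for HAR: 9 input channels, 128 samples per window, 6 activity classes."""
    def __init__(self, n_channels=9, n_classes=6):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv1d(n_channels, 32, kernel_size=5, padding=2),  # (32, 128)
            nn.ReLU(),
            nn.MaxPool1d(2),                                      # (32, 64)
            nn.Conv1d(32, 64, kernel_size=5, padding=2),          # (64, 64)
            nn.ReLU(),
            nn.MaxPool1d(2),                                      # (64, 32)
        )
        self.classifier = nn.Sequential(
            nn.Flatten(),
            nn.Linear(64 * 32, 128),
            nn.ReLU(),
            nn.Linear(128, 64),
            nn.ReLU(),
            nn.Linear(64, n_classes),
        )

    def forward(self, x):  # x: (batch, 9, 128)
        return self.classifier(self.features(x))

# quick shape check: HARCnn()(torch.randn(4, 9, 128)).shape -> torch.Size([4, 6])
```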
From theory to practice and experimental results
We have prepared a Jupyter notebook on Kaggle for experimenting with the presented network architecture. Some parts of the code are inspired by the starter notebooks on Jovian.ml by Aakash and by this GitHub repository from Jindong Wang, author of the survey mentioned previously (Jindong Wang et al., 2017).
After loading the dataset into Kaggle, we ran the Jupyter notebook with different learning rates (0.0001, 0.005, 0.01 and 0.02), different numbers of epochs (between 50 and 100), and varying sizes of the fully connected layers. The results are committed to Jovian.ml, so the evolution of the performance can be tracked in this link. The test accuracy varied very little between runs, except with the smallest learning rate (0.0001), where the accuracy got stuck at ~35% even after 100 epochs.
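For reference, here is a minimal sketch of how the raw inertial signals can be loaded into PyTorch tensors, assuming the standard layout of the dataset's "Inertial Signals" folder (file names and paths may differ in your copy):

```python
import numpy as np
import torch
from torch.utils.data import TensorDataset, DataLoader

SIGNALS = [
    "body_acc_x", "body_acc_y", "body_acc_z",
    "body_gyro_x", "body_gyro_y", "body_gyro_z",
    "total_acc_x", "total_acc_y", "total_acc_z",
]

def load_split(root, split):
    """Stack the 9 raw signal files of one split into a (N, 9, 128) tensor."""
    xs = [np.loadtxt(f"{root}/{split}/Inertial Signals/{name}_{split}.txt")
          for name in SIGNALS]                                    # each file: (N, 128)
    X = torch.tensor(np.stack(xs, axis=1), dtype=torch.float32)   # (N, 9, 128)
    y = torch.tensor(np.loadtxt(f"{root}/{split}/y_{split}.txt"),
                     dtype=torch.long) - 1                        # labels 1..6 -> 0..5
    return TensorDataset(X, y)

train_dl = DataLoader(load_split("UCI HAR Dataset", "train"), batch_size=64, shuffle=True)
test_dl  = DataLoader(load_split("UCI HAR Dataset", "test"),  batch_size=256)
```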
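The training itself is a standard supervised loop; a simplified sketch is shown below, using cross-entropy loss and plain SGD (the optimizer choice is an assumption made for illustration; the learning rates are the ones listed above):

```python
import torch
import torch.nn.functional as F

def fit(model, train_dl, test_dl, epochs=50, lr=0.01):
    """Train with cross-entropy loss and report test accuracy after each epoch."""
    opt = torch.optim.SGD(model.parameters(), lr=lr)
    for epoch in range(epochs):
        model.train()
        for xb, yb in train_dl:
            loss = F.cross_entropy(model(xb), yb)
            loss.backward()
            opt.step()
            opt.zero_grad()
        model.eval()
        with torch.no_grad():
            correct = sum((model(xb).argmax(dim=1) == yb).sum().item()
                          for xb, yb in test_dl)
            total = sum(len(yb) for _, yb in test_dl)
        print(f"epoch {epoch + 1}: test accuracy {correct / total:.3f}")

# e.g. fit(HARCnn(), train_dl, test_dl, epochs=50, lr=0.01)
```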
The best trial achieved 75.1% accuracy on the test data, as shown below.
The results are not too bad for a start, unless you compare them with those of Almaslukh et al., 2017, who achieved 97.5% accuracy on the same dataset.