Comparison of Neural Networks with Feature Extraction Methods for Depth Map Classification

In this paper, a comparison between feature extraction methods (Radon Cosine Method, Canny Contour Method, Fourier Transform Method, SIFT descriptor and Hough Lines Method) and Convolutional Neural Networks (a proposed CNN and the pre-trained AlexNet) is presented. Depth maps were used for the evaluation of these methods; the tested data were obtained by a Microsoft Kinect camera (IR depth sensor). The feature vectors were classified by a Support Vector Machine (SVM). A confusion matrix was used for the evaluation of the experimental results: each row of the confusion matrix represents the target class of the tested data and each column represents the predicted class. From the experimental results, it is evident that the best results were achieved by the proposed CNN (97.4%), while the pre-trained AlexNet scored 93.7%.


Introduction
Hand gestures can be seen from multiple points of view. Firstly, they can be represented by the motion of the hand. Secondly, they can be represented by the shape of the hand (the position of the fingers). The shape of the hand is determined by the position of the fingertips relative to the palm. For example, one finger straight up with the others folded into a fist is a simple gesture for the number one; two straight fingers represent the number two, and so on. The question is how to describe this shape most effectively. There are several methods to do so: for example, the original RGB colour space is transformed into YCbCr and the K-means segmentation method is applied. Subsequently, the orientation of the hand in the picture is detected by calculating a simple ratio between the width and height of the hand region, and the thumb is detected by measuring pixels on the side of the hand.

Feature Extraction
Feature extraction is a type of dimensionality reduction that efficiently represents the interesting parts of an image as a compact feature vector, i.e. it extracts the information from the raw data that is most relevant for discrimination between the classes. Deep learning models can also be used as automatic feature extraction algorithms.

Radon Cosine Method
Firstly, the Discrete Radon Transform (DRT) [8] is applied over a range of angles to produce the Radon spectrum image. The Radon spectrum image (Fig. 1) appears to be composed of multiple waves. Next, the Discrete Cosine Transform (DCT) is used to describe these waves. After the transform, the spectrum of the image is arranged in such a way that lower frequencies are located in the upper left corner and higher frequencies in the lower right corner. The data in the lower right corner can be neglected without a significant loss of information. The basic block diagram of the Radon Cosine Method (RCM) is shown in Fig. 2.

Fig. 2 The block diagram for Radon Cosine Method
The output of this method is a resized image. A range of sizes was tested to find the highest precision; the best result was measured for a resized image of 20 by 20 pixels. As the last step, this image is transformed into a feature vector. To evaluate this method, the precision, recall and F1 measures were used. The experimental results are stored in a confusion matrix (Fig. 3), where each field contains the number of images.
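A minimal sketch of the RCM pipeline, assuming scikit-image's `radon` and SciPy's `dctn`; the 0-179° angle range and the function name are assumptions, while the 20 × 20 low-frequency crop follows the text:

```python
import numpy as np
from skimage.transform import radon
from scipy.fft import dctn

def rcm_feature_vector(depth_image, size=20):
    # DRT: Radon spectrum over a range of projection angles (0-179 degrees assumed)
    theta = np.arange(180.0)
    spectrum = radon(depth_image, theta=theta, circle=False)
    # DCT: energy concentrates in the upper-left (low-frequency) corner
    coeffs = dctn(spectrum, norm="ortho")
    # keep only the 20x20 low-frequency block and flatten it into a feature vector
    return coeffs[:size, :size].flatten()
```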

Fig. 3 Confusion matrix for Radon Cosine Method
The rows in this matrix represent the actual class of the tested data (Target class) and each column represents the predicted class. The green colour (Fig. 3) represents 100% true positive prediction. The blue colour represents true positive prediction below 100% (true positive). Finally, the red colour represents false prediction (false positive, false negative, true negative). This method recognizes 10 of the classes with 100% accuracy.

Canny Contour Method
The Canny Contour Method (CCM) [9] uses the Canny Edge Detector (CED), as can be seen in Fig. 4. This method works in three steps (Fig. 5). Firstly, the input keyframe is resized. Secondly, a Gaussian filter is applied to blur the image; the function blur from the OpenCV Library (OCVL) was used with a 3 × 3 kernel size. Finally, the CED was applied.

Fig. 5 Image transform in CCM, a) input keyframe, b) after resize and application of Gaussian filter, c) contour image after CED
To reach the required precision of edge detection, non-maximal value suppression is executed; the result of this step is thinner (sharper) lines. The resulting binary image is obtained by applying a threshold [9,10].
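A minimal sketch of the CCM preprocessing, assuming OpenCV and an 8-bit grayscale keyframe; the resize target, the Canny thresholds and the function name are assumptions not stated in the text:

```python
import cv2

def ccm_contour_image(keyframe, size=(150, 150)):
    resized = cv2.resize(keyframe, size)   # step 1: resize the input keyframe
    blurred = cv2.blur(resized, (3, 3))    # step 2: OpenCV blur() with 3x3 kernel
    return cv2.Canny(blurred, 50, 150)     # step 3: Canny Edge Detector
```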
The results achieved using CCM are displayed in a confusion matrix (Fig. 6). Each field in this matrix contains the number of images. The rows represent the actual class of the tested data (Target class) and each column represents the predicted class.

Fourier Transform Method
The Fourier Transform Method (FTM) [11,12] is divided into two main stages (Fig. 7): the image transform (green part) and the Fourier transform (red part).
Each column of the resulting Contour-Curve image is then scanned and the first nonzero value is found. The heights of these values across all columns of the image create the signal vector. The size of the signal vector is 150, which is the width of the image. The spectrum of an image contains information about the energy of the harmonic signals in the original image; more energy is concentrated in the lower frequencies than in the higher ones.

Fig. 8 Image transform in FTM, a) input keyframe, b) binary image, c) contour image, d) after Polar-Cartesian transform, e) Contour-Curve image
Therefore, the higher frequencies contain less information and resemble noise. The spectrum of this signal is acquired after a DFT with 150 samples. The experimental results achieved by the Fourier Transform Method are presented in Fig. 9.
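A minimal sketch of the FTM signal-vector construction under these definitions, assuming NumPy and a binary Contour-Curve image 150 pixels wide; the function name is hypothetical:

```python
import numpy as np

def ftm_feature_vector(contour_curve_image, n_samples=150):
    height, width = contour_curve_image.shape   # expected width: 150 pixels
    signal = np.zeros(width)
    for col in range(width):
        nonzero = np.flatnonzero(contour_curve_image[:, col])
        if nonzero.size:
            signal[col] = nonzero[0]            # height of the first nonzero pixel
    # DFT with 150 samples; the lower frequencies carry most of the energy
    return np.abs(np.fft.fft(signal, n=n_samples))
```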

SIFT Descriptor
The Scale-Invariant Feature Transform (SIFT) [13] is a widely used algorithm for the detection and description of local image features (Fig. 10). Firstly, the key points are found: a Laplacian-of-Gaussian filter (approximated in SIFT by a Difference of Gaussians) is applied to produce a series of smoothed images, and key points are located as extrema in this series.
Secondly, the located key points are described: gradient orientations are accumulated over a 4 × 4 grid around each point. Next, several parameters can be filtered to exclude inappropriate key points; finally, feature points with a size smaller than 20 pixels are excluded [13]. The achieved confusion matrix can be seen in Fig. 11.
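A minimal sketch of SIFT detection and description, assuming the OpenCV implementation; the 20-pixel size filter follows the text, the function name is hypothetical:

```python
import cv2

def sift_descriptors(image, min_size=20):
    sift = cv2.SIFT_create()
    keypoints, descriptors = sift.detectAndCompute(image, None)
    if descriptors is None:
        return None
    # exclude feature points with a size smaller than 20 pixels, as in the text
    keep = [i for i, kp in enumerate(keypoints) if kp.size >= min_size]
    return descriptors[keep]
```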

Hough Lines Method
The Hough Lines Method (HLM) [14] is closely related to the Radon transform. As opposed to the previously mentioned methods, HLM produces more than one feature vector per input image (Fig. 13). Each vector is composed of ρ and θ, where ρ is the distance of the Hough line from the origin and θ is the angle of the line: an angle of 0 corresponds to a vertical line and an angle of π/2 to a horizontal line.
The lines are extracted using the HoughLines function from OpenCV, with the binary image of the keyframe as its input. The algorithm steps can be seen in Fig. 12. These lines (Fig. 13d) are used to describe the shape of an object as a feature vector. No relation is maintained between the lines of one picture, so a specific line can occur in multiple pictures, which lowers the precision (Fig. 14). The input keyframe is resized to eliminate insignificant lines.
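A minimal sketch of the HLM extraction, assuming OpenCV; the accumulator threshold is an assumption not given in the text, as is the function name:

```python
import cv2
import numpy as np

def hlm_feature_vectors(binary_image, threshold=60):
    # each detected line is described by (rho, theta): rho is the distance
    # from the origin, theta the angle (0 = vertical, pi/2 = horizontal)
    lines = cv2.HoughLines(binary_image, 1, np.pi / 180, threshold)
    if lines is None:
        return np.empty((0, 2))
    return lines.reshape(-1, 2)   # one (rho, theta) feature vector per line
```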

AlexNet
AlexNet (Fig. 15) is a deep Convolutional Neural Network (CNN) trained to classify the 1.3 million high-resolution images in the LSVRC-2010 ImageNet training dataset. This dataset contains 1000 different object categories (classes), such as pencil, keyboard or many types of animals [15,16].
AlexNet (Fig. 15) consists of five convolutional layers, three Max Pooling layers and three Fully Connected (FC) layers. The layers labelled in red are the convolutional layers, the Pooling layers are shown in green and the layers marked in yellow are the Fully Connected layers [17,18]. The input images for AlexNet must be colour images; in our case, the input images were grayscale, so they were converted to 3-channel RGB images.

Fig. 15 Basic architecture of AlexNet [15]

The results from AlexNet are stored in a confusion matrix (Fig. 16). Each field in this matrix contains the number of images. The rows in this matrix represent the actual class of the tested data (Target class) and each column represents the predicted class. The green colour (Fig. 16) represents 100% true positive prediction, the blue colour represents true positive prediction below 100% (true positive), and the red colour represents false prediction (false positive, false negative, true negative).
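Regarding the grayscale-to-RGB conversion mentioned above, a minimal sketch assuming OpenCV; the 227 × 227 input resolution and the function name are assumptions:

```python
import cv2

def prepare_for_alexnet(gray_keyframe):
    # replicate the single channel three times to obtain a 3-channel RGB image
    rgb = cv2.cvtColor(gray_keyframe, cv2.COLOR_GRAY2RGB)
    # resize to AlexNet's expected input resolution (227 x 227 assumed here)
    return cv2.resize(rgb, (227, 227))
```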

Experimental Results
The obtained experimental results are presented in this section. The whole system (Fig. 17) is divided into two classification parts (the training part and the testing part); prior to classification, the procedures are identical for both parts. Firstly, the captured depth video sequence is processed by a segmentation algorithm: the hand region is extracted from the background and keyframes are selected. Secondly, a feature extraction method is applied to these segmented keyframes. The resulting feature vectors are the input for classification.
Feature vectors intended for training are labelled by their respective class. A model for static recognition is generated in the training part. Using this model and a given feature vector, static recognition predicts 3 labels for static classes. Two motion vectors are added to these labels to create the feature vector for dynamic recognition: the first motion vector stores the difference in hand position between the first and the second keyframes, and similarly, the second motion vector stores the difference between the second and the third keyframes.

Fig. 17 Overview of the proposed system

Dataset
The dynamic gesture database was produced by 10 actors (3 females and 7 males). As some shapes are shared by multiple gestures, this leaves only 15 unique hand shapes. The image dataset contains these 15 static gestures (Fig. 18); the 15 classes give a total of 1350 images for training and 150 images for testing. All test sequences were produced by a Microsoft Kinect camera at 640 by 480 pixels. The database contains information about the whole dynamic gesture. Because of the different gesture tempos, the sequences did not have the same length. Moreover, these sequences were re-processed to a resolution of 150 by 150 pixels.

Proposed CNN
The CNN consists of several layers. Each layer takes a multi-dimensional array of numbers as input and creates another multi-dimensional array of numbers as output. To create the CNN architecture, these types of layers have been used: Convolution (Conv) layer, ReLU layer, Max Pooling layer and Fully-Connected layer [17,18]. The CNN is a network composed of layers that transform an input image from the original pixel values to the final layer scores (layer by layer).
Our proposed CNN has two convolutional layers (excluding the source input data layer), two fully-connected layers, three ReLU layers and a max pooling layer (Fig. 19). Each layer has multiple feature maps, each of which can extract one selected feature through a convolution filter and contains multiple neurons. The proposed CNN (Fig. 19) is divided into 9 main blocks (A-I), sketched in the code after this list:
• block A - the input data images, reshaped as vectors,
• block B - a 2D convolutional layer with 32 feature maps and a 3 × 3 kernel,
• block C - a Rectified Linear Unit (ReLU), whose derivative is either 0 or 1,
• block D - a Max Pooling layer with a 2 × 2 window, followed by dropout with probability 0.25; Max Pooling is downsampling in a CNN, created by applying a max filter (in our case a 2 × 2 filter),
• block E - a 2D convolutional layer with the same parameters as in block B,
• block F - a Rectified Linear Unit (ReLU), whose derivative is either 0 or 1,
• block G - a standard dense layer,
• block H - the output of the last dropout layer is passed to the Softmax loss layer,
• block I - the final output is the Softmax activation function (validation of the training progress).
In the proposed CNN (Fig. 19), the pooling operations are applied separately to each feature map. Generally, the more convolutional steps we have, the more complex the features our proposed network will be able to learn to recognize. For example, in image classification, a CNN can learn to detect edges from the raw pixels in the first layer, then use the edges to detect simple shapes in the second layer, and then use these shapes to detect higher-level features (such as face shapes) in the higher layers.
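A minimal sketch of blocks A-I, assuming a Keras/TensorFlow implementation; the dense-layer width (128) and the placement of the last dropout layer are assumptions, while the filter counts, kernel sizes, pooling window and dropout rate follow the text:

```python
from tensorflow.keras import layers, models

NUM_CLASSES = 15  # 15 static gesture classes (see the Dataset section)

model = models.Sequential([
    layers.Input(shape=(150, 150, 1)),                 # block A: input depth keyframe
    layers.Conv2D(32, (3, 3), padding="same"),         # block B: 32 maps, 3x3 kernel
    layers.ReLU(),                                     # block C: ReLU
    layers.MaxPooling2D((2, 2)),                       # block D: 2x2 max pooling ...
    layers.Dropout(0.25),                              # ... with dropout p = 0.25
    layers.Conv2D(32, (3, 3), padding="same"),         # block E: same as block B
    layers.ReLU(),                                     # block F: ReLU
    layers.Flatten(),
    layers.Dense(128),                                 # block G: dense layer (width assumed)
    layers.ReLU(),
    layers.Dropout(0.25),                              # last dropout layer (rate assumed)
    layers.Dense(NUM_CLASSES, activation="softmax"),   # blocks H-I: Softmax output
])
model.compile(optimizer="adam", loss="categorical_crossentropy", metrics=["accuracy"])
```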

Fig. 20 Confusion matrix for proposed CNN
The results from the proposed CNN are stored in a confusion matrix (Fig. 20). Each field in this matrix contains the number of images. The rows represent the actual class of the tested data (Target class) and each column represents the predicted class. For example, in Fig. 20, in row 10, which is for the input images of class 10, two images were wrongly recognized as images of class 11 and 8 images were correctly recognized as images of class 10. The proposed CNN can also be used within the project PREDICON (the short-term PREDICtion of photovoltaic energy production for the needs of pOwer supply of Intelligent BuildiNgs) to identify clouds in the sky.

Results
The problem of multiclass classification can be understood as a set of many binary classification problems, one for each class. To evaluate the methods described above, the precision (the number of items that have been correctly identified as positive out of all items identified as positive), recall (the number of items that have been correctly identified as positive out of all real positive items) and F1 (the harmonic mean of precision and recall) measures were used. These methods were compared with the proposed CNN (Table 1). To ensure the real-time capability of the system, three time parameters were measured. The time needed to train the model is the first measured parameter. The time to calculate the feature vector (FV) is the next one. The overall time of prediction is the last measured time; this parameter represents the time needed for feature vector extraction and the prediction of the Support Vector Machine (SVM) model on that vector. The evaluation scores:
• precision is the ratio of the true positives to the sum of all data predicted as positive; it is calculated from all true positives and false positives in the system as follows: $\mathrm{precision} = \frac{TP}{TP + FP}$,
• recall is, analogously, the ratio of the true positives to all real positive items: $\mathrm{recall} = \frac{TP}{TP + FN}$,
• the F1 measure is a combination of precision and recall; it is calculated as their harmonic mean in the following formula: $F1 = \frac{2 \cdot \mathrm{precision} \cdot \mathrm{recall}}{\mathrm{precision} + \mathrm{recall}}$.
Next, the combinations of the two proposed cluster separation methods (Simple Euclidean Distance (SED) and Mean Value Distance (MVD)) and the two keyframe extraction methods (Global Vector Median (GVM) and Local Vector Median (LVM)) were tested. The Simple Euclidean Distance (SED) method calculates the Euclidean distance of two neighbouring frames. It is the distance of two vectors (each frame reshaped to a vector by lines) in n-dimensional space, where n is given by the number of depth map pixels. Thus, if a sequence contains 14 frames, there are 13 neighbouring distance features. Let d be the vector of SED distances, S be a frame sequence of length E and L2 be the function that calculates the Euclidean distance of two vectors:

$d_i = L2(S_i, S_{i+1}), \quad i = 1, \dots, E - 1$.
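A minimal sketch of the SED computation under this definition, assuming NumPy and the frames of a sequence stacked into a single array; the function name is hypothetical:

```python
import numpy as np

def sed_distances(sequence):
    # sequence: array of shape (E, H, W); each frame is reshaped to a vector
    flat = sequence.reshape(len(sequence), -1).astype(float)
    # E-1 Euclidean distances between neighbouring frames (13 for 14 frames)
    return np.linalg.norm(flat[1:] - flat[:-1], axis=1)
```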

The Mean Value Distance (MVD) method calculates the 2D mean value of the differences of two neighbouring frames. Let m be the vector of MVD distances, S be the frame sequence of length E and 'mean' be the function that calculates the mean value of a matrix:

$m_i = \mathrm{mean}(S_{i+1} - S_i), \quad i = 1, \dots, E - 1$.

The Global Vector Median (GVM) method uses a global feature to measure frame similarity. Let Q be a frame cluster with a length of F frames and A be the matrix of Euclidean distances between all combinations of frames forming the cluster. The minimum of the vector of summed distances determines the position of the frame that represents the whole cluster.
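A minimal sketch of MVD and GVM under the definitions above, assuming NumPy; the row-sum selection rule in GVM is our reading of the text and the function names are hypothetical:

```python
import numpy as np

def mvd_distances(sequence):
    # 2D mean value of the differences of neighbouring frames
    diffs = sequence[1:].astype(float) - sequence[:-1].astype(float)
    return diffs.mean(axis=(1, 2))

def gvm_representative(cluster):
    # A: Euclidean distances between all frame combinations in the cluster
    flat = cluster.reshape(len(cluster), -1).astype(float)
    A = np.linalg.norm(flat[:, None, :] - flat[None, :, :], axis=2)
    # the frame with the minimum summed distance represents the whole cluster
    return int(np.argmin(A.sum(axis=1)))
```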
In contrast to GVM, the LVM method uses a local feature to measure frame similarity. Let $\mathrm{med}_{m,n}$ be the median value of all cluster frames at the spatial position (m, n):

$\mathrm{med}_{m,n} = \mathrm{median}(Q_1(m,n), \dots, Q_F(m,n)), \quad m = 1, \dots, M, \; n = 1, \dots, N$,

where M is the width and N is the height of the depth map frame. The intention was to select an appropriate combination of keyframe extraction methods and keyframe match methods. The three significant depth map images that represent the dynamic gesture were selected, as can be seen in Fig. 21. For the purpose of complex method evaluation, the parameter KFM (Key Frame Match) was applied; its goal is the objective measurement of differences in the artificial approach to keyframe extraction (Fig. 21). Let $X_{a,g}$ be the value of the keyframe position in the appropriate cluster achieved by the proposed method. Likewise, let $L_{a,g}$ and $H_{a,g}$ be the lower and higher values of the keyframe position range. If $X_{a,g}$ falls into the range $\langle L_{a,g}, H_{a,g} \rangle$, the matrix element $D_{cluster}(a, g)$ acquires the value 1, as follows:

$D_{cluster}(a, g) = \begin{cases} 1 & \text{if } X_{a,g} \in \langle L_{a,g}, H_{a,g} \rangle, \\ 0 & \text{otherwise.} \end{cases}$
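A minimal sketch of LVM, assuming NumPy; selecting the frame closest to the per-pixel median frame is an assumption consistent with the definition above, and the function name is hypothetical:

```python
import numpy as np

def lvm_representative(cluster):
    # med[m, n]: the median of all cluster frames at spatial position (m, n)
    med = np.median(cluster.astype(float), axis=0)
    # assumed selection rule: the frame closest to the per-pixel median frame
    dists = np.linalg.norm((cluster - med).reshape(len(cluster), -1), axis=1)
    return int(np.argmin(dists))
```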

Fig. 21 Keyframe extraction (panels: Cluster 1, Cluster 2, Cluster 3)
The final comparison of the developed methods is presented in Fig. 22. In an ideal case, a superior method would be located in the top-right corner. As can be seen, the SED-GVM method achieved the best score in both parameters.

Conclusion
In this study, we performed a comparison between Convolutional Neural Networks (specifically AlexNet and a proposed CNN) and feature extraction methods, namely the Radon Cosine Method, Canny Contour Method, Fourier Transform Method, SIFT descriptor and Hough Lines Method. We tested the performance of these methods on a set of depth maps of various hand gestures captured by a Microsoft Kinect camera. For the image classification, we used the SVM algorithm. In conclusion, we can state that the most satisfactory results were achieved by the proposed CNN, which reached an F1 score of 97.4% (the combination of precision and recall). The pre-trained AlexNet model reached an F1 score of 93.7%; however, it should be noted that the AlexNet model is not designed for grayscale images. Standard feature extraction methods use predefined kernels (RCM, FTM, SIFT, filters such as Canny, etc.); among them, the best experimental results (about 91.9%) were obtained by the Radon Cosine Method (RCM). In contrast, a CNN constructs its kernels from the statistical properties of the training dataset: layers close to the input of the network act as feature extractors and layers at the end act as classifiers.
In future work, we will explore the use of a Convolutional Neural Network for a complex cloud recognition system. The test dataset for this system will be composed of depth cloud maps and will be created by our department. We also plan to improve the precision of the proposed Convolutional Neural Network and test the proposed system on a larger dataset (images of the clouds in the sky) within the project PREDICON (the short-term PREDICtion of photovoltaic energy production for the needs of pOwer supply of Intelligent BuildiNgs). As clouds have various shapes, it will be entirely up to the Convolutional Neural Network to find suitable patterns for class recognition. For this reason, we plan to design new activation functions.