Ταξινόμηση βίντεο με αναδρομικά νευρωνικά δίκτυα
Video classification with recurrent neural networks
Keywords
LSTM ; Recurrent neural networks ; Neural networks ; Classification ; Pose estimation ; Video classification
Abstract
This project performs video classification by training a network of stacked LSTM cells to
recognize the sport being played in a subset of the Sports-1M dataset. Its contribution is
that, unlike traditional video classification methods, which feed frame images to the network, it uses
Carnegie Mellon’s OpenPose pose-estimation library to extract human poses from a predefined number
of frames and uses them as input features to the network. This is intended to help the network identify
and learn the movement patterns of each sport. The main challenge was that Sports-1M
is a machine-generated dataset of user-produced videos and is therefore susceptible to
noise. The noise comes from unrelated videos mistakenly selected by YouTube’s annotation system
and from users not focusing on the sport itself but instead zooming into the crowd or the face of a
player, zooming out on the empty field, etc. Apart from the common difficulties that unconstrained videos introduce,
such as varied illumination, scale, camera motion and viewpoint, the videos in this dataset also vary substantially in
duration and resolution. The approach taken to counter these challenges was to define a
fixed window of 30 frames for each video (sampled at 2 frames per second, i.e. 15 seconds of video), with the selection
beginning after 30% of the video’s run time, in order to increase the probability of capturing the sport in
action. Furthermore, to control the quantity and quality of the people selected from each frame, the detected people
were ranked by an index of interest, which quantifies how large, complete and central each person is
relative to the others in the frame, and only the 2 highest-ranked people were kept. Finally, after
hyperparameter tuning, the network reached 89% accuracy for 5 sport classes and 73%
for 10 sport classes. This was achieved with stacked LSTM layers of 64 and 32 units
respectively, with L1 and L2 regularizers applied at each layer, followed by a densely connected layer
with as many units as there are sport classes.
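
The frame window and the person filter described in the abstract could be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the project's actual code: the abstract does not give the formula for the index of interest, so the product of size, completeness and centrality used here is hypothetical, as are the names sample_timestamps, Person, interest_index and top_people. Keypoints are assumed to be (x, y, confidence) triples, with confidence 0 meaning the joint was not detected.

```python
from dataclasses import dataclass
from math import hypot


def sample_timestamps(duration_s, n_frames=30, rate_fps=2.0, start_fraction=0.30):
    """Return up to n_frames timestamps (in seconds), spaced 1/rate_fps apart,
    starting after start_fraction of the video's run time."""
    start = duration_s * start_fraction
    step = 1.0 / rate_fps
    stamps = [start + i * step for i in range(n_frames)]
    # Assumption: timestamps past the end of a short video are dropped;
    # the abstract does not say how such videos were handled.
    return [t for t in stamps if t < duration_s]


@dataclass
class Person:
    # Hypothetical container for one detected pose.
    keypoints: list  # (x, y, confidence) triples; confidence 0 = not detected


def interest_index(person, frame_w, frame_h):
    """Hypothetical score (higher = more interesting) combining how large,
    how complete and how central a detected person is."""
    seen = [(x, y) for x, y, c in person.keypoints if c > 0]
    if not seen:
        return 0.0
    xs = [x for x, _ in seen]
    ys = [y for _, y in seen]
    # Size: bounding-box area of the detected joints relative to the frame area.
    size = (max(xs) - min(xs)) * (max(ys) - min(ys)) / (frame_w * frame_h)
    # Completeness: fraction of joints that were actually detected.
    completeness = len(seen) / len(person.keypoints)
    # Centrality: 1 at the frame center, falling to 0 at the corners.
    cx, cy = sum(xs) / len(xs), sum(ys) / len(ys)
    dist = hypot(cx - frame_w / 2, cy - frame_h / 2)
    centrality = 1.0 - dist / hypot(frame_w / 2, frame_h / 2)
    return size * completeness * centrality


def top_people(people, frame_w, frame_h, k=2):
    """Keep the k people with the highest interest index in one frame."""
    ranked = sorted(people, key=lambda p: interest_index(p, frame_w, frame_h),
                    reverse=True)
    return ranked[:k]
```

For a given video, sample_timestamps(duration) would give the times at which frames are read, and top_people would be applied to the poses detected in each of those frames before they are passed to the network as features.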