Open Access System for Information Sharing

Department of Creative IT Engineering (창의IT융합공학과) 3. Theses_Ph.D.

Thesis

Learning Temporal Dynamics for Video Action Recognition

Title: Learning Temporal Dynamics for Video Action Recognition

Authors: Heeseung Kwon (권희승)

Date Issued: 2021

Publisher: Pohang University of Science and Technology (POSTECH, 포항공과대학교)

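Abstract: This doctoral dissertation addresses the problem of recognizing actions that occur in video clips. Video action recognition generally has to recognize actions in short clips made up of hundreds of image frames, which calls for a method that effectively captures how information changes over the course of the video while efficiently processing those hundreds of frames. This dissertation proposes three methods that use features extracted by neural networks to capture temporal changes in video at low computational cost. The first main chapter introduces a technique that uses frame-wise features obtained from a trained convolutional neural network to capture the aggregated context of a video. The chapter proposes several techniques for pooling the extracted frame-wise features along the temporal axis and analyzes their effects. Among these, gradient-based pooling techniques, which use the differences between adjacent frame-wise features, yielded meaningful gains in action recognition, indicating that differences between high-level features capture temporal changes in video well. Because the proposed pooling techniques require no additional parameters or training, they are computationally efficient, and the final video representation vector produced by combining them achieved the best performance at the time on action recognition benchmarks. The second main chapter builds on the first chapter's finding that feature-level differences can produce meaningful information, and introduces a motion extraction method that captures short-term changes between adjacent frames of a video. The proposed motion extraction module learns correspondences between adjacent frame-wise features and is inserted in the middle of a convolutional neural network, so that motion features can be learned end to end. The ablation studies in the dissertation demonstrate that the module is effective, and the architecture that includes it shows the best trade-off among recognition accuracy, computation, and number of parameters in comparison with other architectures. The third main chapter introduces spatio-temporal self-similarity, which extends the learning of correspondences between adjacent frame-wise features, and a method for learning general and comprehensive motion features from it. A spatio-temporal self-similarity tensor of sufficient volume contains richer and more comprehensive temporal information than conventional optical flow, which represents motion between two frames. The dissertation proposes a method that effectively extracts self-similarity features from this tensor and integrates the extracted features, and implements it as a single efficient block. Like the module in the second chapter, this block is inserted in the middle of a convolutional neural network and learns self-similarity features end to end. The analyses show that the resulting self-similarity features effectively capture general and comprehensive motion information in video, that they are complementary to 3D convolution, and that they can improve the robustness of the architecture. In addition, comparison experiments show that the architecture containing the proposed block achieves the best performance on three different action recognition benchmarks.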
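The idea of pooling frame-wise features along the temporal axis, including pooling over differences between adjacent frames, can be illustrated with a minimal NumPy sketch. The record does not list the exact operators used in the dissertation, so the four operators below, the function names `temporal_pool` and `video_descriptor`, and the L2 normalization are illustrative assumptions rather than the author's implementation.

```python
import numpy as np


def temporal_pool(features: np.ndarray) -> dict:
    """Pool frame-wise features of shape (T, D) along the temporal axis.

    The operator set is an assumption chosen for illustration; none of the
    operators has trainable parameters.
    """
    if features.ndim != 2 or features.shape[0] < 2:
        raise ValueError("expected an array of shape (T, D) with T >= 2")
    # Differences between adjacent frame-wise features, shape (T - 1, D).
    diffs = np.diff(features, axis=0)
    return {
        "mean": features.mean(axis=0),
        "max": features.max(axis=0),
        "sum_abs_diff": np.abs(diffs).sum(axis=0),  # total change per dimension
        "max_diff": diffs.max(axis=0),              # largest increase per dimension
    }


def video_descriptor(features: np.ndarray) -> np.ndarray:
    """Concatenate the pooled vectors and L2-normalize (hypothetical choice)."""
    pooled = temporal_pool(features)
    keys = ("mean", "max", "sum_abs_diff", "max_diff")
    vec = np.concatenate([pooled[k] for k in keys])
    norm = np.linalg.norm(vec)
    return vec / norm if norm > 0 else vec
```

For a clip with T frames and D-dimensional CNN features, `video_descriptor` returns a single vector of length 4D that could be passed to any classifier.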
Action recognition is a fundamental task in video understanding, whose goal is to recognize what people are doing in a given video by classifying it into target action classes. While recent neural models have achieved remarkable progress in image recognition, recognizing the contents of videos raises different challenges. Since a video represents an evolution of images over time, action recognition requires capturing temporal dynamics across image frames while processing them efficiently in terms of memory and time. In this dissertation, we introduce three methods that effectively learn the temporal dynamics of videos at low computational cost. First, we investigate different kinds of temporal pooling operators that aggregate frame-wise features from convolutional neural networks (CNNs), and analyze the effect of each operator. In particular, pooling operators based on differences between adjacent frame-wise features generate an effective video representation without additional parameters. The results validate that convolutional frame-wise features are useful for capturing temporal dynamics. Second, following this insight, we propose a motion extraction method that captures temporal dynamics between adjacent frame-wise features. The external and heavy computation of optical flow is replaced by a trainable neural module that extracts motion features. The proposed module learns correspondences across frames and converts them into motion features without additional supervision. It extracts motion information efficiently, at a cost of only 2.5% additional FLOPs and 1.2% additional parameters. Third, we extend the motion extraction method with spatio-temporal self-similarity (STSS) to learn a generalized and rich motion representation. Given a sequence of frames, STSS represents each local region as similarities to its neighbors in space and time. A sufficiently large STSS volume provides a far-sighted view of motion, i.e., both short-term and long-term, both forward and backward, as well as spatial self-motion. By implementing the STSS learning process as a neural block inserted into a neural network, the network efficiently learns the rich temporal context of videos. To demonstrate their effectiveness, we evaluate the proposed methods on diverse action recognition benchmarks. Experimental results show that the proposed methods successfully capture temporal dynamics at only a small additional cost and achieve state-of-the-art results on action recognition benchmarks.
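The STSS definition above (each local region represented by its similarities to neighbors in space and time) can be sketched in NumPy as follows. This only computes a raw similarity tensor; the learned block described in the abstract is not reproduced. The function name `stss`, the use of cosine similarity, the `l_radius` and `s_radius` parameters, and zero padding at the borders are assumptions made for illustration.

```python
import numpy as np


def stss(features: np.ndarray, l_radius: int = 1, s_radius: int = 1,
         eps: float = 1e-8) -> np.ndarray:
    """Spatio-temporal self-similarity of a feature volume (illustrative).

    features: array of shape (T, H, W, C).
    Returns an array of shape (T, H, W, 2*l_radius+1, 2*s_radius+1, 2*s_radius+1)
    whose entry [t, h, w, dl, du, dv] is the cosine similarity (an assumed
    choice) between position (t, h, w) and its neighbor at offset
    (dl - l_radius, du - s_radius, dv - s_radius). Neighbors outside the
    volume contribute a similarity of zero (zero padding, also an assumption).
    """
    T, H, W, _ = features.shape
    norms = np.linalg.norm(features, axis=-1, keepdims=True)
    f = features / np.maximum(norms, eps)  # unit-length channel vectors
    padded = np.pad(
        f,
        ((l_radius, l_radius), (s_radius, s_radius), (s_radius, s_radius), (0, 0)),
    )
    n_l, n_s = 2 * l_radius + 1, 2 * s_radius + 1
    out = np.zeros((T, H, W, n_l, n_s, n_s), dtype=f.dtype)
    for dl in range(n_l):
        for du in range(n_s):
            for dv in range(n_s):
                # Shifted view of the padded volume: the neighbor at this offset.
                neighbor = padded[dl:dl + T, du:du + H, dv:dv + W]
                out[:, :, :, dl, du, dv] = (f * neighbor).sum(axis=-1)
    return out
```

Temporal offsets below and above `l_radius` along the fourth axis correspond to backward and forward neighbors, and the slice at temporal offset zero holds the purely spatial self-similarity. A larger `l_radius` widens the temporal range the tensor covers.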

URI: http://postech.dcollection.net/common/orgView/200000371430
https://oasis.postech.ac.kr/handle/2014.oak/111056

Article Type: Thesis

Files in This Item: There are no files associated with this item.
