This paper presents a study of various key-frame-based video summarization techniques available in the literature. There is a tremendous need for video processing applications that can deal with the abundance of available and accessible videos. One research area of interest is video summarization, which aims to create a summary of a video to enable quick browsing of large video databases. It is also useful for allied video processing applications such as video indexing and retrieval. Video summarization is the process of creating and presenting a meaningful abstract view of an entire video within a short period of time. Two main types of video summarization techniques are available in the literature: key-frame-based summarization and video skimming. In key-frame-based video summarization, the selection of key frames plays an important role in an effective, meaningful and efficient summarization process.
Keywords
Video Summarization, Key Frame, Video Skim, Euclidean Distance, Depth Factor
INTRODUCTION
The rapid development of digital video capture and editing technology has led to an increase in video data, creating the need for effective techniques for video retrieval and analysis [2].
Advances in digital content distribution and digital video recorders have made recording digital content easy. However, the user may not have enough time to watch an entire video. In such cases, the user may want to view an abstract of the video instead of watching the whole video, where the abstract conveys information about the occurrence of various incidents in the video [2].
As the name implies, video summarization is a mechanism for generating a short summary of a video, which can either be a sequence of stationary images (key frames) or moving images (video skims) [2]. A video can thus be summarized in two different ways, as follows.
1) Key Frame Based Video Summarization
Key frames are also called representative frames, R-frames, still-image abstracts or a static storyboard; a key frame set consists of a collection of salient images extracted from the underlying video source [2]. The following are some of the challenges that should be taken care of while implementing a key-frame-based algorithm:
1. Redundancy: frames with only minor differences may be selected as key frames.
2. When there are varied changes in content, it is difficult to cluster the frames.
2) Video Skim Based Video Summarization
This is also called a moving-image abstract, a moving storyboard, or a summary sequence [2]. The original video is segmented into various parts, each of which is a video clip of shorter duration. The segments are joined by either a cut or a gradual effect. A movie trailer is the best example of video skimming.
The paper is organized as follows. Section II discusses related work, Section III gives an overview and classification of key-frame-based video summarization, Section IV describes methods for video summarization, and Section V concludes the paper.
RELATED WORK
A video summary represents an abstract view of the original video sequence and can be used in video browsing and retrieval systems. It can be a highlight of the original sequence, i.e., the concatenation of a user-defined number of selected video segments, or a collection of key frames. Different methods can be used to select key frames.
The triangle model of perceived motion energy (PME) [4] models motion patterns in video. The frames at the turning points between motion acceleration and motion deceleration are selected as key frames. The key-frame selection process is threshold-free and fast, and the extracted key frames are representative.
In the visual frame descriptors algorithm [5], three visual features are used for the selection of key frames: a color histogram, wavelet statistics and an edge direction histogram. Similarity measures are computed for each descriptor and combined to form a frame difference measure. Fidelity, shot reconstruction degree and compression ratio are used to evaluate the video summary [5].
In the motion attention model [6], shots are detected using the color distribution and the edge covering ratio, which increase the accuracy of shot detection. Key frames are extracted from each shot using the motion attention model: the first and last frames of every shot are taken as key frames, and the others are extracted by the motion attention model [3][6]. These key frames are then clustered, and a priority value is computed by estimating the motion energy and color variation of the shots.
In the multiple visual descriptor features algorithm [7], the key frames are selected by constructing a cumulative graph of the frame difference values. Frames at points of sharp slope indicate significant visual change; hence they are selected and included in the final summary.
The motion focusing method [8] focuses on one constant-speed motion and aligns the video frames by fixing the focused motion into a static situation. A summary is generated containing all moving objects, embedded with spatial and motion information. Background subtraction and min-cut are the main tools used in motion focusing.
In the camera motion and object motion method [9], the video is segmented using camera-motion-based classes: pan, zoom in, zoom out and fixed. The final key frames are selected from each of these segments based on confidence values formulated for the zoom, pan and steady segments.
KEY FRAMES BASED VIDEO SUMMARIZATION
As explained in [1], key-frame-based video summarization works on frames, so the first step is to extract frames from the original video frame sequence. In the second step, extracted video frames with redundant content are clustered, which obviates the need for shot detection. Key frames are then selected in the third step. The entire procedure is shown in Fig. 2.
As summarized in [11], key-frame-based video summarization can be classified in three different ways, as follows.
1) Classification based on sampling
This method [11] chooses key frames by uniform or random sub-sampling, without considering the video content. The summary produced by such methods does not represent all parts of the video and may contain redundant key frames with similar content.
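As a minimal illustration of sampling-based selection (the function name and parameters are ours, not from [11]), the sketch below picks key-frame indices every step frames or as k random frames, entirely ignoring content:

import numpy as np

def sample_key_frames(num_frames, step=None, k=None, seed=0):
    """Content-agnostic key-frame indices: uniform every `step` frames,
    or `k` random frames kept in temporal order."""
    if step is not None:
        return list(range(0, num_frames, step))       # uniform sampling
    rng = np.random.default_rng(seed)                  # random sampling
    return sorted(rng.choice(num_frames, size=k, replace=False).tolist())

For example, sample_key_frames(3000, step=150) and sample_key_frames(3000, k=20) both ignore what the frames show, which is exactly why such summaries can miss parts of the video or repeat similar content.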
2) Classification based on scene segmentation
This method [11] extracts key frames using scene detection, where a scene comprises all parts of the video with a semantic link, or set in the same space or at the same time. The disadvantage of these techniques is that the produced summary does not take the temporal position of frames into account.
3) Classification based on shot segmentation
This method [11] extracts key frames adapted to the video content. Such methods take the first frame of each shot, or the first and last frames of the shot, as key frames. They are effective for stationary shots with small content variation, but they do not provide an adequate representation of shots with strong movement.
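Once shot boundaries are known, the shot-based rule reduces to a few lines; this sketch (names are ours) returns the first, and optionally the last, frame of each shot:

def shot_key_frames(shot_boundaries, use_last=False):
    """First (optionally also last) frame index of each shot.
    `shot_boundaries` is a list of (start, end) frame-index pairs."""
    keys = []
    for start, end in shot_boundaries:
        keys.append(start)
        if use_last and end != start:
            keys.append(end)
    return keys

As the text notes, this is adequate for stationary shots but under-represents shots with strong motion.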
VIDEO SUMMARIZATION METHODS
The following are the various key frame extraction methods described by Sujatha and Mudenagudi in [3], along with other methods.
1) Video Summarization By Clustering Using Euclidean Distance [1]
This method is based on removing redundant video frames that have almost similar content. Like many other approaches, the entire video material is first clustered into nodes, each containing frames of similar visual content [1][10]. By representing each cluster with its most representative frame, a set of key frames is obtained which summarizes the given sequence [1][10]. The procedure for this method is shown in Fig. 3 [1].
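A minimal sketch of this clustering step, assuming each frame is represented by a feature vector such as a color histogram (the threshold and the feature choice are assumptions; [1] details the actual procedure):

import numpy as np

def summarize_by_clustering(features, threshold=0.3):
    """Greedy sequential clustering of frame features by Euclidean
    distance; the frame closest to each cluster centroid becomes the
    key frame. `features` is an (n_frames, d) array."""
    clusters, current = [], [0]
    for i in range(1, len(features)):
        centroid = features[current].mean(axis=0)
        if np.linalg.norm(features[i] - centroid) <= threshold:
            current.append(i)          # similar content: same cluster
        else:
            clusters.append(current)   # content changed: open new cluster
            current = [i]
    clusters.append(current)
    key_frames = []
    for members in clusters:
        centroid = features[members].mean(axis=0)
        dists = np.linalg.norm(features[members] - centroid, axis=1)
        key_frames.append(members[int(np.argmin(dists))])
    return key_frames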
2) Perceived Motion Energy Model (PME)
As described by T. Liu et al. [4], motion is the most salient feature in presenting actions or events in video and, thus, should be the feature that determines key frames [4]. The method builds a triangle model of perceived motion energy to model motion patterns in video and a scheme to extract key frames based on this model. The PME is a combined metric of motion intensity and the kind of motion, with more emphasis on the dominant video motion [3]. The average magnitude Mag(t) of the motion vectors in the entire frame is calculated as described in [3][4] as

\[ Mag(t) = \frac{1}{N}\sum_{i=1}^{N}\big(\lVert MV_F(t,i)\rVert + \lVert MV_B(t,i)\rVert\big), \]

where MV_F(t,i) represents the forward motion vector and MV_B(t,i) the backward motion vector of macroblock i, and N is the number of macroblocks in the frame.
The percentage of the dominant motion direction, α(t), is defined in [3][4] as

\[ \alpha(t) = \frac{\max_{k} AngH(t,k)}{\sum_{k=1}^{n} AngH(t,k)}, \]

where AngH(t,k) is the angle histogram with n bins. The PME of a B-frame is computed in [4] as P(t) = Mag(t)·α(t). The PME values of the frames are plotted, representing a sequence of motion triangles. The frames at the turning points between motion acceleration and motion deceleration are selected as key frames. The key frame selection process is threshold-free and fast [3][4]. First, the video sequence is segmented into shots using the twin-comparison method, and the key frames are then selected based on the motion patterns within the shots. For shots with a motion pattern, the triangle model is used to select the key frame, whereas for shots with no motion pattern, the first frame is chosen as the key frame [4]. The satisfaction rate for sports and entertainment video is found to be good, since more action exists there than in home and news video [3].
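A sketch of the per-frame PME computation from macroblock motion vectors, following the formulas above (the array shapes and the bin count are assumptions):

import numpy as np

def pme(mv_fwd, mv_bwd, n_bins=16):
    """P(t) = Mag(t) * alpha(t) for one frame, given (N, 2) arrays of
    forward and backward macroblock motion vectors."""
    mvs = np.vstack([mv_fwd, mv_bwd])
    mags = np.linalg.norm(mvs, axis=1)
    mag_t = mags.sum() / len(mv_fwd)            # average magnitude Mag(t)
    angles = np.arctan2(mvs[:, 1], mvs[:, 0])   # motion directions
    hist, _ = np.histogram(angles, bins=n_bins, range=(-np.pi, np.pi))
    alpha_t = hist.max() / max(hist.sum(), 1)   # dominant-direction share
    return mag_t * alpha_t

def triangle_key_frames(P):
    """Key frames at the turning points from motion acceleration to
    deceleration, i.e. the local maxima of the PME sequence."""
    return [t for t in range(1, len(P) - 1) if P[t-1] < P[t] >= P[t+1]]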
3) Visual Frame Descriptors
G. Ciocca and R. Schettini [5] introduced an algorithm in which three visual features (a color histogram, wavelet statistics and an edge direction histogram) are used for the selection of key frames. Similarity measures are computed for each descriptor and combined to form a frame difference measure. The distance between two color histograms H_x and H_y using the intersection measure is given in [5] as

\[ d_H(H_x, H_y) = 1 - \sum_{j} \min\big(H_x(j), H_y(j)\big). \]

As defined in [5], the difference between two edge direction histograms D_x and D_y is computed using the Euclidean distance, as in the case of two wavelet statistics vectors W_x and W_y:

\[ d_D(D_x, D_y) = \sqrt{\sum_{j}\big(D_x(j) - D_y(j)\big)^2}, \qquad d_W(W_x, W_y) = \sqrt{\sum_{j}\big(W_x(j) - W_y(j)\big)^2}. \]

These differences are combined to form the final frame difference measure, defined in [5] as

\[ d_{HWD} = d_H + d_W + d_D. \]
These difference values are used to construct a curve of the cumulative frame differences, which describes how the visual content of the frames changes over the entire shot [5]. The high-curvature points of the curve are determined, and key frames are extracted using pairs of consecutive curvature points; a sketch of this computation follows the list below. The following quality measures are used to evaluate the video summary [5]:
1. Fidelity: the fidelity measure is defined as a semi-Hausdorff distance.
2. Shot Reconstruction Degree (SRD): using a suitable frame interpolation algorithm, one should be able to reconstruct the whole sequence from the set of key frames.
3. Compression Ratio (CR): the ratio of the number of key frames to the total number of frames in the video sequence.
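A minimal sketch of the frame difference computation and cumulative curve, assuming normalized color histograms and treating the equal-weight combination as an assumption (the exact weighting is given in [5]):

import numpy as np

def frame_difference(hx, hy, wx, wy, dx, dy):
    """Combined HWD difference between two frames: histogram
    intersection for the color histograms, Euclidean distance for the
    wavelet statistics and edge direction histograms."""
    d_h = 1.0 - np.minimum(hx, hy).sum()   # histograms assumed normalized
    d_w = np.linalg.norm(wx - wy)          # wavelet statistics
    d_d = np.linalg.norm(dx - dy)          # edge direction histograms
    return d_h + d_w + d_d

def cumulative_curve(diffs):
    """Cumulative frame-difference curve; key frames are read off
    around its high-curvature points."""
    return np.cumsum(diffs)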
4) Motion Attention Model
I. C. Chang et al. [6] used this model to detect shots. In this model, shots are detected using the color distribution and the edge covering ratio, which increase the accuracy of shot detection. Key frames are extracted from each shot using the motion attention model: the first and last frames of every shot are taken as key frames, and the others are extracted by the motion attention model [3][6]. These key frames are then clustered, and a priority value is computed by estimating the motion energy and color variation of the shots. The motion energy TMA of shot i is defined in [6] as

\[ TMA_i = \sum_{t \in S_i} MA(t), \]

where the sum denotes the total of the motion attention values [6] of the frames of shot i. The energy motion change (EMC) is defined in [6] as

\[ EMC_i = \frac{C_i}{N_i}, \]

where C_i denotes the total number of frames with significant intensity variation in shot i and N_i is the shot's frame count. The priority value of a shot is defined in [6] as

\[ PV_i = w_1\,TMA_i + w_2\,EMC_i. \]

A higher PV value means that the shot is more important within its cluster, and that shot becomes the highlight of the cluster [3].
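A hedged sketch of the priority computation, with the normalizations and the weights w1, w2 as our assumptions ([6] defines the exact form):

def shot_priority(ma_sum, n_changed, n_frames, w1=0.5, w2=0.5):
    """Sketch of a shot's priority value: a weighted combination of its
    motion attention (TMA, here normalized per frame) and the share of
    frames with significant intensity variation (EMC)."""
    tma = ma_sum / n_frames       # motion attention per frame (assumption)
    emc = n_changed / n_frames    # energy motion change
    return w1 * tma + w2 * emc    # higher PV: shot highlights its cluster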
5) Multiple Visual Descriptor Features
Chitra A. D. et al. [7] used the same visual features as Ciocca [5] along with one additional feature, the weighted standard deviation. The grayscale image is subjected to an L-level discrete wavelet decomposition. At each level i (i = 1..L) there are LH, HL and HH detail images, and there is an approximation image at level L. The standard deviations of all these images are calculated, and the weighted standard deviation feature vector is defined in [7] as

\[ \sigma_w = \big(w_1\sigma_{LH_1}, w_1\sigma_{HL_1}, w_1\sigma_{HH_1}, \ldots, w_L\sigma_{LH_L}, w_L\sigma_{HL_L}, w_L\sigma_{HH_L}, w_L\sigma_{A_L}\big), \]

where each σ denotes the standard deviation of the corresponding detail or approximation image and w_i is the weight assigned to level i.
The key frames are selected by constructing a cumulative graph of the frame difference values. Frames at points of sharp slope indicate significant visual change; hence they are selected and included in the final summary. In addition, the frames corresponding to the midpoints between each pair of consecutive curvature points are taken as representative frames [7]. The algorithm was tested on educational video sequences and compared with the I-frames obtained by CueVideo, and the method was found to give better results [3].
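A short sketch of the weighted standard deviation feature using PyWavelets, with the Haar wavelet and the per-level weights as assumptions ([7] specifies the actual wavelet and weighting):

import numpy as np
import pywt  # PyWavelets

def weighted_std_vector(gray, level=3, weights=None):
    """Standard deviations of all subbands of an L-level 2-D wavelet
    decomposition of a grayscale image, scaled by per-level weights
    (indexed from coarsest to finest level)."""
    coeffs = pywt.wavedec2(gray, 'haar', level=level)
    weights = weights or [1.0] * level
    feats = [np.std(coeffs[0])]                       # approximation at level L
    for lvl, (cH, cV, cD) in enumerate(coeffs[1:]):   # coarsest -> finest
        w = weights[lvl]
        feats += [w * np.std(cH), w * np.std(cV), w * np.std(cD)]  # LH, HL, HH
    return np.asarray(feats)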
6) Motion Focusing
Congcong Li et al. [8] proposed the motion focusing method, which extracts key frames and generates a summary for lane surveillance videos. The method focuses on one constant-speed motion and aligns the video frames by fixing the focused motion into a static situation. A summary is generated containing all moving objects, embedded with spatial and motion information. The method begins with background subtraction to extract the moving foreground of each frame [3][8]. Background subtraction is combined with min-cut to get a smooth segmentation of the foreground objects. A labeling function f labels each pixel i as foreground (f_i = 1) or background (f_i = 0). The labeling problem is solved by minimizing the Gibbs energy, defined in [8] as

\[ E(f) = \sum_{i} E_1(f_i) + \lambda \sum_{(i,j)\in\mathcal{N}} E_2(f_i, f_j), \]

where E_1 is the per-pixel data term and E_2 is the smoothness term over neighboring pixel pairs [8]. The data term depends on d_i = |I_i − μ_i|, the difference between the current frame and the Gaussian mean for the pixel, compared against the pixel's thresholds T_1, T_2, T_3 [3][8]. The key frame extraction and summary image generation are done through two steps of mosaicing. The initial mosaicing is done with the foreground segmentation results. A greedy search method is used to find the key frames that most increase the foreground coverage of the mosaic foreground image [8]. A second mosaicing step then combines the key frames to generate the summarization image. The summary not only represents all objects in the focused motion but also preserves their temporal and spatial relations.
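The greedy key-frame search can be sketched as follows, assuming per-frame boolean foreground masks already aligned to the mosaic coordinate frame (the alignment itself comes from the focused constant-speed motion in [8]):

import numpy as np

def greedy_coverage_key_frames(fg_masks, k):
    """Repeatedly pick the frame whose foreground adds the most
    uncovered pixels to the mosaic foreground image; stop after k
    frames or when no frame adds coverage. `fg_masks` is an
    (n_frames, H, W) boolean array."""
    covered = np.zeros(fg_masks.shape[1:], dtype=bool)
    chosen = []
    for _ in range(k):
        gains = [(m & ~covered).sum() for m in fg_masks]
        best = int(np.argmax(gains))
        if gains[best] == 0:       # nothing new left to cover
            break
        chosen.append(best)
        covered |= fg_masks[best]
    return chosen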
7) Camera Motion and Object Motion
Jiebo Luo et al. [9] proposed a method to extract key frames from personal video clips. The key frames are extracted from the consumer video space, where the content is unconstrained and lacks pre-imposed structure [3]. The key frame extraction framework is based on camera motion and object motion. The video is segmented using camera-motion-based classes: pan, zoom in, zoom out and fixed. Key frames are selected from each of these segments. For the zoom-in class, the focus is on the end of the motion, when the object is closest [3][9]. In the case of a pan, the selection is based on a local motion descriptor and global translation parameters. For a fixed segment, the middle frame of the segment, or the frame where the object motion is maximal, is chosen [9]. The final key frames are selected from each of these segments based on a confidence value formulated for the zoom, pan and steady segments. The global confidence function is given in [9] as

\[ C = w_1 P + w_2 G, \qquad w_1 + w_2 = 1, \]

where P is a probability function of the cumulative camera displacements and G = N(μ, σ) is a Gaussian function, with μ being the location of the local minimum and σ the standard deviation computed from the translation curve [9].
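As a loose sketch of ranking by the global confidence C = w1*P + w2*G, with stand-in choices for P and the Gaussian parameters (the exact functions are defined in [9]):

import numpy as np

def global_confidence(displacement, w1=0.5, mu=0.0, sigma=1.0):
    """Confidence C = w1*P + w2*G with w1 + w2 = 1. Here P is a
    stand-in normalized cumulative-displacement score and G a Gaussian
    centered at the translation-curve minimum mu; both are assumptions."""
    w2 = 1.0 - w1
    p = np.clip(np.abs(displacement), 0.0, 1.0)   # assumed normalized P
    g = np.exp(-((displacement - mu) ** 2) / (2.0 * sigma ** 2))
    return w1 * p + w2 * g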
CONCLUSION
Video summarization plays an important role in many video applications. A survey of various methods for key-frame-based video summarization has been carried out. However, there is no universally accepted method for video summarization that gives better output on all kinds of videos; the summarization viewpoint and perspective are often application-dependent. Semantic understanding and its representation are the biggest issues to be addressed in accommodating the diversity of videos and of human perception. Key frames are extracted depending on the changes in the content of the video, and since the key frames are subsequently processed for summarization, important content must not be missed.
Figures at a glance
Figure 1
Figure 2
Figure 3
References
- Sony, A.; Ajith, K.; Thomas, K.; Thomas, T.; Deepa, P. L., "Video summarization by clustering using Euclidean distance," Signal Processing, Communication, Computing and Networking Technologies (ICSCCN), 2011 International Conference on, pp. 642-646, 21-22 July 2011.
- Truong, B. T.; Venkatesh, S., "Video abstraction: A systematic review and classification," ACM Trans. Multimedia Comput. Commun. Appl., vol. 3, no. 1, Article 3, Feb. 2007.
- Sujatha, C.; Mudenagudi, U., "A study on keyframe extraction methods for video summary," Computational Intelligence and Communication Networks (CICN), 2011 International Conference on, pp. 73-77, 7-9 Oct. 2011.
- Liu, T.; Zhang, H. J.; Qi, F., "A novel video key-frame-extraction algorithm based on perceived motion energy model," IEEE Transactions on Circuits and Systems for Video Technology, vol. 13, no. 10, pp. 1006-1013, Oct. 2003.
- Ciocca, G.; Schettini, R., "An innovative algorithm for key frame extraction in video summarization," Journal of Real-Time Image Processing (Springer), vol. 1, no. 1, pp. 69-88, 2006.
- Chang, I. C.; Cheng, K. Y., "Content-selection based video summarization," IEEE International Conference on Consumer Electronics, Las Vegas Convention Center, USA, pp. 11-14, Jan. 2007.
- Dhawale, Chitra A.; Jain, S., "A novel approach towards key frame selection for video summarization," Asian Journal of Information Technology, vol. 7, no. 4, pp. 133-137, 2008.
- Congcong, L.; Wu, Y. T.; Shiaw-Shian, Y.; Chen, T., "Motion-focusing key frame extraction and video summarization for lane surveillance system," ICIP 2009, pp. 4329-4332.
- Luo, J.; Papin, C.; Costello, K., "Towards extracting semantically meaningful key frames from personal video clips: from humans to computers," IEEE Transactions on Circuits and Systems for Video Technology, vol. 19, no. 2, Feb. 2009.
- Vasudevan, Nalini; Jain, Arjun; Agrawal, Himanshu, "Iterative Image Based Video Summarization by Node Segmentation".
- Sabbar, W.; Chergui, A.; Bekkhoucha, A., "Video summarization using shot segmentation and local motion estimation," Innovative Computing Technology (INTECH), 2012 Second International Conference on, pp. 190-193, 18-20 Sept. 2012.