We propose an empirical computational model that generates an interpretation of a video shot based on our principle of perceptual prominence, which captures the key aspects of mise-en-scène required for interpreting a video scene. We present a novel approach for applying perceptual grouping principles to the spatio-temporal domain of video. Our spatio-temporal perceptual grouping scheme operates on blob tracks and uses a specified spatio-temporal coherence model. A high-level semantic interpretation of scenes is then derived from the mise-en-scène features and the perceptual prominence computed for the perceptual clusters. © Springer-Verlag Berlin Heidelberg 2006.