The article deals with the problems of motion detection, object recognition, and scene description using deep learning in the framework of granular computing and Z-numbers. Since deep learning is computationally intensive, whereas granular computing, on the other hand, leads to computation gain, a judicious integration of their merits is made so as to make the learning mechanism computationally efficient. Further, it is shown how the concept of z-numbers can be used to quantify the abstraction of semantic information in interpreting a scene, where subjectivity is of major concern, through recognition of its constituting objects. The system, thus developed, involves recognition of both static objects in the background and moving objects in foreground separately. Rough set theoretic granular computing is adopted where rough lower and upper approximations are used in defining object and background models. During deep learning, instead of scanning the entire image pixel by pixel in the convolution layer, we scan only the representative pixel of each granule. This results in a significant gain in computation time. Arbitrary-shaped and sized granules, as expected, perform better than regular-shaped rectangular granules or fixed-sized granules. The method of tracking is able to deal efficiently with various challenging cases, e.g., tracking partially overlapped objects and suddenly appeared objects. Overall, the granulated system shows a balanced trade-off between speed and accuracy as compared to pixel level learning in tracking and recognition. The concept of using Z-numbers, in providing a granulated linguistic description of a scene, is unique. This gives a more natural interpretation of object recognition in terms of certainty toward scene understanding. © 2019, Springer-Verlag London Ltd., part of Springer Nature.