Human Activity Recognition (HAR) has long been a compelling problem in computer vision. Our focus is trimmed activity recognition: identifying the class of human activity in a video that is temporally trimmed to contain only the periods in which the activity occurs. In recent years there has been a transition from handcrafted features to deep convolutional neural networks that operate on raw video to extract features and classify human activities. 3D convolutional neural networks learn features along both the temporal and spatial dimensions and are powerful at finding correlations in signals containing spatiotemporal information; they have been highly successful in activity recognition. We examine the shortcomings of a 3D-CNN architecture and propose ensembling it with a 2D-CNN to overcome them, yielding significantly better activity recognition performance. © 2019 IEEE.
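The abstract does not specify how the 2D-CNN and 3D-CNN predictions are combined; one common approach to such ensembling is score-level fusion, averaging the class probabilities from the two streams. The sketch below is a minimal, hypothetical illustration of that idea (the function names, the weighting parameter, and the toy logits are assumptions, not the paper's method):

```python
import numpy as np

def softmax(x):
    # Numerically stable softmax over the last axis.
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def ensemble_predict(logits_3d, logits_2d, weight=0.5):
    """Score-level fusion of a 3D-CNN stream and a 2D-CNN stream.

    logits_3d, logits_2d: per-class scores of shape (num_classes,).
    weight: contribution of the 3D stream (hypothetical parameter).
    Returns the index of the predicted activity class.
    """
    probs = weight * softmax(logits_3d) + (1 - weight) * softmax(logits_2d)
    return int(np.argmax(probs))

# Toy example with 4 activity classes: the two streams disagree,
# and the fused probabilities decide the final label.
logits_3d = np.array([2.0, 0.5, 0.1, -1.0])  # 3D stream favors class 0
logits_2d = np.array([0.2, 1.8, 0.3, -0.5])  # 2D stream favors class 1
print(ensemble_predict(logits_3d, logits_2d))  # → 0 (3D stream is more confident)
```

With equal weights, the stream whose softmax distribution is more peaked dominates ties; in practice the weight would be tuned on validation data.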