A number of hand-designed features (SIFT, SURF, or DAISY) have become standard components in visual recognition and multimedia challenges. The power of these features lies in their designed invariance to rotation, scaling, and translation. Recent trends in deep learning, however, have shown that data-driven feature learning outperforms designed features on some tasks, since learned features can capture the global structure of images (via multi-layer networks) or their inter-local structure (via convolutional networks). We argue that combining the two types of features can significantly improve visual object recognition performance. In this paper we propose a framework that uses sparse coding to fuse learned and designed features into descriptive codewords. Evaluations on Caltech-101 and 15 Scenes validate our argument, yielding better results than recent approaches.
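
To make the described pipeline concrete, below is a minimal sketch, assuming scikit-learn, of one way a fused-feature sparse-coding pipeline of this kind could be wired together. It is not the paper's exact method: the descriptor arrays, their dimensions, the dictionary size, and the max-pooling step are all illustrative assumptions, and random placeholders stand in for real SIFT descriptors and network activations so the script runs end to end.

```python
# A hypothetical sketch of fusing designed and learned local features,
# encoding them with a sparsely-coded dictionary, and pooling into
# image-level codeword vectors for a linear classifier.
import numpy as np
from sklearn.decomposition import MiniBatchDictionaryLearning
from sklearn.preprocessing import normalize
from sklearn.svm import LinearSVC

rng = np.random.default_rng(0)

n_images, patches_per_image = 40, 30
d_designed, d_learned = 128, 64   # e.g. SIFT dim vs. learned-feature dim (assumed)

# Placeholder local descriptors, one row per patch (random stand-ins).
designed = rng.standard_normal((n_images * patches_per_image, d_designed))
learned = rng.standard_normal((n_images * patches_per_image, d_learned))

# Fuse: L2-normalize each modality, then concatenate per patch.
fused = np.hstack([normalize(designed), normalize(learned)])

# Learn a codebook over the fused descriptors and sparse-code each patch.
dico = MiniBatchDictionaryLearning(n_components=256, alpha=1.0, random_state=0)
codes = dico.fit(fused).transform(fused)   # sparse codes, one row per patch

# Max-pool the codes over each image's patches into one image-level vector.
image_vecs = codes.reshape(n_images, patches_per_image, -1).max(axis=1)

# Classify the pooled codeword vectors with a linear SVM (random labels here).
labels = rng.integers(0, 5, size=n_images)
clf = LinearSVC().fit(image_vecs, labels)
print("train accuracy:", clf.score(image_vecs, labels))
```

In a real system, the random arrays would be replaced by dense SIFT (or SURF/DAISY) descriptors and activations from a trained network extracted at the same patch locations, so that each dictionary atom captures a joint designed-plus-learned pattern.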