{"title": "Group Sparse Coding", "book": "Advances in Neural Information Processing Systems", "page_first": 82, "page_last": 89, "abstract": "Bag-of-words document representations are often used in text, image and video processing. While it is relatively easy to determine a suitable word dictionary for text documents, there is no simple mapping from raw images or videos to dictionary terms. The classical approach builds a dictionary using vector quantization over a large set of useful visual descriptors extracted from a training set, and uses a nearest-neighbor algorithm to count the number of occurrences of each dictionary word in documents to be encoded. More robust approaches have been proposed recently that represent each visual descriptor as a sparse weighted combination of dictionary words. While favoring a sparse representation at the level of visual descriptors, those methods however do not ensure that images have sparse representation. In this work, we use mixed-norm regularization to achieve sparsity at the image level as well as a small overall dictionary. This approach can also be used to encourage using the same dictionary words for all the images in a class, providing a discriminative signal in the construction of image representations. Experimental results on a benchmark image classification dataset show that when compact image or dictionary representations are needed for computational efficiency, the proposed approach yields better mean average precision in classification.", "full_text": "Group Sparse Coding\n\nSamy Bengio\n\nGoogle\n\nMountain View, CA\n\nFernando Pereira\n\nGoogle\n\nMountain View, CA\n\nYoram Singer\n\nGoogle\n\nMountain View, CA\n\nbengio@google.com\n\npereira@google.com\n\nsinger@google.com\n\nDennis Strelow\n\nGoogle\n\nMountain View, CA\n\nstrelow@google.com\n\nAbstract\n\nBag-of-words document representations are often used in text, image and video\nprocessing. 
While it is relatively easy to determine a suitable word dictionary for\ntext documents, there is no simple mapping from raw images or videos to dictio-\nnary terms. The classical approach builds a dictionary using vector quantization\nover a large set of useful visual descriptors extracted from a training set, and uses a\nnearest-neighbor algorithm to count the number of occurrences of each dictionary\nword in documents to be encoded. More robust approaches have been proposed\nrecently that represent each visual descriptor as a sparse weighted combination of\ndictionary words. While favoring a sparse representation at the level of visual de-\nscriptors, those methods however do not ensure that images have sparse represen-\ntation. In this work, we use mixed-norm regularization to achieve sparsity at the\nimage level as well as a small overall dictionary. This approach can also be used to\nencourage using the same dictionary words for all the images in a class, providing\na discriminative signal in the construction of image representations. Experimen-\ntal results on a benchmark image classi\ufb01cation dataset show that when compact\nimage or dictionary representations are needed for computational ef\ufb01ciency, the\nproposed approach yields better mean average precision in classi\ufb01cation.\n\n1\n\nIntroduction\n\nBag-of-words document representations are widely used in text, image, and video processing [14, 1].\nThose representations abstract from spatial and temporal order to encode a document as a vector of\nthe numbers of occurrences in the document of descriptors from a suitable dictionary. For text\ndocuments, the dictionary might consist of all the words or of all the n-grams of a certain minimum\nfrequency in the document collection [1].\n\nFor images or videos, however, there is no simple mapping from the raw document to descriptor\ncounts. 
Instead, visual descriptors must first be extracted and then represented in terms of a carefully constructed dictionary. We will not discuss further here the intricate processes of identifying useful visual descriptors, such as color, texture, angles, and shapes [14], and of measuring them at appropriate document locations, such as on regular grids, on special interest points, or at multiple scales [6].

For dictionary construction, the standard approach in computer vision is to use some unsupervised vector quantization (VQ) technique, often k-means clustering [14], to create the dictionary. A new image is then represented by a vector indexed by dictionary elements (codewords), which for element d counts the number of visual descriptors in the image whose closest codeword is d. VQ representations are maximally sparse per descriptor occurrence since they pick a single codeword for each occurrence, but they may not be sparse for the image as a whole; furthermore, such representations are not that robust with respect to descriptor variability.

Sparse representations have obvious computational benefits, saving both processing time in handling visual descriptors and space in storing encoded images. To alleviate the brittleness of VQ representations, several studies proposed representation schemes where each visual descriptor is encoded as a weighted sum of dictionary elements, where the encoding optimizes a tradeoff between reconstruction error and the $\ell_1$ norm of the reconstruction weights [3, 5, 7, 8, 9, 16]. 
These techniques promote sparsity in determining a small set of codewords from the dictionary that can be used to efficiently represent each visual descriptor of each image [13].

However, those approaches consider each visual descriptor in the image as a separate coding problem and do not take into account the fact that descriptor coding is just an intermediate step in creating a bag-of-codewords representation for the whole image. Thus, sparse coding of each visual descriptor does not guarantee sparse coding of the whole image. This might prevent the use of such methods in real large-scale applications that are constrained by either time or space resources. In this study, we propose and evaluate mixed-norm regularizers [12, 10, 2] that take into account the structure of the bags of visual descriptors present in images. Using this approach, we can for example specify an encoder that exploits the fact that once a codeword has been selected to help represent one of the visual descriptors of an image, it may as well be used to represent other visual descriptors of the same image without much additional regularization cost.

Furthermore, while images are represented as bags, the same idea could be used for sets of images, such as all the images from a given category. In this case, mixed regularization can be used to specify that when a codeword has been selected to help represent one of the visual descriptors of an image of a given category, it could as well be used to represent other visual descriptors of any image of the same category at no additional regularization cost. 
This form of regularization thus promotes the use of a small subset of codewords for each category that could be different from category to category, thus including an indirect discriminative signal in code construction.

Mixed regularization can be applied at two levels: for image encoding, which can be expressed as a convex optimization problem, and for dictionary learning, using an alternating minimization procedure. Dictionary regularization promotes a small dictionary size directly, instead of indirectly through the sparse encoding step.

The paper is organized as follows: Sec. 2 introduces the notation used in the rest of the paper, and summarizes the technical approach. Sec. 3 describes and solves the convex optimization problem for mixed-regularization encoding. Sec. 4 extends the technique to learn the dictionary by alternating optimization. Finally, Secs. 5 and 6 present the experimental setting and results on a well-known image database.

2 Problem Statement

We denote scalars with lower-case letters and vectors with bold lower-case letters such as v. We assume that the instance space is $\mathbb{R}^n$ endowed with the standard inner product between two vectors u and v, $u \cdot v = \sum_{j=1}^{n} u_j v_j$. We also use the standard $\ell_p$ norms $\|\cdot\|_p$ over $\mathbb{R}^n$ with $p \in \{1, 2, \infty\}$. We often make use of the fact that $u \cdot u = \|u\|^2$, where as usual we omit the norm subscript for $p = 2$.

Our main goal is to encode effectively groups of instances in terms of a set of dictionary codewords $D = \{d_j\}_{j=1}^{|D|}$. For example, if instances are image patches, each group may be the set of patches in a particular image, and each codeword may represent some kind of average patch. The m'th group is denoted $G_m$, where $G_m = \{x_{m,i}\}_{i=1}^{|G_m|}$ and each $x_{m,i} \in \mathbb{R}^n$ is an instance. 
When discussing operations on a single group, we use G for the group in discussion and denote by $x_i$ its i'th instance.

Given D and G, our first subgoal, encoding, is to minimize a tradeoff between the reconstruction error for G in terms of D, and a suitable mixed norm for the matrix of reconstruction weights that express each $x_i$ as a positive linear combination of $d_j \in D$. The tradeoff between accurate reconstruction and compact encoding is governed through a regularization parameter $\lambda$.

Our second subgoal, learning, is to estimate a good dictionary D given a set of training groups $\{G_m\}_{m=1}^{n}$. We achieve these goals by alternating between (i) fixing the dictionary to find reconstruction weights that minimize the sum of encoding objectives for all groups, and (ii) fixing the reconstruction weights for all groups to find the dictionary that minimizes a tradeoff between the sum of group encoding objectives and the mixed norm of the dictionary.

3 Group Coding

To encode jointly all the instances in a group G with dictionary D, we solve the following convex optimization problem:

$$A^\star = \arg\min_A Q(A, G, D), \quad\text{where}\quad Q(A, G, D) = \frac{1}{2} \sum_{i \in G} \Big\| x_i - \sum_{j=1}^{|D|} \alpha_j^i d_j \Big\|^2 + \lambda \sum_{j=1}^{|D|} \|\alpha_j\|_p \quad\text{and}\quad \alpha_j^i \geq 0 \;\; \forall i, j. \qquad (1)$$

The reconstruction matrix $A = \{\alpha_j\}_{j=1}^{|D|}$ consists of non-negative vectors $\alpha_j = (\alpha_j^1, \ldots, \alpha_j^{|G|})$ specifying the contribution of $d_j$ to each instance. The second term of the objective weighs the mixed $\ell_1/\ell_p$ norm of A, which measures reconstruction complexity, with the regularization parameter $\lambda$ that balances reconstruction quality (the first term) and reconstruction complexity. The problem of Eq. (1) can be solved by coordinate descent. 
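To make the objective concrete, the following minimal NumPy sketch (our own illustration under assumed array shapes, not the authors' code) evaluates $Q(A, G, D)$ of Eq. (1) for a given weight matrix:

```python
import numpy as np

def group_coding_objective(X, D, A, lam, p=2):
    """Eq. (1): squared reconstruction error of a group plus the
    mixed l1/lp norm of the non-negative reconstruction weights.

    X : (num_instances, n)  rows are the instances x_i of group G
    D : (num_codewords, n)  rows are the dictionary words d_j
    A : (num_codewords, num_instances)  entries are alpha_j^i >= 0
    """
    residual = X - A.T @ D                 # x_i - sum_j alpha_j^i d_j
    reconstruction = 0.5 * np.sum(residual ** 2)
    mixed_norm = np.sum(np.linalg.norm(A, ord=p, axis=1))  # sum_j ||alpha_j||_p
    return reconstruction + lam * mixed_norm
```

With p = 2 this is the $\ell_1/\ell_2$ mixed norm used throughout the paper; the coordinate descent derived next minimizes this objective one row $\alpha_r$ at a time.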
Leaving all indices intact except for index r, omitting fixed arguments of the objective, and denoting by $c_1$ and $c_2$ terms which do not depend on $\alpha_r$, we obtain the following reduced objective:

$$Q(\alpha_r) = \frac{1}{2} \sum_{i \in G} \Big\| x_i - \sum_{j \neq r} \alpha_j^i d_j - \alpha_r^i d_r \Big\|^2 + \lambda \|\alpha_r\|_p + c_1 = \sum_{i \in G} \Big( \sum_{j \neq r} \alpha_j^i \alpha_r^i (d_j \cdot d_r) - \alpha_r^i (x_i \cdot d_r) + \frac{1}{2} (\alpha_r^i)^2 \|d_r\|^2 \Big) + \lambda \|\alpha_r\|_p + c_2. \qquad (2)$$

We next show how to find the optimum $\alpha_r$ for $p = 1$ and $p = 2$. Let $\tilde{Q}$ be just the reconstruction term of the objective. Its partial derivatives with respect to each $\alpha_r^i$ are

$$\frac{\partial}{\partial \alpha_r^i} \tilde{Q} = \sum_{j \neq r} \alpha_j^i (d_j \cdot d_r) - x_i \cdot d_r + \alpha_r^i \|d_r\|^2. \qquad (3)$$

Let us make the following abbreviation for a given index r,

$$\mu_i = x_i \cdot d_r - \sum_{j \neq r} \alpha_j^i (d_j \cdot d_r). \qquad (4)$$

It is clear that if $\mu_i \leq 0$ then the optimum for $\alpha_r^i$ is zero. In the derivation below we therefore employ $\mu_i^+ = [\mu_i]_+$ where $[z]_+ = \max\{0, z\}$. Next we derive the optimal solution for each of the norms we consider, starting with $p = 1$. For $p = 1$ the objective function is separable and we get the following sub-gradient condition for optimality,

$$0 \in -\mu_i^+ + \alpha_r^i \|d_r\|^2 + \lambda \frac{\partial}{\partial \alpha_r^i} |\alpha_r^i| \;\Rightarrow\; \alpha_r^i \in \frac{\mu_i^+ - [0, \lambda]}{\|d_r\|^2}. \qquad (5)$$

Since $\alpha_r^i \geq 0$, the above subgradient condition for optimality implies that $\alpha_r^i = 0$ when $\mu_i^+ \leq \lambda$ and otherwise $\alpha_r^i = (\mu_i^+ - \lambda)/\|d_r\|^2$.

The objective function is not separable when $p = 2$. In this case we need to examine the entire set of values $\{\mu_i^+\}$. We denote by $\mu^+$ the vector whose i'th value is $\mu_i^+$. 
Assume for now that the optimal solution has a non-zero norm, $\|\alpha_r\|_2 > 0$. In this case, the gradient of $Q(\alpha_r)$ with an $\ell_2$ regularization term is

$$\|d_r\|^2 \alpha_r - \mu^+ + \lambda \frac{\alpha_r}{\|\alpha_r\|} \,.$$

At the optimum this vector must be zero, so after rearranging terms we obtain

$$\alpha_r = \Big( \|d_r\|^2 + \frac{\lambda}{\|\alpha_r\|} \Big)^{-1} \mu^+. \qquad (6)$$

Therefore, the vector $\alpha_r$ is in the same direction as $\mu^+$, which means that we can simply write $\alpha_r = s\,\mu^+$ where s is a non-negative scalar. We thus can rewrite Eq. (6) solely as a function of the scaling parameter s, which implies that

$$s\,\mu^+ = \Big( \|d_r\|^2 + \frac{\lambda}{s \|\mu^+\|} \Big)^{-1} \mu^+, \qquad s = \frac{1}{\|d_r\|^2} \Big( 1 - \frac{\lambda}{\|\mu^+\|} \Big). \qquad (7)$$

We now revisit the assumption that the norm of the optimal solution is greater than zero. Since s cannot be negative, the above expression also provides the condition for obtaining a zero vector for $\alpha_r$: the term $1 - \lambda/\|\mu^+\|$ must be positive. Thus, we get that $\alpha_r = 0$ if $\|\mu^+\| \leq \lambda$ and otherwise $\alpha_r = s\,\mu^+$ where s is defined in Eq. (7).

Once the optimal group reconstruction matrix A is found, we compress the matrix into a single vector. This vector is of fixed dimension and does not depend on the number of instances that constitute the group. To do so we simply take the p-norm of each $\alpha_j$, thus yielding a |D| dimensional vector. 
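The closed-form coordinate updates of Eqs. (4)-(7) can be sketched as follows (a NumPy sketch under our own naming conventions; an illustration, not the authors' implementation):

```python
import numpy as np

def update_alpha_r(X, D, A, r, lam, p=2):
    """One coordinate-descent step on the row alpha_r.

    X : (num_instances, n), D : (num_codewords, n),
    A : (num_codewords, num_instances). Returns the new alpha_r row.
    """
    d_r = D[r]
    dr_sq = d_r @ d_r
    # Eq. (4): mu_i = x_i . d_r - sum_{j != r} alpha_j^i (d_j . d_r)
    mu = X @ d_r - (A.T @ (D @ d_r) - A[r] * dr_sq)
    mu_plus = np.maximum(mu, 0.0)
    if p == 1:
        # Eq. (5): per-coordinate soft threshold at lambda
        return np.maximum(mu_plus - lam, 0.0) / dr_sq
    # p == 2, Eq. (7): the whole row is shrunk toward zero jointly
    norm_mu = np.linalg.norm(mu_plus)
    if norm_mu <= lam:
        return np.zeros_like(mu_plus)
    return (1.0 - lam / norm_mu) / dr_sq * mu_plus
```

Sweeping r over all codewords until convergence minimizes the convex objective of Eq. (1) for one group.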
Since we use mixed-norms which are sparsity promoting, in particular the $\ell_1/\ell_2$ mixed-norm, the resulting vector is likely to be sparse, as we show experimentally in Section 6.

Since visual descriptors and dictionary elements are only accessed through inner products in the above method, it could be easily generalized to work with Mercer kernels instead.

4 Dictionary Learning

Now that we know how to achieve optimal reconstruction for a given dictionary, we examine how to learn a good dictionary, that is, a dictionary that balances reconstruction error, reconstruction complexity, and overall complexity relative to the given training set. In particular, we seek a learning method that facilitates both induction of new dictionary words and the removal of dictionary words with low predictive power. To achieve this goal, we will apply $\ell_1/\ell_2$ regularization, controlled by a new hyperparameter $\gamma$, to dictionary words. For this approach to work, we assume that instances have been mean-subtracted so that the zero vector 0 is the (uninformative) mean of the data and regularization towards 0 is equivalent to removing words that do not contribute much to compact representation of groups.

Let $G = \{G_1, \ldots, G_n\}$ be a set of groups and $A = \{A_1, \ldots, A_n\}$ the corresponding reconstruction coefficients relative to dictionary D. Then, the following objective meets the above requirements:

$$Q(A, D) = \sum_{m=1}^{n} Q(A_m, G_m, D) + \gamma \sum_{k=1}^{|D|} \|d_k\|_p \quad\text{s.t.}\quad \alpha_{m,j}^i \geq 0 \;\; \forall i, j, m, \qquad (8)$$

where the single group objective $Q(A_m, G_m, D)$ is as in Eq. (1). In our application we set $p = 2$ as the norm penalty of the dictionary words. For fixed A, the objective above is convex in D. Moreover, the same coordinate descent technique described above for finding the optimum reconstruction weights can be used again here after simple algebraic manipulations. 
Define the following auxiliary variables:

$$v_r = \sum_m \sum_i \alpha_{m,r}^i \, x_{m,i} \qquad\text{and}\qquad \nu_{j,k} = \sum_m \sum_i \alpha_{m,j}^i \, \alpha_{m,k}^i. \qquad (9)$$

Then, we can express $d_r$ compactly as follows. As before, assume that $\|d_r\| > 0$. Calculating the gradient with respect to each $d_r$ and equating it to zero, we obtain

$$\sum_m \sum_{i \in G_m} \Big( \sum_{j \neq r} \alpha_{m,j}^i \alpha_{m,r}^i \, d_j + (\alpha_{m,r}^i)^2 d_r - \alpha_{m,r}^i \, x_{m,i} \Big) + \gamma \frac{d_r}{\|d_r\|} = 0.$$

Swapping the sums over m and i with the sum over j, using the auxiliary variables, and noting that $d_j$ depends neither on m nor on i, we obtain

$$\sum_{j \neq r} \nu_{j,r} \, d_j + \nu_{r,r} \, d_r - v_r + \gamma \frac{d_r}{\|d_r\|} = 0. \qquad (10)$$

Similarly to the way we solved for $\alpha_r$, we now define the vector $u_r = v_r - \sum_{j \neq r} \nu_{j,r} d_j$ to get the following iterate for $d_r$:

$$d_r = \nu_{r,r}^{-1} \Big[ 1 - \frac{\gamma}{\|u_r\|} \Big]_+ u_r, \qquad (11)$$

where, as above, we incorporated the case $d_r = 0$ by applying the operator $[\cdot]_+$ to the term $1 - \gamma/\|u_r\|$. The form of the solution implies that we can eliminate $d_r$, as it becomes 0, whenever the norm of the residual vector $u_r$ is smaller than $\gamma$. 
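Analogously, the dictionary-word update of Eqs. (9)-(11) can be sketched as follows (again our own illustrative code, with $v_r$ and $\nu_{j,k}$ assumed precomputed as in Eq. (9)):

```python
import numpy as np

def update_dictionary_word(D, V, Nu, r, gamma):
    """Closed-form update of dictionary word d_r, Eq. (11).

    D  : (num_codewords, n) current dictionary
    V  : (num_codewords, n) rows v_r of Eq. (9)
    Nu : (num_codewords, num_codewords) matrix of nu_{j,k} of Eq. (9)
    Returns the new d_r; the zero vector prunes the word entirely.
    """
    # residual u_r = v_r - sum_{j != r} nu_{j,r} d_j
    u_r = V[r] - (Nu[:, r] @ D - Nu[r, r] * D[r])
    norm_u = np.linalg.norm(u_r)
    if norm_u <= gamma:
        return np.zeros_like(u_r)  # ||u_r|| <= gamma: eliminate d_r
    return (1.0 - gamma / norm_u) / Nu[r, r] * u_r
```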
Thus, the dictionary learning procedure\nnaturally facilitates the ability to remove dictionary words whose predictive power falls below the\nregularization parameter.\n\n5 Experimental Setting\n\nWe compare our approach to image coding with previous sparse coding methods by measuring their\nimpact on classi\ufb01cation performance on the PASCAL VOC (Visual Object Classes) 2007 dataset [4].\nThe VOC datasets contain images from 20 classes, including people, animals (bird), vehicles (aero-\nplane), and indoor objects (chair), and are considered natural, dif\ufb01cult images for classi\ufb01cation.\nThere are around 2500 training images, 2500 validation images and 5000 test images in total.\n\nFor each coding technique under consideration, we explore a range of values for the hyperparameters\n\u03bb and \u03b3.\nIn the past, many features have been used for VOC classi\ufb01cation, with bag-of-words\nhistograms of local descriptors like SIFT [6] being most popular. In our experiments, we extract\nlocal descriptors based on a regular grid for each image. The grid points are located at every seventh\npixel horizontally and vertically, which produces an average of 3234 descriptors per image. We\nused a custom local descriptor that collects Gabor wavelet responses at different orientations, spatial\nscales, and spatial offsets from the interest point. Four orientations (0\u25e6, 45\u25e6, 90\u25e6, 135\u25e6) and 27\n(scale, offset) combinations are used, for a total of 108 components. The 27 (scale, offset) pairs were\nchosen by optimizing a previous image recognition task, unrelated to this paper, using a genetic\nalgorithm. Tola et al. [15] independently described a descriptor that similarly uses responses at\ndifferent orientations, scales, and offsets (see their Figure 2). 
Overall, this descriptor is generally comparable to SIFT and results in similar performance.

To build an image feature vector from the descriptors, we thus investigate the following methods:

1. Build a bag-of-words histogram over hierarchical k-means codewords by looking up each descriptor in a hierarchical k-means tree [11]. We use branching factors of 6 to 13 and a depth of 3 for a total of between 216 and 2197 codewords. When used with multiple feature types, this method results in very good classification performance on the VOC task.

2. Jointly train a dictionary and encode each descriptor using an $\ell_1$ sparse coding approach with $\gamma = 0$, which was studied previously [5, 7, 9].

3. Jointly train a dictionary and encode sets of descriptors where each set corresponds to a single image, using $\ell_1/\ell_2$ group sparse coding, varying both $\gamma$ and $\lambda$.

4. Jointly train a dictionary and encode sets of descriptors where each set corresponds to all descriptors of all images of a single class, using $\ell_1/\ell_2$ sparse coding, varying both $\gamma$ and $\lambda$. Then, use $\ell_1/\ell_2$ sparse coding to encode the descriptors in individual images and obtain a single $\alpha$ vector per image.

As explained before, we normalized all descriptors to have zero mean so that regularizing dictionary words towards the zero vector implies dictionary sparsity.

In all cases, the initial dictionary used during training was obtained from the same hierarchical k-means tree, with a branching factor of 10 and depth 4 rather than 3 as used in the baseline method. This scheme yielded an initial dictionary of size 7873.

[Figure 1 here: mean average precision vs. dictionary size, comparing HKMeans; $\ell_1$ ($\gamma = 0$, $\lambda$ varied); $\ell_1/\ell_2$ ($\gamma = 0$, $\lambda$ varied, group = image); $\ell_1/\ell_2$ ($\gamma$ varied, $\lambda = 6.8\mathrm{e}{-5}$, group = image); $\ell_1/\ell_2$ ($\gamma = 0$, $\lambda$ varied, group = class); $\ell_1/\ell_2$ ($\gamma$ and $\lambda$ varied, group = class).]
Figure 1: Mean Average Precision on the 2007 PASCAL VOC database as a function of the size of the dictionary obtained by both $\ell_1$ and $\ell_1/\ell_2$ regularization approaches when varying $\lambda$ or $\gamma$. We show results where descriptors are grouped either by image or by class. The baseline system using hierarchical k-means is also shown.

To evaluate the impact of different coding methods on an important end-to-end task, image classification, we selected the VOC 2007 training set for classifier training, the VOC 2007 validation set for hyperparameter selection, and the VOC 2007 test set for evaluation. After the datasets are encoded with each of the methods being evaluated, a one-versus-all linear SVM is trained on the encoded training set for each of the 20 classes, and the best SVM hyperparameter C is chosen on the validation set. Class average precisions on the encoded test set are then averaged across the 20 classes to produce the mean average precision shown in our graphs.

6 Results and Discussion

In Figure 1 we compare the mean average precisions of the competing approaches as encoding hyperparameters are varied to control the overall dictionary size. For the $\ell_1$ approach, different dictionary sizes were obtained by tuning $\lambda$ while setting $\gamma = 0$. For the $\ell_1/\ell_2$ approach, since it was not possible to compare all possible combinations of $\lambda$ and $\gamma$, we first fixed $\gamma$ to be zero, so that it could be compared to the standard $\ell_1$ approach with the same setting. Then we fixed $\lambda$ to a value which proved to yield good results and varied $\gamma$. 
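For reference, the figure of merit just described can be sketched as follows; note this is a simplified, non-interpolated average precision (the official VOC 2007 evaluation uses an 11-point interpolated AP, so exact numbers differ):

```python
import numpy as np

def average_precision(scores, labels):
    """Non-interpolated AP for one class: rank by score, then average
    the precision measured at the rank of each positive example."""
    order = np.argsort(-scores)
    hits = labels[order] > 0
    precision_at_rank = np.cumsum(hits) / (np.arange(hits.size) + 1)
    return precision_at_rank[hits].mean()

def mean_average_precision(scores, labels):
    """Mean over classes (columns) of the per-class average precision."""
    return np.mean([average_precision(scores[:, c], labels[:, c])
                    for c in range(scores.shape[1])])
```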
As can be seen in Figure 1, when the dictionary is allowed to be very large, the pure $\ell_1$ approach yields the best performance. On the other hand, when the size of the dictionary matters, all the approaches based on $\ell_1/\ell_2$ regularization performed better than their $\ell_1$ counterpart. Even hierarchical k-means performed better than the pure $\ell_1$ approach in that case. The version of $\ell_1/\ell_2$ in which we allowed $\gamma$ to vary provided the best tradeoff between dictionary size and classification performance when descriptors were grouped per image, which was to be expected as $\gamma$ directly promotes sparse dictionaries. More interestingly, when grouping descriptors per class instead of per image, we get even better performance for small dictionary sizes by varying $\lambda$.

In Figure 2 we compare the mean average precisions of $\ell_1$ and $\ell_1/\ell_2$ regularization as average image size varies. When image size is constrained, which is often the case in large-scale applications, all

[Figure 2 here: mean average precision vs. average image size, with the same systems as in Figure 1.]

Figure 2: Mean Average Precision on the 2007 PASCAL VOC database as a function of the average size of each image as encoded using the trained dictionary obtained by both $\ell_1$ and $\ell_1/\ell_2$ regularization approaches when varying $\lambda$ and $\gamma$. We show results where descriptors are grouped either by image or by class. 
The baseline system using hierarchical k-means is also shown.

[Figure 3 here: two panels of per-word usage counts.]

Figure 3: Comparison of the dictionary words used to reconstruct the same image. A pure $\ell_1$ coding was used on the left, while a mixed $\ell_1/\ell_2$ encoding was used on the right plot. Each row represents the number of times each dictionary word was used in the reconstruction of the image.

the $\ell_1/\ell_2$ regularization choices yield better performance than $\ell_1$ regularization. Once again, $\ell_1$ regularization performed even worse than hierarchical k-means for small image sizes.

Figure 3 compares the usage of dictionary words to encode the same image, either using $\ell_1$ (on the left) or $\ell_1/\ell_2$ (on the right) regularization. Each graph shows the number of times a dictionary word (a row in the plot) was used in the reconstruction of the image. Clearly, $\ell_1$ regularization yields an overall sparser representation in terms of the total number of dictionary coefficients that are used. However, almost all of the resulting dictionary vectors are non-zero and used at least once in the coding process. As expected, with $\ell_1/\ell_2$ regularization, a dictionary word is either always used or never used, yielding a much more compact representation in terms of the total number of dictionary words that are used.

Overall, mixed-norm regularization yields better performance when the problem to be solved includes resource constraints, either time (a smaller dictionary yields faster image encoding) or space (one can store or convey more images when they take less space). 
They might thus be a good fit when a tradeoff between pure performance and resources is needed, as is often the case for large-scale applications or online settings.

Finally, grouping descriptors per class instead of per image during dictionary learning promotes the use of the same dictionary words for all images of the same class, hence yielding some form of weak discrimination which appears to help under space or time constraints.

Acknowledgments

We would like to thank John Duchi for numerous discussions and suggestions.

References

[1] R. Baeza-Yates and B. Ribeiro-Neto. Modern Information Retrieval. Addison Wesley, Harlow, England, 1999.

[2] J. Duchi and Y. Singer. Boosting with structural sparsity. In International Conference on Machine Learning (ICML), 2009.

[3] M. Elad and M. Aharon. Image denoising via sparse and redundant representations over learned dictionaries. IEEE Transactions on Image Processing, 15(12):3736-3745, 2006.

[4] M. Everingham, L. Van Gool, C. K. I. Williams, J. Winn, and A. Zisserman. The PASCAL Visual Object Classes Challenge 2007 (VOC2007) Results. http://www.pascal-network.org/challenges/VOC/voc2007/workshop/index.html.

[5] H. Lee, A. Battle, R. Raina, and A. Y. Ng. Efficient sparse coding algorithms. In Advances in Neural Information Processing Systems (NIPS), 2007.

[6] D. G. Lowe. Object recognition from local scale-invariant features. In International Conference on Computer Vision (ICCV), pages 1150-1157, 1999.

[7] J. Mairal, F. Bach, J. Ponce, and G. Sapiro. Online dictionary learning for sparse coding. In International Conference on Machine Learning (ICML), 2009.

[8] J. Mairal, M. Elad, and G. Sapiro. Sparse representation for color image restoration. IEEE Transactions on Image Processing, 17(1), 2008.

[9] J. Mairal, M. Leordeanu, F. Bach, M. Hebert, and J. Ponce. 
Discriminative sparse image models for class-specific edge detection and image interpretation. In European Conference on Computer Vision (ECCV), 2008.

[10] S. Negahban and M. Wainwright. Phase transitions for high-dimensional joint support recovery. In Advances in Neural Information Processing Systems 22, 2008.

[11] D. Nister and H. Stewenius. Scalable recognition with a vocabulary tree. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2006.

[12] G. Obozinski, B. Taskar, and M. Jordan. Joint covariate selection for grouped classification. Technical Report 743, Dept. of Statistics, University of California Berkeley, 2007.

[13] B. A. Olshausen and D. J. Field. Sparse coding with an overcomplete basis set: A strategy employed by V1? Vision Research, 37, 1997.

[14] P. Quelhas, F. Monay, J. M. Odobez, D. Gatica-Perez, T. Tuytelaars, and L. J. Van Gool. Modeling scenes with local descriptors and latent aspects. In International Conference on Computer Vision (ICCV), 2005.

[15] E. Tola, V. Lepetit, and P. Fua. A fast local descriptor for dense matching. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2008.

[16] J. Yang, K. Yu, Y. Gong, and T. Huang. Linear spatial pyramid matching using sparse coding for image classification. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2009.
", "award": [], "sourceid": 865, "authors": [{"given_name": "Samy", "family_name": "Bengio", "institution": null}, {"given_name": "Fernando", "family_name": "Pereira", "institution": null}, {"given_name": "Yoram", "family_name": "Singer", "institution": null}, {"given_name": "Dennis", "family_name": "Strelow", "institution": null}]}