1 Introduction
Deep neural networks (DNNs) have seen enormous success in image and speech classification, natural language processing, and the health sciences, among other big-data-driven applications in recent years. However, DNNs typically require hundreds of megabytes of memory to store the trainable full-precision floating-point parameters, and billions of floating point operations (FLOPs) to make a single inference. This makes the deployment of DNNs on mobile devices a challenge. Considerable recent effort has been devoted to the training of low-precision (quantized) models for substantial memory savings and computation/power efficiency, while nearly maintaining the performance of full-precision networks. Most works to date are concerned with weight quantization (WQ)
bc_15 ; twn_16 ; xnornet_16 ; lbwn_16 ; carreira1 ; BR_18 . In he2018relu , He et al. theoretically justified the applicability of WQ models by investigating their expressive power. Some also studied activation function quantization (AQ) bnn_16 ; xnornet_16 ; Hubara2017QuantizedNN ; halfwave_17 ; entropy_17 ; dorefa_16 , which utilizes an external process outside of the network training. This is different from WQ at 4 bits or under, which must be achieved through network training. Learning the activation function as a parametrized family $\sigma(\cdot,\alpha)$ and as part of network training has been studied in paramrelu for the parametric rectified linear unit, and was recently extended to uniform AQ in pact . In uniform AQ, $\sigma(\cdot,\alpha)$ is a step (or piecewise constant) function in $x$, and the parameter $\alpha>0$ determines the height and length of the steps. In terms of the partial derivative of $\sigma(x,\alpha)$ in $\alpha$, a two-valued proxy derivative of the parametric activation function (PACT) was proposed in pact , although we will present an almost everywhere (a.e.) exact one in this paper.

The mathematical difficulty in training activation quantized networks is that the loss function becomes a piecewise constant function with sampled stochastic gradient a.e. zero, which is undesirable for backpropagation. A simple and effective way around this problem is to use a (generalized) straight-through (ST) estimator or the derivative of a related (sub)differentiable function
hinton2012neural ; bengio2013estimating ; bnn_16 ; Hubara2017QuantizedNN , such as the clipped rectified linear unit (clipped ReLU) halfwave_17 . The idea of the ST estimator dates back to the perceptron algorithm rosenblatt1957perceptron ; rosenblatt1962principles proposed in the 1950s for learning single-layer perceptrons with binary output. For multi-layer networks with hard threshold activation (a.k.a. binary neuron), Hinton hinton2012neural proposed to use the derivative of the identity function as a proxy in backpropagation or the chain rule, similar to the perceptron algorithm. The proxy derivative used in the backward pass only was referred to as the straight-through estimator in bengio2013estimating , and several variants of the ST estimator bnn_16 ; Hubara2017QuantizedNN ; halfwave_17 have been proposed for handling quantized activation functions since then. A similar situation, where the derivative of a certain layer composited in the loss function is unavailable for backpropagation, has also been brought up recently by wang2018deep while improving the accuracy of DNNs by replacing the softmax classifier layer with an implicit weighted nonlocal Laplacian layer. For the training of the latter, the derivative of a pre-trained fully-connected layer was used as a surrogate
wang2018deep .

On the theoretical side, while the convergence of the single-layer perceptron algorithm has been extensively studied widrow199030 ; freund1999large , there is almost no theoretical understanding of the unusual 'gradient' output from the modified chain rule based on the ST estimator. Since this unusual 'gradient' is certainly not the gradient of the objective function, a question naturally arises: how does it correlate to the objective function? One of the contributions of this paper is to answer this question. Our main contributions are threefold:

Firstly, we introduce the notion of coarse derivative and cast the early ST estimators, or proxy partial derivatives of $\sigma(x,\alpha)$ in $\alpha$ including the two-valued one of PACT pact , as examples. The coarse derivative is non-unique. We propose a three-valued coarse partial derivative of the quantized activation function $\sigma(x,\alpha)$ in $\alpha$ that can outperform the two-valued one pact in network training. We find that, unlike the partial derivative in $x$ which vanishes a.e., the a.e. partial derivative of $\sigma(x,\alpha)$ in $\alpha$ is actually multi-valued (piecewise constant). Surprisingly, this a.e. accurate derivative is empirically less useful than the coarse ones in fully quantized network training.

Secondly, we propose a novel accelerated training algorithm for fully quantized networks, termed the blended coarse gradient descent method (BCGD). Instead of correcting the current full-precision weights with the coarse gradient at their quantized values, as in the popular BinaryConnect scheme bc_15 ; bnn_16 ; xnornet_16 ; twn_16 ; dorefa_16 ; halfwave_17 ; Goldstein_17 ; BR_18 , the BCGD weight update applies the coarse gradient correction to a suitable average of the full-precision weights and their quantization. We shall show that BCGD satisfies the sufficient descent property for objectives with Lipschitz gradients, while BinaryConnect does not unless an approximate orthogonality condition holds for the iterates BR_18 .

Our third contribution is the mathematical analysis of coarse gradient descent for a two-layer network with binarized ReLU activation function and i.i.d. unit Gaussian data. We provide an explicit form of the coarse gradient based on the proxy derivative of regular ReLU, and show that when there are infinitely many training data, the negative expected coarse gradient gives a descent direction for minimizing the expected training loss. Moreover, we prove that a normalized coarse gradient descent algorithm only converges to either a global minimum or a potential spurious local minimum. This answers the question raised above.
The rest of the paper is organized as follows. In section 2, we discuss the concept of coarse derivative and give examples for quantized activation functions. In section 3, we present the joint weight and activation quantization problem, and the BCGD algorithm satisfying the sufficient descent property. For the readers' convenience, we also review formulas for 1-bit, 2-bit and 4-bit weight quantization used later in our numerical experiments. In section 4, we give details of fully quantized network training, including the disparate learning rates on the weights $\mathbf{w}$ and the resolutions $\boldsymbol{\alpha}$. We illustrate the enhanced validation accuracies of BCGD over BinaryConnect, and of the 3-valued coarse partial derivative of $\sigma$ in $\alpha$ over the 2-valued and a.e. partial derivatives, in the case of 4-bit activation and (1,2,4)-bit weights on the CIFAR-10 image dataset. We show top-1 and top-5 validation accuracies of ResNet-18 with all convolutional layers quantized at 1-bit weight/4-bit activation (1W4A), 4-bit weight/4-bit activation (4W4A), and 4-bit weight/8-bit activation (4W8A), using the 3-valued and 2-valued partial derivatives. The 3-valued partial derivative outperforms the two-valued one by a larger margin in the low-bit regime. The accuracies degrade gracefully from 4W8A to 1W4A while all the convolutional layers are quantized. The 4W8A accuracies with either the 3-valued or the 2-valued partial derivatives are within 1% of those of the full-precision network. If the first and last convolutional layers are kept in full precision, our top-1 (top-5) accuracy of ResNet-18 at 1W4A with the 3-valued coarse derivative is 4.7% (3%) higher than that of HWGQ halfwave_17 on the ImageNet dataset. This is in part due to the value of the parameter $\alpha$ being learned without any statistical assumption.
Notations.
$\|\cdot\|$ denotes the Euclidean norm of a vector or the spectral norm of a matrix; $\|\cdot\|_1$ denotes the $\ell_1$ norm. $\mathbf{0}$ represents the vector of zeros, whereas $\mathbf{1}$ the vector of all ones. We denote vectors by bold small letters and matrices by bold capital ones. For any $\mathbf{w},\mathbf{z}\in\mathbb{R}^{n}$, $\langle\mathbf{w},\mathbf{z}\rangle=\mathbf{w}^{\top}\mathbf{z}$ is their inner product. $\mathbf{w}\odot\mathbf{z}$ denotes the Hadamard product, whose $i$-th entry is given by $(\mathbf{w}\odot\mathbf{z})_{i}=w_{i}z_{i}$.

2 Activation Quantization
In a network with quantized activation, given the $j$-th training sample of input $\mathbf{x}_{j}$ and label $y_{j}$, the associated sample loss is a composite function of the form:
$$\ell_{j}(\mathbf{w},\boldsymbol{\alpha}) := \ell\Big(\mathbf{w}_{L}*\sigma\big(\mathbf{w}_{L-1}*\cdots\,\sigma(\mathbf{w}_{1}*\mathbf{x}_{j},\,\alpha_{1})\cdots,\ \alpha_{L-1}\big);\ y_{j}\Big), \qquad(1)$$
where $\mathbf{w}_{l}$ contains the weights in the $l$-th linear (fully-connected or convolutional) layer, '$*$' denotes either the matrix-vector product or the convolution operation; reshaping is necessary to avoid mismatch in dimensions. The $l$-th quantized ReLU $\sigma(\cdot,\alpha_{l})$ acts element-wise on the vector/tensor output from the previous linear layer and is parameterized by a trainable scalar $\alpha_{l}>0$, known as the resolution. For practical hardware-level implementation, we are most interested in uniform quantization:
$$\sigma(x,\alpha)=\begin{cases}0, & x\le 0,\\[2pt] k\,\alpha, & (k-1)\,\alpha< x\le k\,\alpha,\quad k=1,\dots,2^{b_a}-1,\\[2pt] (2^{b_a}-1)\,\alpha, & x>(2^{b_a}-1)\,\alpha,\end{cases} \qquad(2)$$
where $x$ is the scalar input, $\alpha>0$ the resolution, $b_a$ the bit-width of the activation, and $k$ the quantization level. For example, in 4-bit activation quantization (4A), we have $b_a=4$ and $2^{4}=16$ quantization levels including the zero.
Given $N$ training samples, we train the network with quantized ReLU by solving the following empirical risk minimization
$$\min_{\mathbf{w},\,\boldsymbol{\alpha}}\ f(\mathbf{w},\boldsymbol{\alpha}) := \frac{1}{N}\sum_{j=1}^{N}\ell_{j}(\mathbf{w},\boldsymbol{\alpha}). \qquad(3)$$
In the gradient-based training framework, one needs to evaluate the gradient of the sample loss (1) using the so-called backpropagation (a.k.a. chain rule), which involves the computation of the partial derivatives $\frac{\partial\sigma}{\partial x}$ and $\frac{\partial\sigma}{\partial\alpha}$. Apparently, the partial derivative of $\sigma(x,\alpha)$ in $x$ is almost everywhere (a.e.) zero. After composition, this results in a.e. zero gradient of $\ell_{j}$ with respect to (w.r.t.) $\mathbf{w}_{l}$ and $\alpha_{l}$ in (1), causing their updates to become stagnant. To see this, we abstract the partial gradients w.r.t. $\mathbf{w}_{l}$ and $\alpha_{l}$, for instance, through the chain rule as follows:
$$\frac{\partial\ell_{j}}{\partial\mathbf{w}_{l}} = \frac{\partial\mathbf{z}_{l}}{\partial\mathbf{w}_{l}}\circ\frac{\partial\sigma}{\partial\mathbf{z}_{l}}(\mathbf{z}_{l},\alpha_{l})\circ\cdots\circ\frac{\partial\sigma}{\partial\mathbf{z}_{L-1}}(\mathbf{z}_{L-1},\alpha_{L-1})\circ\frac{\partial\ell_{j}}{\partial\mathbf{z}_{L}}$$
and
$$\frac{\partial\ell_{j}}{\partial\alpha_{l}} = \frac{\partial\sigma}{\partial\alpha_{l}}(\mathbf{z}_{l},\alpha_{l})\circ\cdots\circ\frac{\partial\sigma}{\partial\mathbf{z}_{L-1}}(\mathbf{z}_{L-1},\alpha_{L-1})\circ\frac{\partial\ell_{j}}{\partial\mathbf{z}_{L}},$$
where we recursively define $\mathbf{z}_{1}=\mathbf{w}_{1}*\mathbf{x}_{j}$ and $\mathbf{z}_{l}=\mathbf{w}_{l}*\sigma(\mathbf{z}_{l-1},\alpha_{l-1})$ for $l\ge 2$ as the output from the $l$-th linear layer, and '$\circ$' denotes some sort of proper composition in the chain rule. It is clear that the two partial gradients are zeros a.e. because of the term $\frac{\partial\sigma}{\partial\mathbf{z}_{L-1}}(\mathbf{z}_{L-1},\alpha_{L-1})$.
In fact, the automatic differentiation embedded in deep learning platforms such as PyTorch pytorch would produce precisely zero gradients.
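As a quick illustration (our own minimal sketch, not code from the paper), the uniform quantized ReLU (2) can be written with a ceiling and a clamp, and autograd indeed returns an identically zero gradient in the input:

```python
import torch

def quantized_relu(x, alpha, bits=4):
    """Uniform quantized ReLU of (2): output levels {0, alpha, ..., (2^bits - 1) * alpha}."""
    m = 2 ** bits - 1
    return alpha * torch.clamp(torch.ceil(x / alpha), min=0, max=m)

x = torch.randn(16, requires_grad=True)
loss = quantized_relu(x, alpha=torch.tensor(0.25)).sum()
loss.backward()
print(x.grad.abs().max())   # tensor(0.): the activation is piecewise constant in x
```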
To get around this, we use a proxy derivative or so-called ST estimator for backpropagation. Slightly overloading the derivative notation, we denote the proxy partial derivative in $x$ by
$$\frac{\partial\sigma}{\partial x}(x,\alpha)\approx\begin{cases}1, & 0<x\le(2^{b_a}-1)\,\alpha,\\[2pt] 0, & \text{otherwise}.\end{cases}$$
The proxy partial derivative has a nonzero value in the middle to reflect the overall variation of $\sigma$, which can be viewed as the derivative of the large-scale (step-back) view of $\sigma$ in $x$, or the derivative of the clipped ReLU halfwave_17 :
$$\tilde{\sigma}(x,\alpha):=\min\big\{\max\{x,\,0\},\ (2^{b_a}-1)\,\alpha\big\}. \qquad(4)$$
On the other hand, we find the a.e. partial derivative of $\sigma(x,\alpha)$ w.r.t. $\alpha$ to be
$$\frac{\partial\sigma}{\partial\alpha}(x,\alpha)=\begin{cases}0, & x\le 0,\\[2pt] k, & (k-1)\,\alpha< x\le k\,\alpha,\quad k=1,\dots,2^{b_a}-1,\\[2pt] 2^{b_a}-1, & x>(2^{b_a}-1)\,\alpha.\end{cases}$$
Surprisingly, this a.e. derivative is not the best in terms of accuracy or computational cost in training, as will be reported in section 4. We propose an empirical three-valued proxy partial derivative in $\alpha$ as follows:
$$\frac{\partial\sigma}{\partial\alpha}(x,\alpha)\approx\begin{cases}0, & x\le 0,\\[2pt] 2^{b_a-1}, & 0< x\le (2^{b_a}-1)\,\alpha,\\[2pt] 2^{b_a}-1, & x>(2^{b_a}-1)\,\alpha.\end{cases}$$
The middle value $2^{b_a-1}$ is the arithmetic mean of the intermediate values $1,\dots,2^{b_a}-1$ of the a.e. partial derivative above. Similarly, an even coarser two-valued proxy, the same as PACT pact (which was derived differently), follows by zeroing out all the nonzero values except their maximum:
$$\frac{\partial\sigma}{\partial\alpha}(x,\alpha)\approx\begin{cases}0, & x\le(2^{b_a}-1)\,\alpha,\\[2pt] 2^{b_a}-1, & x>(2^{b_a}-1)\,\alpha.\end{cases}$$
This turns out to be exactly the partial derivative of the clipped ReLU $\tilde{\sigma}$ defined in (4).
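The three proxy choices can be summarized in a short sketch (ours, following the formulas as written above; `bits` plays the role of the bit-width $b_a$):

```python
import torch

def dsigma_dalpha(x, alpha, bits=4, mode="3-valued"):
    """Proxy partial derivative of the quantized ReLU (2) w.r.t. the resolution alpha.
    'ae'       : a.e. exact derivative, equal to the quantization level k of x;
    '3-valued' : 0 / 2^(bits-1) / 2^bits - 1 (the coarse derivative proposed here);
    '2-valued' : 0 / 2^bits - 1, i.e. the PACT-style derivative of the clipped ReLU (4)."""
    m = 2 ** bits - 1                                   # number of nonzero levels
    if mode == "ae":
        return torch.clamp(torch.ceil(x / alpha), min=0, max=m)
    inside = ((x > 0) & (x <= m * alpha)).to(x.dtype)   # 0 < x <= (2^bits - 1) * alpha
    above = (x > m * alpha).to(x.dtype)
    if mode == "3-valued":
        return inside * 2 ** (bits - 1) + above * m
    if mode == "2-valued":
        return above * m
    raise ValueError(f"unknown mode: {mode}")
```

In a full training pipeline these proxies would only be used in the backward pass (e.g., inside a custom autograd function), while the forward pass applies (2) exactly.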
We shall refer to the resultant composite 'gradient' of $f$ obtained through the modified chain rule and averaging over the samples as the coarse gradient. While given the name 'gradient', we believe it is generally not the gradient of any smooth function. It nevertheless somehow exploits the essential information of the piecewise constant function $f$, and its negation provides a descent direction for the minimization. In section 5, we will validate this claim by examining a two-layer network with i.i.d. Gaussian data. We find that when there is an infinite number of training samples, the overall training loss (i.e., the population loss) becomes pleasantly differentiable, with a nontrivial gradient that possesses certain Lipschitz continuity. More importantly, we shall show an example of the expected coarse gradient that provably forms an acute angle with the underlying true gradient of the population loss and only vanishes at the possible local minimizers of the original problem.
During the training process, the resolution vector $\boldsymbol{\alpha}$ (one component per activation layer) should be prevented from becoming either too small or too large. Due to the sensitivity of $\boldsymbol{\alpha}$, we propose a two-scale training scheme and set the learning rate of $\boldsymbol{\alpha}$ to be the learning rate of the weights multiplied by a rate factor far less than 1, which may be varied depending on the network architecture. This rate factor effectively helps the quantized network converge steadily and prevents $\boldsymbol{\alpha}$ from vanishing.
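For concreteness, the two-scale rule can be realized with optimizer parameter groups; the sketch below is ours, and the toy module, initialization, and numerical values are illustrative assumptions rather than the paper's settings:

```python
import torch
import torch.nn as nn

class QuantReLULayer(nn.Module):
    """Toy activation layer exposing a trainable resolution alpha (forward shown as a placeholder)."""
    def __init__(self, alpha0=1.0):
        super().__init__()
        self.alpha = nn.Parameter(torch.tensor(alpha0))
    def forward(self, x):
        return x.clamp(min=0)   # placeholder; a real layer applies (2) with a proxy backward pass

model = nn.Sequential(nn.Linear(32, 32), QuantReLULayer(), nn.Linear(32, 10))

weight_params = [p for n, p in model.named_parameters() if not n.endswith("alpha")]
alpha_params = [p for n, p in model.named_parameters() if n.endswith("alpha")]

base_lr, rate_factor = 0.1, 0.01    # rate factor far less than 1, per the discussion above
optimizer = torch.optim.SGD(
    [{"params": weight_params, "lr": base_lr},
     {"params": alpha_params, "lr": base_lr * rate_factor}],
    momentum=0.9)
```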
3 Full Quantization
Imposing quantized weights amounts to adding a discrete set-constraint $\mathbf{w}\in\mathcal{Q}$ to the optimization problem (3). Suppose $n$ is the total number of weights in the network. For the commonly used $b_w$-bit layer-wise quantization, $\mathcal{Q}$ takes the form of a product of layer-wise constraint sets, meaning that the weight tensor $\mathbf{w}_{l}$ in the $l$-th linear layer is constrained to the form $\mathbf{w}_{l}=\delta_{l}\,\mathbf{q}_{l}$ for some adjustable scaling factor $\delta_{l}>0$ shared by the weights in the same layer. Each component of $\mathbf{q}_{l}$ is drawn from the quantization set given by $\{\pm1\}$ for $b_w=1$ (binarization) and $\{0,\pm1,\dots,\pm(2^{b_w-1}-1)\}$ for $b_w\ge2$. This assumption on $\mathcal{Q}$ generalizes those of the 1-bit BWN xnornet_16 and the 2-bit TWN twn_16 . As such, the layer-wise weight and activation quantization problem here can be stated abstractly as follows
$$\min_{\mathbf{w},\,\boldsymbol{\alpha}}\ f(\mathbf{w},\boldsymbol{\alpha})\quad\text{subject to}\quad \mathbf{w}\in\mathcal{Q},\ \ \boldsymbol{\alpha}>0, \qquad(5)$$
where the training loss $f(\mathbf{w},\boldsymbol{\alpha})$ is defined in (3). Different from activation quantization, one bit is taken to represent the signs. For ease of presentation, we only consider network-wise weight quantization throughout this section, i.e., the weights across all the layers share the same (trainable) floating scaling factor $\delta$, or simply, $\mathbf{w}=\delta\,\mathbf{q}$ with $q_{i}\in\{\pm1\}$ for $b_w=1$ and $q_{i}\in\{0,\pm1,\dots,\pm(2^{b_w-1}-1)\}$ for $b_w\ge2$.
3.1 Weight Quantization
Given a float weight vector $\mathbf{w}_{f}$, the quantization of $\mathbf{w}_{f}$ is basically the following optimization problem for computing the projection of $\mathbf{w}_{f}$ onto the set $\mathcal{Q}$:
$$\mathrm{proj}_{\mathcal{Q}}(\mathbf{w}_{f}) := \arg\min_{\mathbf{w}\in\mathcal{Q}}\ \|\mathbf{w}-\mathbf{w}_{f}\|^{2}. \qquad(6)$$
Note that $\mathcal{Q}$ is a nonconvex set, so the solution of (6) may not be unique. When $b_w=1$, we have the binarization problem
$$\min_{\delta,\,\mathbf{q}}\ \|\delta\,\mathbf{q}-\mathbf{w}_{f}\|^{2}\quad\text{subject to}\quad \delta>0,\ \mathbf{q}\in\{\pm1\}^{n}. \qquad(7)$$
For $b_w\ge2$, the projection/quantization problem (6) can be reformulated as
$$\min_{\delta,\,\mathbf{q}}\ \|\delta\,\mathbf{q}-\mathbf{w}_{f}\|^{2}\quad\text{subject to}\quad \delta>0,\ \mathbf{q}\in\{0,\pm1,\dots,\pm(2^{b_w-1}-1)\}^{n}. \qquad(8)$$
It has been shown that the closed-form (exact) solution of (7) can be computed at $O(n)$ complexity for 1-bit binarization xnornet_16 and at $O(n\log n)$ complexity for 2-bit ternarization lbwn_16 . An empirical ternarizer of $O(n)$ complexity has also been proposed twn_16 . At wider bit-widths $b_w\ge3$, accurately solving (8) becomes computationally intractable due to the combinatorial nature of the problem lbwn_16 .
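The 1-bit closed form is simple to state: $\mathbf{q}=\mathrm{sign}(\mathbf{w}_{f})$ and $\delta=\|\mathbf{w}_{f}\|_{1}/n$ (the mean absolute value). A minimal sketch (ours):

```python
import torch

def binarize(w_float):
    """Closed-form solution of the 1-bit problem (7): q = sign(w), delta = ||w||_1 / n."""
    q = torch.sign(w_float)
    q[q == 0] = 1.0                    # break ties at zero arbitrarily
    delta = w_float.abs().mean()
    return delta, q
```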
The problem (8) is basically a constrained $K$-means clustering problem of 1-D points BR_18 with the centroids being equally spaced. It can in principle be solved by a variant of the classical Lloyd's algorithm lloyd via an alternating minimization procedure, which iterates between the assignment step ($\mathbf{q}$-update) and the centroid step ($\delta$-update). In the $t$-th iteration, fixing the scaling factor $\delta^{t}$, each entry of $\mathbf{q}^{t+1}$ is chosen from the quantization set so that $\delta^{t}q_{i}^{t+1}$ is as close as possible to $(\mathbf{w}_{f})_{i}$. In the $\delta$-update, the quadratic problem
$$\min_{\delta}\ \|\delta\,\mathbf{q}^{t+1}-\mathbf{w}_{f}\|^{2}$$
is solved by $\delta^{t+1}=\frac{\langle\mathbf{q}^{t+1},\mathbf{w}_{f}\rangle}{\|\mathbf{q}^{t+1}\|^{2}}$. Since quantization (6) is required in every iteration, to make this procedure practical, we just perform a single iteration of Lloyd's algorithm, empirically initializing $\delta$ so that the largest quantization level is aligned with the largest magnitudes in $\mathbf{w}_{f}$. This makes the large components in $\mathbf{w}_{f}$ well clustered.
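A sketch (ours) of the resulting one-step quantizer for $b_w\ge2$; the concrete initialization of $\delta$ shown in the code, matching the largest level to $\max_{i}|(\mathbf{w}_{f})_{i}|$, is an illustrative assumption rather than necessarily the paper's exact choice:

```python
import torch

def quantize_weights(w_float, bits=2, iters=1):
    """Approximate solution of (8) by `iters` steps of the constrained Lloyd iteration above.
    Quantization levels are {0, +-1, ..., +-(2^(bits-1) - 1)} scaled by delta."""
    m = 2 ** (bits - 1) - 1
    delta = w_float.abs().max().clamp(min=1e-12) / m              # assumed initialization of delta
    for _ in range(iters):                                        # the text performs a single iteration
        q = torch.clamp(torch.round(w_float / delta), -m, m)      # assignment step (q-update)
        delta = (q * w_float).sum() / q.pow(2).sum().clamp(min=1e-12)  # centroid step (delta-update)
    return delta, q
```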
First introduced by Courbariaux et al. in bc_15 , the BinaryConnect (BC) scheme has drawn much attention in training DNNs with quantized weights and regular ReLU. It can be summarized as
$$\mathbf{w}_{f}^{t+1}=\mathbf{w}_{f}^{t}-\eta\,\nabla f(\mathbf{w}^{t}),\qquad \mathbf{w}^{t+1}=\mathrm{proj}_{\mathcal{Q}}(\mathbf{w}_{f}^{t+1}),$$
where $\{\mathbf{w}^{t}\}$ denotes the sequence of the desired quantized weights, and $\{\mathbf{w}_{f}^{t}\}$ is an auxiliary sequence of floating weights. BC can be readily extended to the full quantization regime by including the update of $\boldsymbol{\alpha}$ and replacing the true gradient with the coarse gradient from section 2. With a subtle change to the standard projected gradient descent algorithm (PGD) combettes2015stochastic , namely
$$\mathbf{w}_{f}^{t+1}=\mathbf{w}^{t}-\eta\,\nabla f(\mathbf{w}^{t}),\qquad \mathbf{w}^{t+1}=\mathrm{proj}_{\mathcal{Q}}(\mathbf{w}_{f}^{t+1}),$$
BC significantly outperforms PGD and effectively bypasses spurious local minima Goldstein_17 . An intuitive explanation is that the constraint set $\mathcal{Q}$ is basically a finite union of isolated one-dimensional subspaces (i.e., lines that pass through the origin) BR_18 . Since $\mathbf{w}_{f}^{t+1}$ is obtained near the projected point $\mathbf{w}^{t}$, the sequence generated by PGD can easily get stuck in some line subspace when updated with a small learning rate $\eta$; see Figure 2 for graphical illustrations.
3.2 Blended Gradient Descent and Sufficient Descent Property
Despite the superiority of BC over PGD, we point out a drawback in regard to its convergence. While Yin et al. provided the convergence proof of the BC scheme in the recent papers BR_18 , their analysis hinges on an approximate orthogonality condition which may not hold in practice; see Lemma 4.4 and Theorem 4.10 of BR_18 . Suppose $f$ has $L$-Lipschitz gradient (this assumption is valid for the population loss function; we refer readers to Lemma 2 in section 5). In light of the convergence proof in Theorem 4.10 of BR_18 , we have
$$f(\mathbf{w}^{t+1})-f(\mathbf{w}^{t})\ \le\ \frac{L}{2}\,\|\mathbf{w}^{t+1}-\mathbf{w}^{t}\|^{2}\ -\ \frac{1}{2\eta}\Big(\|\mathbf{w}_{f}^{t+1}-\mathbf{w}^{t}\|^{2}-\|\mathbf{w}_{f}^{t+1}-\mathbf{w}^{t+1}\|^{2}\Big). \qquad(9)$$
For the objective sequence $\{f(\mathbf{w}^{t})\}$ to be monotonically decreasing and converging to a critical point, it is crucial to have the sufficient descent property gilbert1992global hold for sufficiently small learning rate $\eta>0$:
$$f(\mathbf{w}^{t+1})-f(\mathbf{w}^{t})\ \le\ -c\,\|\mathbf{w}^{t+1}-\mathbf{w}^{t}\|^{2} \qquad(10)$$
with some positive constant $c>0$.
Since $\mathbf{w}^{t}=\mathrm{proj}_{\mathcal{Q}}(\mathbf{w}_{f}^{t})$ and $\mathbf{w}^{t+1}=\mathrm{proj}_{\mathcal{Q}}(\mathbf{w}_{f}^{t+1})$, it holds in (9) that
$$\|\mathbf{w}_{f}^{t+1}-\mathbf{w}^{t}\|^{2}-\|\mathbf{w}_{f}^{t+1}-\mathbf{w}^{t+1}\|^{2}\ \ge\ 0.$$
Due to the nonconvexity of the set $\mathcal{Q}$, the above term can be as small as zero even when $\mathbf{w}^{t}$ and $\mathbf{w}^{t+1}$ are distinct. So it is not guaranteed to dominate the right-hand side of (9). Consequently, given (9), the inequality (10) does not necessarily hold. Without sufficient descent, even if $\{f(\mathbf{w}^{t})\}$ converges, the iterates $\{\mathbf{w}^{t}\}$ may not converge well to a critical point. To fix this issue, we blend the ideas of PGD and BC, and propose the following blended gradient descent (BGD)
$$\mathbf{w}_{f}^{t+1}=(1-\rho)\,\mathbf{w}_{f}^{t}+\rho\,\mathbf{w}^{t}-\eta\,\nabla f(\mathbf{w}^{t}),\qquad \mathbf{w}^{t+1}=\mathrm{proj}_{\mathcal{Q}}(\mathbf{w}_{f}^{t+1}), \qquad(11)$$
for some blending parameter $\rho\in(0,1)$. In contrast, the blended gradient descent satisfies (10) for small enough $\eta$.
Proposition 1.
For $\rho\in(0,1)$, the BGD (11) satisfies
$$f(\mathbf{w}^{t+1})-f(\mathbf{w}^{t})\ \le\ -\Big(\frac{\rho}{2\eta}-\frac{L}{2}\Big)\|\mathbf{w}^{t+1}-\mathbf{w}^{t}\|^{2},$$
which yields the sufficient descent property (10) whenever $\eta<\rho/L$.
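To make the update concrete, here is a minimal sketch (ours) of one BGD weight update; `quantizer` stands for any projection onto $\mathcal{Q}$, e.g. `binarize` or `quantize_weights` above, and `coarse_grad` is assumed to have been obtained by backpropagation at the quantized weights $\mathbf{w}^{t}$ using the proxy derivatives of section 2:

```python
def bgd_step(w_float, coarse_grad, lr, rho, quantizer):
    """One blended gradient descent update, following (11):
        w_f <- (1 - rho) * w_f + rho * proj_Q(w_f) - lr * g,    w <- proj_Q(w_f).
    rho -> 0 recovers BinaryConnect, while rho = 1 recovers projected gradient descent."""
    delta, q = quantizer(w_float)
    w_quant = delta * q                                           # current quantized weights
    w_float = (1 - rho) * w_float + rho * w_quant - lr * coarse_grad
    delta, q = quantizer(w_float)                                 # re-project after the blended step
    return w_float, delta * q
```

In a full BCGD training loop, the resolutions $\boldsymbol{\alpha}$ would be updated jointly with their smaller learning rate, as described at the end of section 2.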
4 Experiments
We tested BCGD, as summarized in Algorithm 1, on the CIFAR-10 cifar_09 and ImageNet imagenet_09 ; imagnet_12 color image datasets. We coded up BCGD on the PyTorch platform pytorch . In all experiments, we fixed the blending factor $\rho$ in (11) to a constant value. All runs with quantization are warm started with a float pre-trained model, and the resolutions $\boldsymbol{\alpha}$ are initialized by a fixed fraction of the maximal values in the corresponding feature maps generated by a random mini-batch. The learning rate for the weights starts from a prescribed value and is decayed during training; the learning rate for $\boldsymbol{\alpha}$ equals the weight learning rate multiplied by the rate factor discussed in section 2. The weights $\mathbf{w}$ and resolutions $\boldsymbol{\alpha}$ are updated jointly. In addition, we used momentum and batch normalization bn_15 to promote training efficiency. We mainly compare the performance of the proposed BCGD and the state-of-the-art BC (adapted for full quantization) on layer-wise quantization. The experiments were carried out on machines with 4 Nvidia GeForce GTX 1080 Ti GPUs.

The CIFAR-10 dataset consists of 60,000 color images from 10 classes, with 6,000 images per class. The dataset is split into 50,000 training images and 10,000 test images. In the experiments, we used the test images for validation. The models were trained with mini-batch gradient descent, with the learning rate decayed at epochs 80 and 140; weight decay and momentum were also used. The a.e. derivative and the 3-valued and 2-valued coarse derivatives of $\sigma$ in $\alpha$ are compared on the VGG-11 vgg_14 and ResNet-20 resnet_15 architectures, and the results are listed in Tables 1, 2 and 3, respectively. It can be seen that the 3-valued coarse derivative gives the best overall performance in terms of accuracy. Figure 3 shows that in weight binarization, BCGD converges faster and to a better solution than BC.
Table 1: CIFAR-10 validation accuracies (%) using the a.e. partial derivative of σ(x,α) in α.

Network             Float   32W4A   1W4A    2W4A    4W4A
VGG-11 + BC         92.13   91.74   88.12   89.78   91.51
VGG-11 + BCGD       —       —       88.74   90.08   91.38
ResNet-20 + BC      92.41   91.90   89.23   90.89   91.53
ResNet-20 + BCGD    —       —       90.10   91.15   91.56
Table 2: CIFAR-10 validation accuracies (%) using the 3-valued coarse partial derivative of σ(x,α) in α.

Network             Float   32W4A   1W4A    2W4A    4W4A
VGG-11 + BC         92.13   92.08   89.12   90.52   91.89
VGG-11 + BCGD       —       —       89.59   90.71   91.70
ResNet-20 + BC      92.41   92.14   89.37   91.02   91.71
ResNet-20 + BCGD    —       —       90.05   91.03   91.97
Table 3: CIFAR-10 validation accuracies (%) using the 2-valued coarse partial derivative of σ(x,α) in α.

Network             Float   32W4A   1W4A    2W4A    4W4A
VGG-11 + BC         92.13   91.66   88.50   89.99   91.31
VGG-11 + BCGD       —       —       89.12   90.00   91.31
ResNet-20 + BC      92.41   91.73   89.22   90.64   91.37
ResNet-20 + BCGD    —       —       89.98   90.75   91.65
The ImageNet (ILSVRC12) dataset imagenet_09 is a benchmark for the large-scale image classification task, with about 1.2 million training images and 50,000 validation images in 1,000 categories. The models were trained for 80 epochs with the learning rate decayed at epochs 50 and 70; weight decay and momentum were also used. The ResNet-18 accuracies of 65.46%/86.36% at 1W4A in Table 4 outperform HWGQ halfwave_17 , whose top-1/top-5 accuracies are 60.8%/83.4%, with non-quantized first/last convolutional layers. The results in Tables 4 and 5 show that the 3-valued coarse partial derivative is more effective than the 2-valued one as the quantization bit precision is lowered. We also observe that the accuracies degrade gracefully from 4W8A to 1W4A for ResNet-18 while quantizing all convolutional layers. Again, BCGD converges much faster than BC towards higher accuracy, as illustrated by Figure 4.
Table 4: ImageNet validation accuracies (%) of ResNet-18 with all convolutional layers quantized, comparing the 3-valued and 2-valued coarse partial derivatives.

          Float    1W4A                  4W4A                  4W8A
                   3-valued  2-valued    3-valued  2-valued    3-valued  2-valued
top-1     69.64    64.36/    63.37/      67.36     66.97       68.85     68.83
top-5     88.98    85.65/    84.93/      87.76     87.41       88.71     88.84
Table 5: ImageNet validation accuracies (%), comparing the 3-valued and 2-valued coarse partial derivatives.

          Float    1W4A                  4W4A                  4W8A
                   3-valued  2-valued    3-valued  2-valued    3-valued  2-valued
top-1     73.27    68.43     67.51       70.81     70.01       72.07     72.18
top-5     91.43    88.29     87.72       90.00     89.49       90.71     90.73
5 Analysis of Coarse Gradient Descent for Activation Quantization
As a proof of concept, we analyze a simple two-layer network with binarized ReLU activation. Let $\sigma$ be the binarized ReLU function, the same as the hard threshold activation hinton2012neural , obtained by taking bit-width $b_a=1$ and resolution $\alpha=1$ in (2):
$$\sigma(x)=\begin{cases}0, & x\le 0,\\[2pt] 1, & x>0.\end{cases}$$
We define the training sample loss by
$$\ell(\mathbf{v},\mathbf{w};\mathbf{Z}):=\frac{1}{2}\Big(\mathbf{v}^{\top}\sigma(\mathbf{Z}\mathbf{w})-(\mathbf{v}^{*})^{\top}\sigma(\mathbf{Z}\mathbf{w}^{*})\Big)^{2},$$
where $\mathbf{Z}\in\mathbb{R}^{m\times n}$ is the input, $\sigma$ acts entry-wise, and $\mathbf{v}^{*}\in\mathbb{R}^{m}$ and $\mathbf{w}^{*}\in\mathbb{R}^{n}$ are the underlying (nonzero) teacher parameters in the second and first layers, respectively. As in the literature analyzing conventional ReLU networks Lee ; li2017convergence ; tian2017analytical ; brutzkus2017globally , we assume the entries of $\mathbf{Z}$ are i.i.d. sampled from the standard normal distribution $\mathcal{N}(0,1)$. Note that $\sigma(c\,x)=\sigma(x)$ for any scalar $c>0$. Without loss of generality, we fix $\|\mathbf{w}^{*}\|=1$.

5.1 Population Loss Minimization
Suppose we have $N$ independent training samples $\{\mathbf{Z}^{(1)},\dots,\mathbf{Z}^{(N)}\}$; then the associated empirical risk minimization reads
$$\min_{\mathbf{v},\mathbf{w}}\ \frac{1}{N}\sum_{j=1}^{N}\ell\big(\mathbf{v},\mathbf{w};\mathbf{Z}^{(j)}\big). \qquad(12)$$
The major difficulty of analysis here is that the empirical risk function in (12) is still piecewise constant in $\mathbf{w}$ and has a.e. zero partial gradient. This issue can be resolved by instead considering the following population loss minimization li2017convergence ; brutzkus2017globally ; Lee ; tian2017analytical :
$$\min_{\mathbf{v},\mathbf{w}}\ f(\mathbf{v},\mathbf{w}):=\mathbb{E}_{\mathbf{Z}}\big[\ell(\mathbf{v},\mathbf{w};\mathbf{Z})\big]. \qquad(13)$$
Specifically, in the limit $N\to\infty$, the objective function becomes favorably smooth with nontrivial gradient. For nonzero vectors $\mathbf{w}$ and $\mathbf{w}^{*}$, let us define the angle between them by
$$\theta(\mathbf{w},\mathbf{w}^{*}):=\arccos\Big(\frac{\mathbf{w}^{\top}\mathbf{w}^{*}}{\|\mathbf{w}\|\,\|\mathbf{w}^{*}\|}\Big);$$
then we have
Lemma 1.
If every entry of $\mathbf{Z}$ is i.i.d. sampled from $\mathcal{N}(0,1)$, $\|\mathbf{w}^{*}\|=1$, and $\mathbf{w}\ne\mathbf{0}$, then the population loss is
$$f(\mathbf{v},\mathbf{w})=\frac{1}{8}\Big[\|\mathbf{v}\|^{2}+\|\mathbf{v}^{*}\|^{2}+\big(\mathbf{1}^{\top}(\mathbf{v}-\mathbf{v}^{*})\big)^{2}\Big]+\frac{2\,\theta(\mathbf{w},\mathbf{w}^{*})-\pi}{4\pi}\,\mathbf{v}^{\top}\mathbf{v}^{*}. \qquad(14)$$
Moreover, the gradients of $f$ w.r.t. $\mathbf{v}$ and $\mathbf{w}$ are
$$\frac{\partial f}{\partial\mathbf{v}}(\mathbf{v},\mathbf{w})=\frac{1}{4}\Big[\mathbf{v}+\big(\mathbf{1}^{\top}(\mathbf{v}-\mathbf{v}^{*})\big)\,\mathbf{1}\Big]+\frac{2\,\theta(\mathbf{w},\mathbf{w}^{*})-\pi}{4\pi}\,\mathbf{v}^{*} \qquad(15)$$
and
$$\frac{\partial f}{\partial\mathbf{w}}(\mathbf{v},\mathbf{w})=-\frac{\mathbf{v}^{\top}\mathbf{v}^{*}}{2\pi\,\|\mathbf{w}\|\,\sin\theta(\mathbf{w},\mathbf{w}^{*})}\Big(\mathbf{w}^{*}-\cos\theta(\mathbf{w},\mathbf{w}^{*})\,\frac{\mathbf{w}}{\|\mathbf{w}\|}\Big), \qquad(16)$$
respectively.
Under the assumptions of Lemma 1, the possible (local) minimizers of problem (13) are located at

- stationary points where the gradients (15) and (16) vanish simultaneously;
- non-differentiable points where $\theta(\mathbf{w},\mathbf{w}^{*})=0$ and $\mathbf{v}=\mathbf{v}^{*}$, or $\theta(\mathbf{w},\mathbf{w}^{*})=\pi$ together with the corresponding minimizer of $f$ in $\mathbf{v}$.

Among them, $\{(\mathbf{v}^{*},\mathbf{w}):\theta(\mathbf{w},\mathbf{w}^{*})=0\}$ are the global minimizers, with $f=0$.
Proposition 2.
The gradient of the population loss, $\nabla f(\mathbf{v},\mathbf{w})$, is Lipschitz continuous under a boundedness condition.
Lemma 2.
For any $(\mathbf{v},\mathbf{w})$ and $(\tilde{\mathbf{v}},\tilde{\mathbf{w}})$ with $\min\{\|\mathbf{w}\|,\|\tilde{\mathbf{w}}\|\}\ge c_{w}>0$ and $\max\{\|\mathbf{v}\|,\|\tilde{\mathbf{v}}\|\}\le C_{v}$, there exists a constant $L>0$ depending on $c_{w}$ and $C_{v}$, such that
$$\|\nabla f(\mathbf{v},\mathbf{w})-\nabla f(\tilde{\mathbf{v}},\tilde{\mathbf{w}})\|\ \le\ L\,\big\|(\mathbf{v},\mathbf{w})-(\tilde{\mathbf{v}},\tilde{\mathbf{w}})\big\|.$$
5.2 Convergence Analysis of Normalized Coarse Gradient Descent
The partial gradients $\frac{\partial f}{\partial\mathbf{v}}$ and $\frac{\partial f}{\partial\mathbf{w}}$, however, are not available in the training. What we really have access to are the expectations of the sample gradients, namely,
$$\mathbb{E}_{\mathbf{Z}}\Big[\frac{\partial\ell}{\partial\mathbf{v}}(\mathbf{v},\mathbf{w};\mathbf{Z})\Big]\quad\text{and}\quad\mathbb{E}_{\mathbf{Z}}\Big[\frac{\partial\ell}{\partial\mathbf{w}}(\mathbf{v},\mathbf{w};\mathbf{Z})\Big]. \qquad(17)$$
If $\sigma$ were differentiable, then backpropagation would read
$$\frac{\partial\ell}{\partial\mathbf{v}}(\mathbf{v},\mathbf{w};\mathbf{Z})=\sigma(\mathbf{Z}\mathbf{w})\Big(\mathbf{v}^{\top}\sigma(\mathbf{Z}\mathbf{w})-(\mathbf{v}^{*})^{\top}\sigma(\mathbf{Z}\mathbf{w}^{*})\Big) \qquad(18)$$
and
$$\frac{\partial\ell}{\partial\mathbf{w}}(\mathbf{v},\mathbf{w};\mathbf{Z})=\mathbf{Z}^{\top}\big(\sigma'(\mathbf{Z}\mathbf{w})\odot\mathbf{v}\big)\Big(\mathbf{v}^{\top}\sigma(\mathbf{Z}\mathbf{w})-(\mathbf{v}^{*})^{\top}\sigma(\mathbf{Z}\mathbf{w}^{*})\Big). \qquad(19)$$
Now that $\sigma$ has zero derivative a.e., (19) is inapplicable. We study the coarse gradient descent with $\sigma'$ in (19) being replaced by the (sub)derivative of the regular ReLU $\mu(x)=\max\{x,0\}$. More precisely, we use the following surrogate of $\frac{\partial\ell}{\partial\mathbf{w}}$:
$$\mathbf{g}(\mathbf{v},\mathbf{w};\mathbf{Z}):=\mathbf{Z}^{\top}\big(\mu'(\mathbf{Z}\mathbf{w})\odot\mathbf{v}\big)\Big(\mathbf{v}^{\top}\sigma(\mathbf{Z}\mathbf{w})-(\mathbf{v}^{*})^{\top}\sigma(\mathbf{Z}\mathbf{w}^{*})\Big) \qquad(20)$$
with $\mu'(x)=\mathbb{1}_{\{x>0\}}$, and consider the following coarse gradient descent with weight normalization:
$$\mathbf{v}^{t+1}=\mathbf{v}^{t}-\eta\,\mathbb{E}_{\mathbf{Z}}\Big[\frac{\partial\ell}{\partial\mathbf{v}}(\mathbf{v}^{t},\mathbf{w}^{t};\mathbf{Z})\Big],\qquad \mathbf{w}^{t+1}=\frac{\mathbf{w}^{t}-\eta\,\mathbb{E}_{\mathbf{Z}}\big[\mathbf{g}(\mathbf{v}^{t},\mathbf{w}^{t};\mathbf{Z})\big]}{\big\|\mathbf{w}^{t}-\eta\,\mathbb{E}_{\mathbf{Z}}\big[\mathbf{g}(\mathbf{v}^{t},\mathbf{w}^{t};\mathbf{Z})\big]\big\|}. \qquad(21)$$
Lemma 3.
What is interesting is that the expected coarse partial gradient $\mathbb{E}_{\mathbf{Z}}\big[\mathbf{g}(\mathbf{v},\mathbf{w};\mathbf{Z})\big]$ is properly defined at the global minimizers of the population loss minimization problem (13), where $\mathbf{v}=\mathbf{v}^{*}$ and $\theta(\mathbf{w},\mathbf{w}^{*})=0$, whereas the true gradient $\frac{\partial f}{\partial\mathbf{w}}$ does not exist there. Our key finding is that the coarse gradient $\mathbb{E}_{\mathbf{Z}}[\mathbf{g}]$ has positive correlation with the true gradient $\frac{\partial f}{\partial\mathbf{w}}$, and consequently, $-\mathbb{E}_{\mathbf{Z}}[\mathbf{g}]$ together with $-\frac{\partial f}{\partial\mathbf{v}}$ give a descent direction in algorithm (21).
Lemma 4.
If every entry of $\mathbf{Z}$ is i.i.d. sampled from $\mathcal{N}(0,1)$, $\|\mathbf{w}^{*}\|=1$, and $\mathbf{w}\ne\mathbf{0}$, then the inner product between the expected coarse and true gradients w.r.t. $\mathbf{w}$, namely $\big\langle\mathbb{E}_{\mathbf{Z}}[\mathbf{g}(\mathbf{v},\mathbf{w};\mathbf{Z})],\ \frac{\partial f}{\partial\mathbf{w}}(\mathbf{v},\mathbf{w})\big\rangle$, is nonnegative.
Moreover, the following lemma asserts that $\mathbb{E}_{\mathbf{Z}}[\mathbf{g}(\mathbf{v},\mathbf{w};\mathbf{Z})]$ is sufficiently correlated with $\frac{\partial f}{\partial\mathbf{w}}(\mathbf{v},\mathbf{w})$, which will secure sufficient descent in objective values and thus the convergence of the iteration (21).
Lemma 5.
Suppose $\|\mathbf{v}\|\le C_{v}$ and $\mathbf{w}\ne\mathbf{0}$. There exists a constant $A>0$ depending on $C_{v}$ and $\|\mathbf{v}^{*}\|$, such that the inner product $\big\langle\mathbb{E}_{\mathbf{Z}}[\mathbf{g}(\mathbf{v},\mathbf{w};\mathbf{Z})],\ \frac{\partial f}{\partial\mathbf{w}}(\mathbf{v},\mathbf{w})\big\rangle$ is bounded below by $A\,\big\|\frac{\partial f}{\partial\mathbf{w}}(\mathbf{v},\mathbf{w})\big\|^{2}$.
Theorem 1.
Given an initialization $(\mathbf{v}^{0},\mathbf{w}^{0})$ with $\|\mathbf{w}^{0}\|=1$, let $\{(\mathbf{v}^{t},\mathbf{w}^{t})\}$ be the sequence generated by iteration (21). There exists $\eta_{0}>0$ such that for any step size $\eta<\eta_{0}$, $\{f(\mathbf{v}^{t},\mathbf{w}^{t})\}$ is monotonically decreasing, and both $\big\|\frac{\partial f}{\partial\mathbf{v}}(\mathbf{v}^{t},\mathbf{w}^{t})\big\|$ and $\big\|\mathbb{E}_{\mathbf{Z}}\big[\mathbf{g}(\mathbf{v}^{t},\mathbf{w}^{t};\mathbf{Z})\big]\big\|$ converge to 0 as $t\to\infty$.
Remark 1.
Combining the treatment in Lee for analyzing two-layer networks with regular ReLU and the positive correlation between $\mathbb{E}_{\mathbf{Z}}[\mathbf{g}]$ and $\frac{\partial f}{\partial\mathbf{w}}$, one can further show that if the initialization satisfies suitable conditions, then $\{(\mathbf{v}^{t},\mathbf{w}^{t})\}$ converges to a global minimizer with $\mathbf{v}=\mathbf{v}^{*}$ and $\theta(\mathbf{w},\mathbf{w}^{*})=0$.
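As a numerical sanity check of the descent claim (our own sketch, not part of the paper's analysis, written for the sample loss $\ell(\mathbf{v},\mathbf{w};\mathbf{Z})=\frac12\big(\mathbf{v}^{\top}\sigma(\mathbf{Z}\mathbf{w})-(\mathbf{v}^{*})^{\top}\sigma(\mathbf{Z}\mathbf{w}^{*})\big)^{2}$ stated at the start of this section), the following Monte Carlo estimate of $\mathbb{E}_{\mathbf{Z}}[\mathbf{g}]$ illustrates that a small step along $-\mathbb{E}_{\mathbf{Z}}[\mathbf{g}]$ decreases the estimated population loss:

```python
import torch

torch.manual_seed(0)
m, n, N = 4, 8, 200_000                      # network width, input dimension, Monte Carlo samples
sigma = lambda z: (z > 0).float()            # binarized ReLU (b_a = 1, alpha = 1)

v_star = torch.randn(m)
w_star = torch.nn.functional.normalize(torch.randn(n), dim=0)
v = v_star + 0.5 * torch.randn(m)                                  # student second layer
w = torch.nn.functional.normalize(w_star + torch.randn(n), dim=0)  # student first layer

Z = torch.randn(N, m, n)                     # i.i.d. N(0,1) data

def residual(v, w):
    return sigma(Z @ w) @ v - sigma(Z @ w_star) @ v_star           # shape (N,)

def population_loss(v, w):
    return 0.5 * residual(v, w).pow(2).mean()

# Expected coarse gradient w.r.t. w: the ReLU subderivative 1_{Zw > 0} replaces sigma' in (19)/(20).
r = residual(v, w)
g = torch.einsum('kmn,km,k->n', Z, sigma(Z @ w) * v, r) / N

step = 0.05
print(population_loss(v, w).item(),
      population_loss(v, w - step * g / g.norm()).item())  # typically smaller: -E[g] descends
```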
6 Concluding Remarks
We introduced the concept of coarse gradient for the activation quantization problem of DNNs, for which the a.e. gradient is inapplicable. The coarse gradient is generally not a gradient, but an artificial ascent direction. We further proposed the BCGD algorithm for training fully quantized neural networks. The weight update of BCGD goes by coarse gradient correction of a weighted average of the float weights and their quantization, which yields sufficient descent in the objective and thus acceleration. Our experiments demonstrated that BCGD is very effective for quantization at extremely low bit-width such as binarization. Finally, we analyzed the coarse gradient descent for a two-layer neural network model with Gaussian input data, and proved that the expected coarse gradient essentially correlates positively with the underlying true gradient.
Acknowledgements. This work was partially supported by NSF grants DMS-1522383 and IIS-1632935; ONR grant N00014-18-1-2527; AFOSR grant FA9550-18-0167; DOE grant DE-SC0013839; and STROBE STC NSF grant DMR-1548924.
Conflict of Interest Statement
On behalf of all authors, the corresponding author states that there is no conflict of interest.
References
 (1) Bengio, Y., Léonard, N., Courville, A.: Estimating or propagating gradients through stochastic neurons for conditional computation. arXiv preprint arXiv:1308.3432 (2013)
 (2) Bertsekas, D.P.: Nonlinear programming. Athena scientific Belmont (1999)
 (3) Brutzkus, A., Globerson, A.: Globally optimal gradient descent for a ConvNet with Gaussian inputs. arXiv preprint arXiv:1702.07966 (2017)

 (4) Cai, Z., He, X., Sun, J., Vasconcelos, N.: Deep learning with low precision by half-wave Gaussian quantization. In: IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (2017)
 (5) CarreiraPerpinán, M.: Model compression as constrained optimization, with application to neural nets. part i: General framework. arXiv preprint arXiv:1707.01209 (2017)
 (6) Choi, J., Wang, Z., Venkataramani, S., Chuang, P.I.J., Srinivasan, V., Gopalakrishnan, K.: Pact: Parameterized clipping activation for quantized neural networks. arXiv preprint arXiv:1805.06085 (2018)
 (7) Combettes, P.L., Pesquet, J.C.: Stochastic approximations and perturbations in forwardbackward splitting for monotone operators. Pure and Applied Functional Analysis 1, 13–37 (2016)
 (8) Courbariaux, M., Bengio, Y., David, J.: Binaryconnect: Training deep neural networks with binary weights during propagations. In: Advances in Neural Information Processing Systems (NIPS), p. 3123–3131 (2015)
 (9) Deng, J., Dong, W., Socher, R., Li, L., Li, K., Li, F.: Imagenet: A largescale hierarchical image database. In: IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 248–255 (2009)
 (10) Du, S.S., Lee, J.D., Tian, Y., Poczos, B., Singh, A.: Gradient descent learns onehiddenlayer cnn: Don’t be afraid of spurious local minimum. arXiv preprint arXiv:1712.00779 (2018)
 (11) Freund, Y., Schapire, R.E.: Large margin classification using the perceptron algorithm. Machine learning 37(3), 277–296 (1999)
 (12) Gilbert, J.C., Nocedal, J.: Global convergence properties of conjugate gradient methods for optimization. SIAM Journal on Optimization 2(1), 21–42 (1992)
 (13) He, J., Li, L., Xu, J., Zheng, C.: ReLU deep neural networks and linear finite elements. arXiv preprint arXiv:1807.03973 (2018)
 (14) He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. arXiv preprint arXiv:1512.03385 (2015)
 (15) He, K., Zhang, X., Ren, S., Sun, J.: Delving deep into rectifiers: Surpassing humanlevel performance on imagenet classification. In: IEEE International Conference on Computer Vision (ICCV) (2015)
 (16) Hinton, G.: Neural networks for machine learning, coursera. Coursera, video lectures (2012)
 (17) Hubara, I., Courbariaux, M., Soudry, D., El-Yaniv, R., Bengio, Y.: Binarized neural networks: Training neural networks with weights and activations constrained to +1 or -1. arXiv preprint arXiv:1602.02830 (2016)
 (18) Hubara, I., Courbariaux, M., Soudry, D., ElYaniv, R., Bengio, Y.: Quantized neural networks: Training neural networks with low precision weights and activations. Journal of Machine Learning Research 18, 1–30 (2018)
 (19) Ioffe, S., Szegedy, C.: Batch normalization: Accelerating deep network training by reducing internal covariate shift. arXiv preprint arXiv:1502.03167 (2015)
 (20) Krizhevsky, A.: Learning multiple layers of features from tiny images. Tech Report (2009)

 (21) Krizhevsky, A., Sutskever, I., Hinton, G.: ImageNet classification with deep convolutional neural networks. In: Advances in Neural Information Processing Systems (NIPS), pp. 1097–1105 (2012)
 (22) Li, F., Zhang, B., Liu, B.: Ternary weight networks. arXiv preprint arXiv:1605.04711 (2016)
 (23) Li, H., De, S., Xu, Z., Studer, C., Samet, H., Goldstein, T.: Training quantized nets: A deeper understanding. In: NIPS, pp. 5813–5823 (2017)
 (24) Li, Y., Yuan, Y.: Convergence analysis of twolayer neural networks with relu activation. In: Advances in Neural Information Processing Systems, pp. 597–607 (2017)
 (25) Lloyd, S.: Least squares quantization in PCM. IEEE Trans. Info. Theory 28, 129–137 (1982)
 (26) Park, E., Ahn, J., Yoo, S.: Weightedentropybased quantization for deep neural networks. In: IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 5456–5464 (2017)
 (27) Paszke, A., Gross, S., Chintala, S., Chanan, G., Yang, E., DeVito, Z., Lin, Z., Desmaison, A., Antiga, L., Lerer, A.: Automatic differentiation in pytorch. Tech Report (2017)

 (28) Rastegari, M., Ordonez, V., Redmon, J., Farhadi, A.: XNOR-Net: ImageNet classification using binary convolutional neural networks. In: European Conference on Computer Vision (ECCV) (2016)
 (29) Rosenblatt, F.: The perceptron, a perceiving and recognizing automaton (Project Para). Cornell Aeronautical Laboratory (1957)
 (30) Rosenblatt, F.: Principles of neurodynamics. Spartan Book (1962)
 (31) Simonyan, K., Zisserman, A.: Very deep convolutional networks for largescale image recognition. arXiv preprint arXiv:1409.1556 (2015)
 (32) Tian, Y.: An analytical formula of population gradient for twolayered relu network and its applications in convergence and critical point analysis. arXiv preprint arXiv:1703.00560 (2017)
 (33) Wang, B., Luo, X., Li, Z., Zhu, W., Shi, Z., Osher, S.J.: Deep neural nets with interpolating function as output activation. arXiv preprint arXiv:1802.00168 (2018)

 (34) Widrow, B., Lehr, M.A.: 30 years of adaptive neural networks: perceptron, madaline, and backpropagation. Proceedings of the IEEE 78(9), 1415–1442 (1990)
 (35) Yin, P., Zhang, S., Lyu, J., Osher, S., Qi, Y., Xin, J.: BinaryRelax: A relaxation approach for training deep neural networks with quantized weights. arXiv preprint arXiv:1801.06313; SIAM Journal on Imaging Sciences, to appear (2018)
 (36) Yin, P., Zhang, S., Qi, Y., Xin, J.: Quantization and training of low bitwidth convolutional neural networks for object detection. arXiv preprint arXiv:1612.06052; J. Comput. Math., to appear (2018)
 (37) Zhou, S., Wu, Y., Ni, Z., Zhou, X., Wen, H., Zou, Y.: DoReFa-Net: Training low bit-width convolutional neural networks with low bit-width gradients. arXiv preprint arXiv:1606.06160 (2016)
Appendix
A. Additional Preliminaries
Lemma 6.
Let $\mathbf{z}\in\mathbb{R}^{n}$ be a Gaussian random vector with entries i.i.d. sampled from $\mathcal{N}(0,1)$. Given nonzero vectors $\mathbf{w}$ and $\tilde{\mathbf{w}}$ with angle $\theta$ between them, we have
$$\mathbb{E}\big[\mathbb{1}_{\{\mathbf{z}^{\top}\mathbf{w}>0\}}\,\mathbb{1}_{\{\mathbf{z}^{\top}\tilde{\mathbf{w}}>0\}}\big]=\frac{\pi-\theta}{2\pi},$$
together with companion identities for the first moments $\mathbb{E}\big[\mathbf{z}\,\mathbb{1}_{\{\mathbf{z}^{\top}\mathbf{w}>0\}}\big]$ and $\mathbb{E}\big[\mathbf{z}\,\mathbb{1}_{\{\mathbf{z}^{\top}\mathbf{w}>0\}}\mathbb{1}_{\{\mathbf{z}^{\top}\tilde{\mathbf{w}}>0\}}\big]$ used in section 5.
Proof.
The third identity was proved in Lemma A.1 of Lee . To show the first one, note that since the Gaussian distribution is rotation-invariant, we may assume without loss of generality that $\mathbf{w}$ and $\tilde{\mathbf{w}}$ lie in the plane spanned by the first two coordinates with angle $\theta$ between them. It is easy to see that
$$\mathbb{E}\big[\mathbb{1}_{\{\mathbf{z}^{\top}\mathbf{w}>0\}}\,\mathbb{1}_{\{\mathbf{z}^{\top}\tilde{\mathbf{w}}>0\}}\big]=\frac{\pi-\theta}{2\pi},$$
which is the probability that $\mathbf{z}$ forms an acute angle with both $\mathbf{w}$ and $\tilde{\mathbf{w}}$.

To prove the last identity, we use the polar representation of the two-dimensional Gaussian random variable, with radius $r$ and angle $\phi$, where $r$ and $\phi$ are independent, $\phi$ is uniformly distributed on $[0,2\pi)$, and $r$ has density $r\,e^{-r^{2}/2}$ for $r\ge0$. The last identity then follows by integrating over this polar representation.