Problem

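As a Keras beginner, I was not familiar with how GPU memory is managed during training, and recently I kept hitting ResourceExhaustedError while training a VGG16 model. At first I was baffled, because the GPU on the server has about 12 GB of memory, which should normally be enough.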

Judging from the error message alone, it is easy to assume the problem lies in the convolution layer's parameters:

tensorflow.python.framework.errors_impl.ResourceExhaustedError: OOM when allocating tensor with shape[1024,32,224,224] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc
     [[Node: conv1_1/convolution = Conv2D[T=DT_FLOAT, _class=["loc:@gradients/conv1_1/convolution_grad/Conv2DBackpropFilter"], data_format="NCHW", dilations=[1, 1, 1, 1], padding="SAME", strides=[1, 1, 1, 1], use_cudnn_on_gpu=true, _device="/job:localhost/replica:0/task:0/device:GPU:0"](gradients/conv1_1/convolution_grad/Conv2DBackpropFilter-0-TransposeNHWCToNCHW-LayoutOptimizer, conv1_1/kernel/read)]]

Current usage from device: /job:localhost/replica:0/task:0/device:GPU:0, allocator: GPU_0_bfc
  588.00MiB from gradients/conv1_1/convolution_grad/Conv2DBackpropFilter-0-TransposeNHWCToNCHW-LayoutOptimizer
  12.25MiB from mul_168
  12.25MiB from add_109
  12.25MiB from mul_164
  576.0KiB from mul_108
  576.0KiB from add_69
  576.0KiB from mul_104
  576.0KiB from mul_120
  576.0KiB from add_77
  576.0KiB from mul_116
  576.0KiB from mul_132
  576.0KiB from add_85
  576.0KiB from mul_128
  Remaining 85 nodes with 6.21MiB

  ······
  ······

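In fact, the cause is the amount of training data placed on the GPU at once. Training the model requires a large volume of data, and in my case, training with model.fit sent all of the in-memory training data to GPU memory, filling it up.
The solution is to use model.fit_generator, which feeds the training data to the training process on the GPU batch by batch.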

Note: with 12 GB of GPU memory, the size of each batch (batch_size) is best set to about 128 to 256; anything much larger will still trigger the error.
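A rough estimate shows why the batch size matters so much. The sketch below computes the size of the tensor named in the error message, assuming 4 bytes per float32 element and reading the first dimension as the batch size (the log reports data_format="NCHW"). The helper name tensor_mib and the 3-channel input shape are assumptions made for illustration; they are not part of the original code.

```python
def tensor_mib(shape, bytes_per_element=4):
    """Size in MiB of a dense tensor; 4 bytes per element for float32."""
    count = 1
    for dim in shape:
        count *= dim
    return count * bytes_per_element / 2**20


# Shape from the OOM message: 1024 samples, 32 channels, 224 x 224 pixels.
print(tensor_mib((1024, 32, 224, 224)))  # 6272.0 MiB, roughly 6.1 GiB

# Assumed input batch of 1024 RGB images; this matches the 588.00MiB
# entry listed under "Current usage from device" in the log.
print(tensor_mib((1024, 3, 224, 224)))   # 588.0

# The same activation at the batch sizes suggested in the note.
for batch_size in (128, 256):
    print(batch_size, tensor_mib((batch_size, 32, 224, 224)))
```

A single activation of this layer already takes about half of a 12 GB card at a batch of 1024, before counting gradients and the other layers, while at 128 to 256 it shrinks to roughly 0.8 to 1.5 GiB.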

Code example

Training code:

# data_loader is a custom class that provides the generator required by model.fit_generator
dl = data_loader()

history = model.fit_generator(dl.data_loader(input_shape=input_shape),    # training-set generator
  steps_per_epoch=messageConfig['train_batch_steps'],
  validation_steps=1, verbose=1, epochs=50, callbacks=[checkpoint],
  validation_data=dl.data_loader(input_shape=input_shape, option='val'))  # validation-set generator
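The post does not show how messageConfig['train_batch_steps'] is computed. A common choice, assumed here rather than taken from the original code, is the number of batches needed to cover the dataset once. The function name steps_for and the dataset sizes below are hypothetical.

```python
import math


def steps_for(num_samples, batch_size):
    """Number of generator batches needed to see every sample once."""
    return math.ceil(num_samples / batch_size)


# Hypothetical dataset sizes, for illustration only.
train_steps = steps_for(num_samples=10000, batch_size=128)
val_steps = steps_for(num_samples=1000, batch_size=128)
print(train_steps, val_steps)  # 79 8
```

With validation_steps=1, as in the call above, only one batch from the validation generator is evaluated per epoch; a value computed this way would cover the whole validation set instead.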

Generator:

Note: the generator's output must have the form (input, output) or ({'input_1': input_array_1, 'input_2': input_array_2}, {'output': output_array}).


class data_loader:
    ......

    def data_loader(self, input_shape=None, option='train', data_format='channels_last', *args):

        if input_shape:
            self.input_shape = input_shape

        H, W, C = self.input_shape[0], self.input_shape[1], self.input_shape[2]
        erste, zweite, dritte = H, W, C
        if data_format == 'channels_first':
            erste, zweite, dritte = C, H, W

        if option == 'train':
            if args:
                self.batch_size, self.val_size = args[0], args[1]
            return self.generate_train_batch_index(erste, zweite, dritte)

        elif option in ('val', 'test'):
            # The training code passes option='val', so accept it here as well
            return self.generate_val_batch_index(erste, zweite, dritte)

    def generate_train_batch_index(self, erste, zweite, dritte):
        while True:
            for i in range(self.train_batch_num):
                # Returns one batch of batch_size samples: (train_batch_image, train_batch_labels)
                data = self.get_next_train_batch(i, erste, zweite, dritte)
                yield data
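Since get_next_train_batch is not shown in the class above, here is a minimal self-contained sketch of the same pattern: an endless generator that yields one (input, output) tuple per batch. The name batch_generator and its parameters are hypothetical. It works on any sliceable sequence, so plain lists stand in for the NumPy arrays one would normally pass to Keras.

```python
def batch_generator(inputs, targets, batch_size):
    """Yield (input_batch, target_batch) tuples forever, one batch at a time."""
    if len(inputs) != len(targets):
        raise ValueError("inputs and targets must have the same length")
    num_samples = len(inputs)
    while True:  # fit_generator expects the generator to loop indefinitely
        for start in range(0, num_samples, batch_size):
            end = start + batch_size  # slicing clamps the final, shorter batch
            yield inputs[start:end], targets[start:end]


# Toy data standing in for images and labels.
gen = batch_generator(list(range(10)), list("abcdefghij"), batch_size=4)
for _ in range(4):
    print(next(gen))
```

The fourth batch printed is the first one again, which shows the generator wrapping around to start a new pass over the data. Only one batch is materialized at a time, which is what keeps memory use bounded.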

After all this tinkering, I can finally get back to happily playing with the model.

References

Tensorflow Deep MNIST: Resource exhausted: OOM when allocating tensor with shape[10000,32,28,28]

A concrete example for using data generator for large datasets such as ImageNet #1627