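1. Semantic segmentation

Image semantic segmentation classifies every pixel: it has to locate object boundaries and also recognize object classes. In a deep network, shallow layers produce concrete features that retain more detail and suit localization, while deep layers produce abstract features with more expressive power that suit classification. A semantic segmentation algorithm therefore has to balance these two needs.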

[Figure: semantic segmentation]

Every DeepLab version up to and including v3+ uses pixel-wise cross-entropy as the loss function.
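As a minimal sketch of what pixel-wise cross-entropy means in practice, the snippet below applies PyTorch's `F.cross_entropy` to a logit map of shape (N, C, H, W) and an integer label map of shape (N, H, W). The helper name `pixelwise_cross_entropy` and the `ignore_index=255` convention for unlabeled pixels are assumptions made for illustration.

```python
import math

import torch
import torch.nn.functional as F


def pixelwise_cross_entropy(logits, target, ignore_index=255):
    """Average the cross-entropy over all pixels that carry a valid label.

    logits: float tensor of shape (N, C, H, W), one score per class and pixel.
    target: long tensor of shape (N, H, W) holding class indices.
    ignore_index: assumed marker for unlabeled pixels, excluded from the mean.
    """
    return F.cross_entropy(logits, target, ignore_index=ignore_index)


# All-zero logits give every class the same probability 1/C at every pixel.
logits = torch.zeros(2, 3, 4, 4)
target = torch.randint(0, 3, (2, 4, 4))
loss = pixelwise_cross_entropy(logits, target)
print(loss.item(), math.log(3))
```

With uniform predictions over C classes the loss equals ln C whatever the labels are, which makes a convenient sanity check for a freshly initialized network.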

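2. Conditional random field (CRF)

Tasks such as image segmentation, detection, and reconstruction produce a prediction with the same size as the input image. A conditional random field can adjust that prediction according to the pixel values of the original image so that it becomes more plausible.
A CRF can be used for pixel-wise label prediction. It treats each pixel's label as a random variable and the relations between pixels as edges, which together form a conditional random field. The pixels come from the input image and serve as the global observation; the CRF models the labels conditioned on that observation and thereby refines them.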
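For reference, the fully connected CRF implemented by `pydensecrf` (following Krähenbühl and Koltun) is usually written as the energy below. The notation is introduced here for illustration and is not from the original text: $x_i$ is the label of pixel $i$, $p_i$ its position, $I_i$ its color, $\psi_u$ the unary term derived from the coarse prediction, and $\mu$ the label compatibility, assumed here to be the Potts model.

```latex
E(\mathbf{x}) = \sum_i \psi_u(x_i) + \sum_{i<j} \psi_p(x_i, x_j)

\psi_p(x_i, x_j) = \mu(x_i, x_j) \left[
    w_1 \exp\!\left( -\frac{\lVert p_i - p_j \rVert^2}{2\theta_\alpha^2}
                     -\frac{\lVert I_i - I_j \rVert^2}{2\theta_\beta^2} \right)
  + w_2 \exp\!\left( -\frac{\lVert p_i - p_j \rVert^2}{2\theta_\gamma^2} \right)
\right]
```

The first kernel pushes nearby pixels with similar color toward the same label. The second only enforces spatial smoothness. In the listing that follows, `sxy`/`sdims` and `srgb`/`schan` appear to play the role of the $\theta$ values, and `compat` the role of the weights.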

import numpy as np
import cv2
import pydensecrf.densecrf as dcrf
from pydensecrf.utils import unary_from_labels, create_pairwise_bilateral, create_pairwise_gaussian


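def create_CRF2D(src, labels, N, zero_uncertainty):
    # One label per pixel is required.
    assert (src.shape[0] * src.shape[1] == len(labels))
    # DenseCRF2D takes (width, height, number of classes).
    model = dcrf.DenseCRF2D(src.shape[1], src.shape[0], N)
    # gt_prob is the confidence in the input labels; it has to exceed 1/N
    # for the unary term to favor them.
    unary = unary_from_labels(labels, N, gt_prob=0.2, zero_unsure=zero_uncertainty is True)
    # pydensecrf exposes setUnaryEnergy; there is no setUnary method.
    model.setUnaryEnergy(unary)
    # Appearance kernel: nearby pixels with similar color prefer the same label.
    model.addPairwiseBilateral(sxy=(80, 80),
                               srgb=(13, 13, 13),
                               rgbim=src,
                               compat=10,
                               kernel=dcrf.DIAG_KERNEL,
                               normalization=dcrf.NORMALIZE_SYMMETRIC)
    return model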


def create_CRF(src, labels, N, zero_uncertainty):
    # One label per pixel is required.
    assert (src.shape[0] * src.shape[1] == len(labels))
    # The generic DenseCRF takes (number of pixels, number of classes).
    model = dcrf.DenseCRF(src.shape[1] * src.shape[0], N)
    # gt_prob is the confidence in the input labels; it has to exceed 1/N
    # for the unary term to favor them.
    unary = unary_from_labels(labels, N, gt_prob=0.5, zero_unsure=zero_uncertainty is True)
    model.setUnaryEnergy(unary)
    # Smoothness kernel: depends on pixel positions only.
    feats1 = create_pairwise_gaussian(sdims=(3, 3), shape=src.shape[:2])
    model.addPairwiseEnergy(feats1,
                            compat=8,
                            kernel=dcrf.DIAG_KERNEL,
                            normalization=dcrf.NORMALIZE_SYMMETRIC)
    # Appearance kernel: depends on pixel positions and colors.
    feats2 = create_pairwise_bilateral(sdims=(80, 80), schan=(13, 13, 13), img=src, chdim=2)
    model.addPairwiseEnergy(feats2,
                            compat=10,
                            kernel=dcrf.DIAG_KERNEL,
                            normalization=dcrf.NORMALIZE_SYMMETRIC)
    return model


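def CRFs(src_path, cursory_path, use_2d=False, zero_uncertainty=True):
    # 1. Pack the three-channel coarse prediction into one integer per pixel,
    #    so that every distinct color becomes one label.
    cursory = cv2.imread(cursory_path)
    cursory = cv2.cvtColor(cursory, cv2.COLOR_BGR2RGB)
    cursory = cursory.astype(np.uint32)
    cursory = cursory[:, :, 0] + (cursory[:, :, 1] << 8) + (cursory[:, :, 2] << 16)
    # 2. colors: the sorted unique packed colors. labels: for each pixel, the
    #    index of its color in colors. Flatten first so that labels is 1-D
    #    regardless of the NumPy version.
    colors, labels = np.unique(cursory.ravel(), return_inverse=True)
    # Black (0) marks "unsure" pixels; without black pixels there is nothing to ignore.
    if 0 not in colors:
        zero_uncertainty = False
    if zero_uncertainty:
        colors = colors[1:]
    # 3. Unpack the unique colors into RGB triples. After step 2, no two rows
    #    of this palette are the same.
    cursory = np.empty((len(colors), 3), np.uint8)
    cursory[:, 0] = (colors & 0x0000FF)
    cursory[:, 1] = (colors & 0x00FF00) >> 8
    cursory[:, 2] = (colors & 0xFF0000) >> 16
    # 4. Count the classes in the coarse prediction (excluding "unsure").
    N = len(set(labels.flat))
    if zero_uncertainty:
        N -= 1
    # 5. Read the original image.
    src = cv2.imread(src_path)
    # src = cv2.cvtColor(src, cv2.COLOR_BGR2RGB)
    # 6. Build the model.
    if use_2d:
        model = create_CRF2D(src, labels, N, zero_uncertainty)
    else:
        model = create_CRF(src, labels, N, zero_uncertainty)
    # 7. Run inference and map each pixel's most likely label back to its color.
    Q = model.inference(10)
    MAP = np.argmax(Q, axis=0)
    MAP = cursory[MAP, :]
    res = MAP.reshape(src.shape)
    # The palette is RGB; convert back to BGR, which cv2.imwrite expects.
    return cv2.cvtColor(res, cv2.COLOR_RGB2BGR)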


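if __name__ == '__main__':
    # Original (natural) image.
    path1 = '1.jpg'
    # Coarse network prediction to be refined by the CRF. A lossless format
    # such as PNG avoids compression artifacts being counted as extra classes.
    path2 = '2.jpg'
    dst = CRFs(path1, path2)
    cv2.imwrite('dst.jpg', dst)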

3. v1

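The backbone is based on VGG-16. Atrous (dilated) convolution replaces the last two downsampling steps, which lowers output_stride from 32 to 8 and preserves more detail.
Multi-scale classifiers and a CRF are used to improve the results; using the CRF alone turned out to work better than using the multi-scale classifiers alone.

The figure above shows DeepLab v1 with multi-scale classifiers. DeepLab v1 with a CRF is simpler and is not shown.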
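The output_stride arithmetic can be sketched in a few lines of plain Python. The helper `atrous_schedule` is hypothetical: it walks over the downsampling steps of a backbone, keeps them until the target output_stride is reached, and from then on sets the stride to 1 and multiplies the dilation rate of the following convolutions instead.

```python
def atrous_schedule(strides, target_output_stride):
    """Plan which downsampling steps to keep so the total stride stays at the target.

    Returns a list of (stride, dilation) pairs, one per downsampling step, and
    the resulting output_stride. `dilation` is the rate for the convolutions
    that follow that step.
    """
    output_stride = 1
    dilation = 1
    plan = []
    for s in strides:
        if output_stride * s > target_output_stride:
            # Drop this downsampling and enlarge the dilation instead, so the
            # following layers keep the receptive field they would have had.
            dilation *= s
            plan.append((1, dilation))
        else:
            output_stride *= s
            plan.append((s, dilation))
    return plan, output_stride


# VGG-16 has five stride-2 pooling steps, i.e. a total stride of 32.
plan, output_stride = atrous_schedule([2, 2, 2, 2, 2], target_output_stride=8)
print(plan)           # [(2, 1), (2, 1), (2, 1), (1, 2), (1, 4)]
print(output_stride)  # 8
```

The last two entries show the two removed downsampling steps: the layers after them run with dilation 2 and 4 respectively.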

4. v2

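Both VGG-16 and ResNet-101 were tried as backbones. Atrous convolution is still used to reduce the number of downsampling steps, giving an output_stride of 8 or 16.
It introduces the ASPP (atrous spatial pyramid pooling) structure, which consists of four parallel convolutions with dilation rates $6\times (1,2,3,4)$, i.e. 6, 12, 18, and 24.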
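A rough sketch of this v2-style ASPP in PyTorch follows. It is a simplification under stated assumptions: each branch is collapsed to a single 3×3 atrous convolution that directly emits class scores, and the branch outputs are fused by summation. The class name `ASPPv2` and the channel sizes are illustrative only.

```python
import torch
from torch import nn


class ASPPv2(nn.Module):
    """Parallel 3x3 atrous convolutions whose class scores are summed."""

    def __init__(self, in_channels, num_classes, rates=(6, 12, 18, 24)):
        super().__init__()
        # padding == dilation keeps the spatial size unchanged for a 3x3 kernel.
        self.branches = nn.ModuleList([
            nn.Conv2d(in_channels, num_classes, kernel_size=3, padding=r, dilation=r)
            for r in rates
        ])

    def forward(self, x):
        # Every branch sees the same input at a different field of view.
        return sum(branch(x) for branch in self.branches)


x = torch.randn(1, 32, 33, 33)
scores = ASPPv2(in_channels=32, num_classes=21)(x)
print(scores.shape)  # torch.Size([1, 21, 33, 33])
```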

5. v3

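5.1 Cascaded atrous convolution

A backbone with cascaded atrous convolutions. ResNet serves as the backbone, whose standard overall stride is 32. With ResNet-18 or 34, the last two downsampling steps are removed, so output_stride=8, and two cascaded atrous convolutions follow to extract multi-scale features. With ResNet-50 or deeper, only the last downsampling step is removed, so output_stride=16, and one atrous convolution follows to extract multi-scale features.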
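The effect of swapping a downsampling stage for cascaded atrous convolutions can be checked on tensor shapes. This is a minimal sketch: `atrous_block`, the channel count, and the rates 2 and 4 are assumptions chosen for illustration, not the exact configuration of any DeepLab release.

```python
import torch
from torch import nn


def atrous_block(channels, rate):
    """3x3 conv + BN + ReLU with dilation `rate`; padding=rate preserves the size."""
    return nn.Sequential(
        nn.Conv2d(channels, channels, 3, stride=1, padding=rate, dilation=rate, bias=False),
        nn.BatchNorm2d(channels),
        nn.ReLU(inplace=True),
    )


x = torch.randn(2, 16, 32, 32)

# A regular stride-2 stage halves the resolution.
downsample = nn.Conv2d(16, 16, 3, stride=2, padding=1)
down_out = downsample(x)
print(down_out.shape)     # torch.Size([2, 16, 16, 16])

# Cascaded atrous blocks keep the resolution while the receptive field grows.
cascade = nn.Sequential(atrous_block(16, rate=2), atrous_block(16, rate=4))
cascade_out = cascade(x)
print(cascade_out.shape)  # torch.Size([2, 16, 32, 32])
```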

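5.2 Improved ASPP

BN layers are added to ASPP. Its input has shape (512, L/8) and its output has shape (256, L/8). Every convolution in ASPP is followed by a BN layer and a ReLU activation, and the module consists of 5 parallel branches:

One convolution with kernel size 1.
Three convolutions with kernel size 3 and dilation rates $6\times (1,2,3)$, i.e. 6, 12, and 18.
An image-level feature branch: global average pooling yields a feature of shape (512, 1), a CBR (convolution + BN + ReLU) with kernel size 1 and stride 1 yields a feature of shape (256, 1), and upsampling yields a feature of shape (256, L/8).

These 5 branches produce 5 features of shape (256, L/8), which are concatenated along the channel dimension into a feature of shape (1280, L/8). In the DeepLab v3 paper, a further 1×1 convolution then projects the concatenated feature back to 256 channels.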

import torch
import numpy as np
from torch import nn

class CBR(nn.Sequential):
    def __init__(self, in_planes, out_planes, kernel_size=3, stride=1, padding=0, dilation=1, groups=1):
        if padding == 0:
            padding = (kernel_size - 1) // 2
        super(CBR, self).__init__(
            nn.Conv2d(in_planes, out_planes, kernel_size, stride, padding, dilation, groups=groups, bias=False),
            nn.BatchNorm2d(out_planes),
            nn.ReLU6(inplace=True)
        )

class ASPPPooling(nn.Sequential):
    def __init__(self, in_channels, out_channels):
        super(ASPPPooling, self).__init__(
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(in_channels, out_channels, kernel_size=1, bias=False),
            nn.BatchNorm2d(out_channels),
            nn.ReLU(),
            nn.Upsample(size=(64, 64), mode='nearest')
        )

class ASPP(nn.Module):
    def __init__(self, in_ch, out_ch, rates):
        super(ASPP, self).__init__()
        self.stages = nn.Module()
        self.stages.add_module("c0", CBR(in_ch, out_ch, 1, 1, 0, 1))
        for i, rate in enumerate(rates):
            self.stages.add_module("c{}".format(i + 1), CBR(in_ch, out_ch, 3, 1, padding=rate, dilation=rate), )
        self.stages.add_module("imagepool", ASPPPooling(in_ch, out_ch))

    def forward(self, x):
        return torch.cat([stage(x) for stage in self.stages.children()], dim=1)

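if __name__ == '__main__':
    # A batch of two 2048-channel feature maps of size 64x64.
    x = torch.tensor(np.random.random(size=(2, 2048, 64, 64)), dtype=torch.float32)
    net = ASPP(2048, 256, [6, 12, 18])
    output = net(x)
    # Five branches with 256 channels each: torch.Size([2, 1280, 64, 64])
    print(output.shape)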

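The Upsample above resizes the 1×1 pooled feature to 64×64, so the module only works when the input feature map is 64×64. In training mode the batch size must be greater than 1; otherwise the BatchNorm layer after global pooling sees a single value per channel and raises an error.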
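One way around both restrictions, sketched here under assumptions, is to resize to the input's own spatial size inside `forward` and to switch to evaluation mode when a single sample is processed. The class name `DynamicASPPPooling` is hypothetical, and bilinear interpolation is a choice made for this sketch.

```python
import torch
import torch.nn.functional as F
from torch import nn


class DynamicASPPPooling(nn.Module):
    """Image-level branch that resizes back to whatever size the input had."""

    def __init__(self, in_channels, out_channels):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d(1)
        self.project = nn.Sequential(
            nn.Conv2d(in_channels, out_channels, kernel_size=1, bias=False),
            nn.BatchNorm2d(out_channels),
            nn.ReLU(inplace=True),
        )

    def forward(self, x):
        size = x.shape[-2:]                  # remember the input's height and width
        y = self.project(self.pool(x))       # (N, out_channels, 1, 1)
        return F.interpolate(y, size=size, mode="bilinear", align_corners=False)


branch = DynamicASPPPooling(16, 4)
branch.eval()  # BatchNorm now uses running statistics, so one sample is enough
with torch.no_grad():
    out = branch(torch.randn(1, 16, 10, 12))
print(out.shape)  # torch.Size([1, 4, 10, 12])
```

In training mode the same call with a batch of one would still fail in BatchNorm, so the batch-size requirement only disappears at inference time.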

6. v3+

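v3 upsamples the prediction at output_stride 8 or 16 directly to the original image size.
v3+ fixes output_stride at 16 and uses an encoder-decoder structure: the result at output_stride 16 is upsampled to output_stride 4, concatenated with the encoder feature of the same size, and then upsampled to the original image size.
It also adopts the Xception network and depthwise separable convolutions.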
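The decoder path described above can be sketched as follows. This is not the reference implementation: the class name `Decoder`, the reduction of the low-level feature to 48 channels, the 256-channel fusion convolution, and the bilinear interpolation are assumptions made for this sketch.

```python
import torch
import torch.nn.functional as F
from torch import nn


class Decoder(nn.Module):
    """Fuse a stride-16 encoder output with a stride-4 low-level feature."""

    def __init__(self, low_channels, high_channels, num_classes, reduced=48, mid=256):
        super().__init__()
        # A 1x1 conv shrinks the low-level feature so it does not dominate the fusion.
        self.reduce = nn.Sequential(
            nn.Conv2d(low_channels, reduced, 1, bias=False),
            nn.BatchNorm2d(reduced),
            nn.ReLU(inplace=True),
        )
        self.fuse = nn.Sequential(
            nn.Conv2d(high_channels + reduced, mid, 3, padding=1, bias=False),
            nn.BatchNorm2d(mid),
            nn.ReLU(inplace=True),
        )
        self.classifier = nn.Conv2d(mid, num_classes, 1)

    def forward(self, low, high, image_size):
        # Step 1: upsample from output_stride 16 to the low-level feature's size.
        high = F.interpolate(high, size=low.shape[-2:], mode="bilinear", align_corners=False)
        # Step 2: concatenate along channels and refine.
        x = self.classifier(self.fuse(torch.cat([self.reduce(low), high], dim=1)))
        # Step 3: upsample the class scores to the original image size.
        return F.interpolate(x, size=image_size, mode="bilinear", align_corners=False)


decoder = Decoder(low_channels=24, high_channels=64, num_classes=21).eval()
low = torch.randn(1, 24, 32, 32)  # stride-4 encoder feature for a 128x128 image
high = torch.randn(1, 64, 8, 8)   # stride-16 encoder output
with torch.no_grad():
    logits = decoder(low, high, image_size=(128, 128))
print(logits.shape)  # torch.Size([1, 21, 128, 128])
```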

[Figure: DeepLab v3+]