Object Detection: YOLOv5 Anchor Settings

1 Where anchors are stored

1.1 In the YAML configuration file, for example models/yolov5s.yaml

# anchors
anchors:
  - [10,13, 16,30, 33,23]  # P3/8
  - [30,61, 62,45, 59,119]  # P4/16
  - [116,90, 156,198, 373,326]  # P5/32
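
Each row of the anchors entry is a flat list of width/height values, three pairs per detection layer. A minimal sketch of grouping them into (width, height) pairs follows; the helper name to_pairs is hypothetical and used only for illustration, and the strides 8, 16, 32 are taken from the P3/8, P4/16, P5/32 comments.

```python
# Flat anchor lists as they appear in models/yolov5s.yaml (one list per detection layer).
anchors_cfg = [
    [10, 13, 16, 30, 33, 23],       # P3/8
    [30, 61, 62, 45, 59, 119],      # P4/16
    [116, 90, 156, 198, 373, 326],  # P5/32
]


def to_pairs(flat):
    """Group a flat [w0, h0, w1, h1, ...] list into (w, h) tuples (hypothetical helper)."""
    return list(zip(flat[0::2], flat[1::2]))


anchor_pairs = [to_pairs(layer) for layer in anchors_cfg]
for stride, pairs in zip((8, 16, 32), anchor_pairs):
    print(f"stride {stride}: {pairs}")
```

This gives 3 layers with 3 anchors each, 9 anchors in total.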

1.2 In the model file

How to read them

import torch
from models.experimental import attempt_load

model = attempt_load('./weights/best.pt', map_location=torch.device('cpu'))
m = model.module.model[-1] if hasattr(model, 'module') else model.model[-1]
print(m.anchor_grid)

Output

tensor([[[[[[ 10.,  13.]]],
          [[[ 16.,  30.]]],
          [[[ 33.,  23.]]]]],

        [[[[[ 30.,  61.]]],
          [[[ 62.,  45.]]],
          [[[ 59., 119.]]]]],

        [[[[[116.,  90.]]],
          [[[156., 198.]]],
          [[[373., 326.]]]]]])

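2 How anchors are computed automatically

2.1 Command line

Adding --noautoanchor to the training command tells train.py not to compute anchors and to use the default anchors from the configuration file directly. Without this flag, the automatic anchor check runs before training starts.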

train.py

if not opt.noautoanchor:
    check_anchors(dataset, model=model, thr=hyp['anchor_t'], imgsz=imgsz)

The dataset argument is the training set, and hyp['anchor_t'] is a hyperparameter read from the configuration file hyp.scratch.yaml.
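
The condition "if not opt.noautoanchor" suggests that the flag is a plain boolean switch. The sketch below assumes it is declared with argparse as a store_true option; that declaration is an assumption for illustration, not a copy of train.py.

```python
import argparse

parser = argparse.ArgumentParser()
# Assumed declaration: store_true makes the option False unless the flag is given.
parser.add_argument("--noautoanchor", action="store_true",
                    help="disable the automatic anchor check")

default_opt = parser.parse_args([])                # flag omitted
skip_opt = parser.parse_args(["--noautoanchor"])   # flag given

for opt in (default_opt, skip_opt):
    run_check = not opt.noautoanchor
    print(f"noautoanchor={opt.noautoanchor} -> run check_anchors: {run_check}")
```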

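2.2 How the computation works

2.2.1 What an anchor is

A bounding box annotated by a human (bounding box, sometimes abbreviated bbox or box) is the correct answer that a person gives the machine. It is called the ground truth bounding box, abbreviated gt bbox or gt box, where ground truth means known to be correct.
The boxes that the program generates in order to predict the correct gt boxes are called anchors.

Chinese renderings of ground truth include 基准真相, 地面实况, 上帝真相, and 地面真相; here, leaving the term untranslated is the best translation.
The term "ground truth" was coined in geology and earth science to describe verifying data by going into the field and checking "on the ground". Other fields have since adopted it to denote data that is "known" to be correct.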

2.2.2 Two metrics

bpr（best possible recall）
aat（anchors above threshold）

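Anchors are recomputed only when the bpr (best possible recall) obtained with the anchors from the configuration file is below 0.98.
The maximum value of bpr is 1. If bpr is below 0.98, the program learns the anchor sizes automatically from the labels of the dataset.

The check_anchors function contains a nested function: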

def metric(k):  # compute metric
    r = wh[:, None] / k[None]
    x = torch.min(r, 1. / r).min(2)[0]  # ratio metric
    best = x.max(1)[0]  # best_x
    aat = (x > 1. / thr).float().sum(1).mean()  # anchors above threshold
    bpr = (best > 1. / thr).float().mean()  # best possible recall
    return bpr, aat

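The gt boxes have shape (N, 2), where N is the total number of gt boxes and 2 stands for width and height. Inside metric, wh stores the width and height of every gt box.
YOLOv5 uses 9 anchors, and k in metric is the set of anchors read from the configuration (for example models/yolov5s.yaml).
metric computes the two metrics bpr and aat from k and wh.

Below, gt stands for wh and anchor stands for k.
wh (gt) has shape [N, 2]: N gt boxes, each with the 2 dimensions width and height.
k (anchor) has shape [9, 2]: the 9 YOLOv5 anchors, each with the 2 dimensions width and height.

r = wh[:, None] / k[None] works as follows.
The dimensions are expanded first:
wh[:, None] changes the shape from [N, 2] to [N, 1, 2]
k[None] changes the shape from [9, 2] to [1, 9, 2]
Broadcasting then divides every gt box by every anchor, component by component: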

gt_height / anchor_height
gt_width / anchor_width

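Each ratio may be greater than 1, or less than or equal to 1.
r has shape [N, 9, 2], where 9 is the number of YOLOv5 anchors.

x = torch.min(r, 1. / r).min(2)[0]
Whether r is greater than 1 or not, torch.min(r, 1. / r) folds it into a value no greater than 1, and .min(2)[0] then keeps the worse of the width and height ratios.
x has shape [N, 9].
best has shape [N]: the highest x over the 9 anchors for each gt box. A gt box counts toward bpr when best > 1 / thr.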
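
The shapes above can be checked on a small example. The following is a sketch that re-expresses metric in NumPy rather than torch; the function name metric_np, the three gt boxes, and thr=4.0 are assumptions chosen for illustration (the post only states that thr comes from hyp['anchor_t']).

```python
import numpy as np


def metric_np(k, wh, thr=4.0):
    """NumPy rendering of metric(); also returns x and best for inspection."""
    r = wh[:, None] / k[None]                  # [N, 9, 2] gt / anchor ratios
    x = np.minimum(r, 1.0 / r).min(axis=2)     # [N, 9] worse of the w and h ratios, <= 1
    best = x.max(axis=1)                       # [N] best anchor score per gt box
    aat = (x > 1.0 / thr).sum(axis=1).mean()   # anchors above threshold, averaged over gt boxes
    bpr = (best > 1.0 / thr).mean()            # share of gt boxes with at least one usable anchor
    return bpr, aat, x, best


# The 9 default anchors from models/yolov5s.yaml as (width, height).
k = np.array([[10, 13], [16, 30], [33, 23],
              [30, 61], [62, 45], [59, 119],
              [116, 90], [156, 198], [373, 326]], dtype=float)
# Hypothetical gt boxes (width, height) in pixels, made up for this example.
wh = np.array([[10, 13], [100, 100], [2000, 2000]], dtype=float)

bpr, aat, x, best = metric_np(k, wh, thr=4.0)
print(x.shape, best.shape)
print(f"bpr={bpr:.3f}, aat={aat:.3f}")
```

The 2000 x 2000 box is more than 4 times larger than every anchor, so no anchor clears the threshold for it and bpr comes out as 2/3. That is below 0.98, so with these made-up boxes the anchors would be recomputed.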

3 How the anchor configuration is used in the program

3.1 How the anchors change inside the program

tensor([[[ 1.25000,  1.62500],  # 10,13, 16,30, 33,23 with every value divided by 8
         [ 2.00000,  3.75000],
         [ 4.12500,  2.87500]],

        [[ 1.87500,  3.81250],  # 30,61, 62,45, 59,119 with every value divided by 16
         [ 3.87500,  2.81250],
         [ 3.68750,  7.43750]],

        [[ 3.62500,  2.81250],  # 116,90, 156,198, 373,326 with every value divided by 32
         [ 4.87500,  6.18750],
         [11.65625, 10.18750]]])
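
The tensor above can be reproduced by dividing each layer's pixel anchors by that layer's stride. This is a NumPy sketch of the arithmetic only; the variable names are assumptions and it is not taken from the YOLOv5 source.

```python
import numpy as np

# Pixel anchors from models/yolov5s.yaml, shape [3 layers, 3 anchors, 2 (w, h)].
anchors_px = np.array([
    [[10, 13], [16, 30], [33, 23]],
    [[30, 61], [62, 45], [59, 119]],
    [[116, 90], [156, 198], [373, 326]],
], dtype=float)
strides = np.array([8, 16, 32], dtype=float)

# Reshape strides to [3, 1, 1] so each layer is divided by its own stride.
anchors_grid = anchors_px / strides.reshape(-1, 1, 1)
print(anchors_grid)
```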

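3.2 How the contents of targets change

See the build_targets function. The example below walks through how the contents of targets change.
Each feature map has 3 anchors, and there are 3 feature maps in total.

3.2.1 Original image annotations

There are 2 images in total.
Human annotation for image 0:
Column 0 is the class, and the following 4 values are the box coordinates.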

45 0.479492 0.688771 0.955609 0.5955

Human annotation for image 1:

23 0.770336 0.489695 0.335891 0.697559

3.2.2 Original contents of targets

image_id, class_id, x_center_norm, y_center_norm, width_norm, height_norm

[[ 0.00000, 45.00000,  0.47949,  0.64158,  0.95561,  0.44662],
 [ 1.00000, 23.00000,  0.77034,  0.49314,  0.33589,  0.46431]]

3.2.3 targets after expanding dimensions

image_id, class_id, x_center_norm, y_center_norm, width_norm, height_norm, anchor_index

[[[ 0.00000, 45.00000,  0.47949,  0.64158,  0.95561,  0.44662,  0.00000],
  [ 1.00000, 23.00000,  0.77034,  0.49314,  0.33589,  0.46431,  0.00000]],

[[ 0.00000, 45.00000,  0.47949,  0.64158,  0.95561,  0.44662,  1.00000],
 [ 1.00000, 23.00000,  0.77034,  0.49314,  0.33589,  0.46431,  1.00000]],

[[ 0.00000, 45.00000,  0.47949,  0.64158,  0.95561,  0.44662,  2.00000],
 [ 1.00000, 23.00000,  0.77034,  0.49314,  0.33589,  0.46431,  2.00000]]]

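image_id: the image ID, in column 0. With 2 images it is either 0 or 1.
class_id: the class ID, in column 1. In the example above, the 23 and 45 from the annotation files match the values in targets.
x_center_norm, y_center_norm, width_norm, height_norm: the box coordinates, in normalized form.
anchor_index: the anchor indices, i.e. 0, 1, 2, in the last column.
A feature map has 3 anchors, the 2 images contain 2 objects in total, and there are 7 columns (image_id, class_id, x_center_norm, y_center_norm, width_norm, height_norm, anchor_index).
So the shape of targets is [3, 2, 7]. No mosaic or other data augmentation is used here, and the 2 is the total number of objects in the batch. If the 2 images contained 10 objects in total, the shape would be [3, 10, 7].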
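
The expansion from [2, 6] to [3, 2, 7] amounts to repeating the targets once per anchor and appending the anchor index as a new last column. The following NumPy sketch illustrates that idea under the assumption of 3 anchors per feature map; it is not the build_targets code itself.

```python
import numpy as np

# targets before expansion: image_id, class_id, x, y, w, h (all box values normalized).
targets = np.array([
    [0.0, 45.0, 0.47949, 0.64158, 0.95561, 0.44662],
    [1.0, 23.0, 0.77034, 0.49314, 0.33589, 0.46431],
])
na = 3                 # anchors per feature map
nt = targets.shape[0]  # total number of objects in the batch

# Anchor index grid of shape [na, nt]: row i is filled with i.
ai = np.tile(np.arange(na, dtype=float).reshape(na, 1), (1, nt))
# Repeat targets for every anchor and append the anchor index as column 6.
expanded = np.concatenate([np.tile(targets, (na, 1, 1)), ai[:, :, None]], axis=2)
print(expanded.shape)
print(expanded)
```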

Next comes the loop that processes each feature map in turn.

3.3 Mapping to the 80 * 80 feature map

anchor[0] contains three anchors:

[[1.25000, 1.62500],
[2.00000, 3.75000],
[4.12500, 2.87500]]

p[0] has shape [2, 3, 80, 80, 85].
gain is derived from p[0].
gain is [ 1., 1., 80., 80., 80., 80., 1.]

Result of mapping to the 80 * 80 feature map (t = targets * gain):

[[[ 0.00000, 45.00000, 38.35936, 51.32626, 76.44872, 35.73000,  0.00000],
[ 1.00000, 23.00000, 61.62688, 39.45126, 26.87128, 37.14502,  0.00000]],

[[ 0.00000, 45.00000, 38.35936, 51.32626, 76.44872, 35.73000,  1.00000],
[ 1.00000, 23.00000, 61.62688, 39.45126, 26.87128, 37.14502,  1.00000]],

[[ 0.00000, 45.00000, 38.35936, 51.32626, 76.44872, 35.73000,  2.00000],
[ 1.00000, 23.00000, 61.62688, 39.45126, 26.87128, 37.14502,  2.00000]]]
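
One plausible way to derive gain from the shape of p[0] is to copy the grid width and height into the four box columns and leave the other columns at 1. The sketch below uses NumPy and assumes the dimension order of p[0] is (batch, anchors, grid_h, grid_w, outputs); only the slice for anchor index 0 is shown, and the printed values differ slightly from the listing above because the inputs here are the 5-decimal rounded values.

```python
import numpy as np

p0_shape = (2, 3, 80, 80, 85)  # assumed order: batch, anchors, grid_h, grid_w, outputs

gain = np.ones(7)
# Columns 2..5 are x, y, w, h, so they take grid_w, grid_h, grid_w, grid_h.
gain[2:6] = np.array(p0_shape)[[3, 2, 3, 2]]

# targets for anchor index 0: image_id, class_id, x, y, w, h, anchor_index.
targets = np.array([
    [0.0, 45.0, 0.47949, 0.64158, 0.95561, 0.44662, 0.0],
    [1.0, 23.0, 0.77034, 0.49314, 0.33589, 0.46431, 0.0],
])
t = targets * gain  # box columns are now in grid cells; ids and anchor index are unchanged
print(gain)
print(t)
```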

3.4 Mapping to the 40 * 40 feature map

anchor[1] contains three anchors:

[[1.87500, 3.81250],
[3.87500, 2.81250],
[3.68750, 7.43750]]

p[1] has shape [2, 3, 40, 40, 85].
gain is derived from p[1].
gain is [ 1., 1., 40., 40., 40., 40., 1.]

Result of mapping to the 40 * 40 feature map (t = targets * gain):

[[[ 0.00000, 45.00000, 19.17968, 25.66313, 38.22436, 17.86500,  0.00000],
 [ 1.00000, 23.00000, 30.81344, 19.72563, 13.43564, 18.57251,  0.00000]],

[[ 0.00000, 45.00000, 19.17968, 25.66313, 38.22436, 17.86500,  1.00000],
 [ 1.00000, 23.00000, 30.81344, 19.72563, 13.43564, 18.57251,  1.00000]],

[[ 0.00000, 45.00000, 19.17968, 25.66313, 38.22436, 17.86500,  2.00000],
 [ 1.00000, 23.00000, 30.81344, 19.72563, 13.43564, 18.57251,  2.00000]]]

3.5 Mapping to the 20 * 20 feature map

anchor[2] contains three anchors:

[[ 3.62500,  2.81250],
[ 4.87500,  6.18750],
[11.65625, 10.18750]]

p[2] has shape [2, 3, 20, 20, 85].
gain is derived from p[2].
gain is [ 1., 1., 20., 20., 20., 20., 1.]

Result of mapping to the 20 * 20 feature map (t = targets * gain):

[[[ 0.00000, 45.00000,  9.58984, 12.83157, 19.11218,  8.93250,  0.00000],
 [ 1.00000, 23.00000, 15.40672,  9.86281,  6.71782,  9.28625,  0.00000]],

[[ 0.00000, 45.00000,  9.58984, 12.83157, 19.11218,  8.93250,  1.00000],
 [ 1.00000, 23.00000, 15.40672,  9.86281,  6.71782,  9.28625,  1.00000]],

[[ 0.00000, 45.00000,  9.58984, 12.83157, 19.11218,  8.93250,  2.00000],
 [ 1.00000, 23.00000, 15.40672,  9.86281,  6.71782,  9.28625,  2.00000]]]
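
After this scaling, the target sizes are expressed in grid cells, the same unit as the anchors in section 3.1 (pixel anchors divided by the stride). The sketch below checks this for the first target, assuming a 640 x 640 network input; that size is not stated in the post and is inferred from 80 * 8 = 40 * 16 = 20 * 32.

```python
IMG_SIZE = 640         # assumption: inferred from grid size times stride
STRIDES = (8, 16, 32)
w_norm, h_norm = 0.95561, 0.44662  # width_norm, height_norm of the first target

rows = []
for stride in STRIDES:
    grid = IMG_SIZE // stride
    w_grid, h_grid = w_norm * grid, h_norm * grid       # what targets * gain produces
    w_px, h_px = w_norm * IMG_SIZE, h_norm * IMG_SIZE   # size in input pixels
    rows.append((grid, w_grid, h_grid, w_px / stride, h_px / stride))
    print(f"{grid}x{grid}: w={w_grid:.5f} h={h_grid:.5f} "
          f"(pixels / stride: {w_px / stride:.5f}, {h_px / stride:.5f})")
```

Multiplying a normalized size by the grid size gives the same number as dividing the pixel size by the stride, so targets and anchors can be compared directly at each scale.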
