前言

在最近的科研生活中，我时常遇上很有意思的小问题，在此我开一个专栏专门记录下我已经解决的和未曾解决的小灵感。

问题说明

小灵感：在一块平面上，有$n$个振动源在产生振动，根据物理定律这些振动点会在平面上产生振动分布（激励分布），我将振幅数据采集，绘制出类似于图1中的分布图。此时，我想将振动分布图作为输入数据，振动点源的位置坐标作为输出数据，尤其需要注意的是，我不在模型中设置最后输出的点源的数量，也就是说模型最后输出的坐标的数量不是一成不变的。

我认为有意思的地方是——模型在一方面在做类似对点源数量的回归任务，输出预测的点源数量，一方面在分类的基础上进行回归任务，输出激励点源的位置坐标。

灵感假设

要在一个模型中做到这些是较为困难的，因此，我设想是否能组合两个模型构建为一个网络进行训练呢？假设$Model A$作为实现输出点源数量的部分，那么在$Model B$的输出层中，最后的输出层（线性层）的维度则由$Model A$（输出的点源数量）控制。这样就简单构建出了一个组合模型网络架构。

PS：经过后续的查阅资料，原来本文的这种组合模型网络架构已经有较多的相关工作了，常在多任务学习中运用，被称为级联神经网络。

如图2所示，能够更加清晰的看清楚模型的运作机制。

灵感小验证

下面是代码部分，我先采用了简单的线性层进行实验，测试我的小灵感是否能够实现。

import torch
import torch.nn as nn
import torch.nn.functional as F

class MLP4Number(nn.Module):
    def __init__(self, channel=3, hidden_dim=64, output_dim=1):
        super().__init__()
        self.fc1 = nn.Linear(channel, hidden_dim)
        self.fc2 = nn.Linear(hidden_dim, output_dim)

    def forward(self, x):
        x = F.relu(self.fc1(x))
        x = self.fc2(x)
        x = torch.round(x).clamp(min=1)
        return x

class MLP4Coordinate(nn.Module):
    def __init__(self, channel=10, hidden_dim=64, output_dim=1):
        super().__init__()
        self.fc1 = nn.Linear(channel, hidden_dim)
        self.fc2 = nn.Linear(hidden_dim, output_dim)
        self.number = output_dim

    def forward(self, x):
        x = F.relu(self.fc1(x))
        x = self.fc2(x)
        x = x.view(self.number, 3)
        return x

class MLP4TwoTask(nn.Module):
    def __init__(self, input_channels=256, hidden_dim=64, output_dim=1):
        super().__init__()
        self.input = input_channels ** 2
        self.hidden = hidden_dim
        self.number = MLP4Number(self.input, hidden_dim, output_dim)

    def forward(self, x):
        c = torch.tensor([])
        x = x.view(x.size(0), -1)
        n = self.number(x)
        for i in n:
            self.coordinate = MLP4Coordinate(self.input, self.hidden, output_dim=int(i))
            a = self.coordinate(x)
            c = torch.cat((c, a), dim=0)
        return n, c

if __name__ == '__main__':
    model = MLP4TwoTask()
    imgs = torch.randn(3, 256, 256)
    output = model(imgs)
    print(output)

下面是经过ChatGPT润色后的代码实现。

import torch
import torch.nn as nn
import torch.nn.functional as F

class MLP4Number(nn.Module):
    def __init__(self, in_dim: int, hidden_dim: int = 64, output_dim: int = 1):
        super().__init__()
        self.fc1 = nn.Linear(in_dim, hidden_dim)
        self.fc2 = nn.Linear(hidden_dim, output_dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        """
        输入：
            x: Tensor，shape = (batch_size, in_dim)
        输出：
            Tensor，shape = (batch_size, 1)，经过 round 并 clamp(min=1)
        """
        x = F.relu(self.fc1(x))
        x = self.fc2(x)
        x = torch.round(x).clamp(min=1)
        return x.squeeze(-1)  # 返回 shape = (batch_size,)


class MLP4Coordinate(nn.Module):
    def __init__(self, in_dim: int, hidden_dim: int = 64, output_dim: int = 1):
        super().__init__()
        self.fc1 = nn.Linear(in_dim, hidden_dim)
        self.fc2 = nn.Linear(hidden_dim, output_dim * 3)
        # output_dim 表示“要预测几个三维点”，最终输出会 reshape 成 (output_dim, 3)

    def forward(self, x: torch.Tensor, num_points: int) -> torch.Tensor:
        """
        输入：
            x: Tensor，shape = (1, in_dim) 或 (batch=1, in_dim)  —— 单个样本
            num_points: int，表示要预测多少个 (x,y,z) 坐标点
        输出：
            Tensor，shape = (num_points, 3)
        """
        x = F.relu(self.fc1(x))
        x = self.fc2(x)  # shape = (1, num_points*3)
        x = x.view(num_points, 3)  # reshape 成 (num_points, 3)
        return x


class MLP4TwoTask(nn.Module):
    def __init__(self, input_size: int = 256, hidden_dim: int = 64):
        """
        input_size: 图像边长（如 256），网络的输入是 (batch, input_size, input_size)
        hidden_dim: MLP 隐藏层维度
        """
        super().__init__()
        self.input_dim = input_size * input_size  # 256*256
        self.hidden_dim = hidden_dim

        # 阶段一：预测要输出多少个点
        self.number_net = MLP4Number(in_dim=self.input_dim,
                                    hidden_dim=hidden_dim,
                                    output_dim=1)  # 单个数值

        # 注意：这里并不预先创建 MLP4Coordinate，因为 output_dim（坐标点数）是动态的。
        # 我们会在 forward 里按需实例化一个局部的 MLP4Coordinate

    def forward(self, x: torch.Tensor) -> (torch.Tensor, torch.Tensor):
        """
        输入：
            x: Tensor，shape = (batch_size, input_size, input_size)
        输出：
            num_list: Tensor，shape = (batch_size,) —— 每个样本预测的点的数量 (>=1)
            coords_all: Tensor，shape = (sum(num_list), 3) —— 把所有样本的坐标拼在一起
        """
        batch_size = x.size(0)
        device = x.device

        # 先把 (batch, H, W) -> (batch, H*W)
        x_flat = x.view(batch_size, -1)  # shape = (batch_size, input_dim)

        # 阶段一：预测每个样本要多少个点 (>=1)
        num_list = self.number_net(x_flat)  # shape = (batch_size,)
        # num_list 内的值是浮点取整后保证 ≥1，比如 [3., 1., 5., ...]

        coords_list = []
        for idx in range(batch_size):
            num_i = int(num_list[idx].item())  # 取出第 idx 个样本需要的点数 (Python int)

            # 注意：这里用临时变量，不要写成 self.coord_net，否则会被注册到 Module 里
            coord_net = MLP4Coordinate(in_dim=self.input_dim,
                                       hidden_dim=self.hidden_dim,
                                       output_dim=num_i).to(device)

            # 取出第 idx 个样本的扁平化向量，形状要是 (1, input_dim)，方便 MLP4Coordinate 接受
            single_x = x_flat[idx].unsqueeze(0)  # shape = (1, input_dim)

            # 得到坐标：shape = (num_i, 3)
            coords_i = coord_net(single_x, num_points=num_i)
            coords_list.append(coords_i)

        # 把所有样本预测出来的坐标一起 cat，最终形状 = (sum(num_list), 3)
        coords_all = torch.cat(coords_list, dim=0)

        return num_list, coords_all


if __name__ == "__main__":
    # 示例运行
    model = MLP4TwoTask(input_size=256, hidden_dim=64)  # 如果有 GPU，就 .cuda()
    imgs = torch.randn(3, 256, 256)
    num_out, coords_out = model(imgs)
    print("Num per sample:", num_out)            # e.g. tensor([2., 5., 1.], device='cuda:0')
    print("All coords shape:", coords_out.shape) # (2+5+1, 3) = (8,3) 这种形式

验证实验

接下来我们将级联MLP改变为级联CNN，下面开始实验。

构建数据集

在本文中的所有数据都是采用了数值模拟的办法，下面是我们简要构建数据集的代码。

import numpy as np
import matplotlib.pyplot as plt
import os
import random
from matplotlib import cm  # 导入颜色映射模块

# 创建输出目录
output_dir = "data"
if not os.path.exists(output_dir):
    os.makedirs(output_dir)

# 波的参数
k = 0.2  # 波的频率
speed = 0.2  # 波的传播速度
nums = 100  # 生成数据数量
resolution = 400  # 图像分辨率

# 网格点
x = np.linspace(-10, 10, resolution)
y = np.linspace(-10, 10, resolution)
X, Y = np.meshgrid(x, y)

# 随机生成激励点位置
def generate_random_sources(num_sources):
    sources = []
    for _ in range(num_sources):
        x_pos = random.uniform(0, 1)  # 随机 x 坐标
        y_pos = random.uniform(0, 1)  # 随机 y 坐标
        sources.append(np.array([x_pos, y_pos]))
    return sources

# 生成波的幅度
def generate_wave(sources, frame):
    Z = np.zeros_like(X)  # 初始化波的幅度
    for center in sources:
        R = np.sqrt((X - center[0]) ** 2 + (Y - center[1]) ** 2)
        Z += np.sin(k * R - speed * frame)  # 叠加每个激励点的波
    return Z

# 生成并保存数据（添加高斯噪声）
def generate_and_save_data(num_sources, output_dir="route_data"):
    # 创建子目录
    npy_dir = os.path.join(output_dir, "npy")
    pic_dir = os.path.join(output_dir, "pic")
    os.makedirs(npy_dir, exist_ok=True)
    os.makedirs(pic_dir, exist_ok=True)

    for frame in range(nums):
        # 随机生成激励点位置
        sources = generate_random_sources(num_sources)

        # 生成波的幅度
        Z = generate_wave(sources, frame)

        # 将波形数据映射为 RGB 图像数据
        norm_z = (Z - Z.min()) / (Z.max() - Z.min())
        rgba_data = cm.jet(norm_z)
        rgb_data = rgba_data[..., :3]

        # 添加高斯噪声
        noise_std = 0.1  # 噪声强度，根据需要调整
        noise = np.random.normal(0, noise_std, rgb_data.shape)
        noisy_rgb = np.clip(rgb_data + noise, 0, 1)  # 限制在有效范围

        # 保存带噪声的npy文件
        npy_filename = os.path.join(npy_dir, f"{num_sources}_frame_{frame:03d}.npy")
        np.save(npy_filename, noisy_rgb)

        # 可视化并保存图片
        img_filename = os.path.join(pic_dir, f"{num_sources}_frame_{frame:03d}.png")
        plt.imshow(noisy_rgb)
        # 标记激励点
        for source in sources:
            x_idx = np.argmin(np.abs(x - source[0]))
            y_idx = np.argmin(np.abs(y - source[1]))
            plt.scatter(x[x_idx], y[y_idx], color='white', s=100, marker='x', alpha=0.6)  # 调整透明度
        plt.axis('off')
        plt.savefig(img_filename, bbox_inches='tight', pad_inches=0)
        plt.clf()

    print(f"Generated route_data for {num_sources} sources with Gaussian noise.")
    print(f"Saved route_data to {output_dir}")

# 生成带噪声的数据
generate_and_save_data(1)
generate_and_save_data(2)
generate_and_save_data(3)

在上述的代码中，我们设置波的频率为$k=0.2$，传播速度为0.2，每组数量为100，高斯噪声为0.1，每组的图像分辨率为$400 \times 400$，随机点的x和y的坐标范围均在$[0,1]$。其中，由于神经网络的本质其实就是一种函数映射$f(\cdot)$，所以我们可以随意构建具体坐标（因为只是要验证idea的可行性，无需太过严谨），通过构建在x和y的坐标范围内的坐标数据并将其作为文件名对Data进行重命名，得到最后的数据集。

PS（微操）：无论坐标数据的范围，其实都可以将其归一化后输入网络中进行学习，前提是需要提取纪录下数据中的最大值与最小值用来还原回原坐标。

我们在训练过程中构建出Data内容为（Images，Coords）的数据集，其中Images为（batch，3，H，W），Coords为（batch，coords），Images Size为$400 \times 400$。

下面是构建数据集部分的代码：

class NpyDataset(Dataset):
    def __init__(self, data_dir: str):
        super().__init__()
        self.file_paths = glob.glob(os.path.join(data_dir, "*.npy"))
        self.file_paths.sort()

    def __len__(self):
        return len(self.file_paths)

    def __getitem__(self, idx: int):
        file_path = self.file_paths[idx]
        filename = os.path.basename(file_path)

        # 加载 .npy 文件内容
        np_array = np.load(file_path)  # e.g. shape = (400, 400, 3)

        # 如果是三通道(H, W, 3)，转成 (3, H, W)
        if np_array.ndim == 3:
            # 假设最后一维是 channel
            data_tensor = torch.from_numpy(np_array).permute(2, 0, 1).float()
        elif np_array.ndim == 2:
            # 如果是单通道(H, W)，加一个通道维 (1, H, W)
            data_tensor = torch.from_numpy(np_array).unsqueeze(0).float()
        else:
            # 如果数据有其他维度，就根据实际情况处理
            raise ValueError(f"Unsupported npy shape: {np_array.shape}")

        # 从文件名解析坐标（不变）
        name_without_ext = os.path.splitext(filename)[0]
        parts = name_without_ext.split("_")
        coords = []
        for i in range(0, len(parts), 2):
            x = float(parts[i]);
            y = float(parts[i + 1])
            coords.append([x, y])
        coord_tensor = torch.tensor(coords, dtype=torch.float32)

        return data_tensor, coord_tensor

def variable_collate_fn(batch):
    """
    自定义 collate_fn，用于处理每个样本 label (coord) 长度不同的情况。
    batch 是 list of (data_tensor, coord_tensor)。

    返回:
      data_batch: Tensor, 将 data_tensor 堆叠，形状 = (batch_size, ...)
      coords_list: List of Tensors, 每个 coord_tensor 形状 = (Ni, 2)
    """
    data_list, coords_list = zip(*batch)
    data_batch = torch.stack(data_list, dim=0)
    return data_batch, list(coords_list)

构建级联CNN

如图3所示，我们先采用卷积操作提取图片的特征信息，然后将特征信息输入线性层进行进一步的学习和变化，最后输出我们想要得到的数据。其中，在Feature Extraction中，为了使得网络层数能够更多，我们采用了由残差连接构成的残差块，经过堆叠构成总体的卷积提取部分。

PS：由于图3中模型的表达能力有限，下方的代码仅仅只表示模型搭建的基本思路，在复现过程中请根据自己的需求对模型进行模块的添加，本文中的所有代码均放置于Github中，有需要浮现者可以前往我的Github上查看完整的代码。（https://github.com/MAOJIUTT/Cascade-Neural-Networks）

下面是模型的代码

class ResBlock(nn.Module):
    def __init__(self, channels: int):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(channels)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        identity = x
        out = self.conv1(x)
        out = self.bn1(out)
        out = F.relu(out, inplace=True)
        out = self.conv2(out)
        out = self.bn2(out)
        out += identity
        out = F.relu(out, inplace=True)
        return out


class CNN4TwoStage(nn.Module):
    def __init__(
        self,
        in_channels: int = 1,
        base_channels: int = 32,
        num_resblocks: int = 2,
        max_points: int = 10,
        hidden_dim: int = 128
    ):
        super().__init__()
        self.in_channels = in_channels
        self.base_channels = base_channels
        self.max_points = max_points
        self.hidden_dim = hidden_dim

        # backbone
        self.conv_start = nn.Sequential(
            nn.Conv2d(in_channels, base_channels, kernel_size=3, padding=1, bias=False),
            nn.BatchNorm2d(base_channels),
            nn.ReLU(inplace=True)
        )
        res_blocks = []
        for _ in range(num_resblocks):
            res_blocks.append(ResBlock(base_channels))
        self.residual_layers = nn.Sequential(*res_blocks)
        self.global_pool = nn.AdaptiveAvgPool2d((1,1))
        self.fc_feature = nn.Linear(base_channels, hidden_dim)

        # Number Head
        self.number_fc = nn.Linear(hidden_dim, 1)

        # Coordinate Head (改为 max_points * 2)
        self.coord_fc = nn.Linear(hidden_dim, max_points * 2)

    def forward(self, x: torch.Tensor) -> (torch.Tensor, torch.Tensor):
        """
        输入：
            x: (batch, in_channels, H, W)
        输出：
            num_list: (batch,) 整数型，预测的每张图要输出几个点
            coords_all: (sum_i Ni, 2) 所有样本拼接后的 (x,y) 坐标
        """
        batch_size = x.size(0)
        device = x.device

        # backbone 提取特征
        out = self.conv_start(x)           # (batch, base, H, W)
        out = self.residual_layers(out)    # (batch, base, H, W)
        out = self.global_pool(out)        # (batch, base, 1, 1)
        out = out.view(batch_size, -1)     # (batch, base)
        feat = F.relu(self.fc_feature(out))# (batch, hidden_dim)

        # Number Head
        num_pred = self.number_fc(feat)    # (batch,1)
        num_pred = torch.round(num_pred).clamp(min=1)  # 保证 ≥1
        num_list = num_pred.squeeze(-1).long()         # (batch,)

        # Coordinate Head → (batch, max_points*2)
        coord_pred = self.coord_fc(feat)   # (batch, max_points*2)

        coords_list = []
        for i in range(batch_size):
            n_i = int(num_list[i].item())
            if n_i > self.max_points:
                n_i = self.max_points
            # 取前 n_i*2 个数，reshape 为 (n_i,2)
            coords_i = coord_pred[i, : n_i * 2].view(n_i, 2)
            coords_list.append(coords_i)

        # 把所有样本的 coords_i 在第 0 维拼起来 -> (sum_i n_i, 2)
        coords_all = torch.cat(coords_list, dim=0)
        return num_list, coords_all

阶段性训练（冻结-微调策略）

由于级联CNN是多任务结构，是一个典型的多任务学习模型（Multi-Task Learning，MTL），拥有两个部分——主干网络（Backbone）和任务分支（Head）。主干网络用于提取共享特征，任务分支则是完成对应需要完成的任务。其中，在本文的多任务结构中有两个关联的学习阶段，阶段1 为 Number Head 预测的是每张图中目标的个数；阶段2 为 Coordinate Head 预测的是目标的坐标，而它依赖于阶段1 Number Head的预测结果。因此，可以认为如果 Number Head 预测不准，那么 Coordinate Head 的训练会受影响（如点数不对导致配对错误）。所以，我们可以采用如下两段式训练策略。

第一阶段：冻结 Coordinate Head ，只优化数量预测，代码如下。

def train_number_head(model, loader, optimizer, epochs=10):
    model.train()

    # 冻结坐标预测头
    for param in model.coord_fc.parameters():
        param.requires_grad = False

    mse_num = nn.MSELoss()

    for epoch in range(epochs):
        epoch_loss = 0.0
        for x_batch, coords_list in loader:
            x_batch = x_batch.to(device)
            num_target = torch.tensor(
                [coords.shape[0] for coords in coords_list],
                dtype=torch.float32, device=device
            )

            num_pred, _ = model(x_batch)

            loss = mse_num(num_pred.float(), num_target)

            optimizer.zero_grad()
            loss.backward()
            optimizer.step()

            epoch_loss += loss.item()

        print(f"[Number Head] Epoch [{epoch+1}/{epochs}]  Loss: {epoch_loss:.6f}")

第二阶段：解冻 Coordinate Head ，可选冻 Number Head ，代码如下。

def train_coord_head(model, loader, optimizer, epochs=10, lambda_coord=1.0, freeze_number_head=False):
    model.train()

    # 解冻坐标头
    for param in model.coord_fc.parameters():
        param.requires_grad = True

    # 可选：冻结数量预测头
    if freeze_number_head:
        for param in model.number_fc.parameters():
            param.requires_grad = False

    mse_num = nn.MSELoss()
    mse_coord = nn.MSELoss()

    for epoch in range(epochs):
        total_loss_num, total_loss_coord = 0.0, 0.0

        for x_batch, coords_list in loader:
            x_batch = x_batch.to(device)

            num_target = torch.tensor(
                [coords.shape[0] for coords in coords_list],
                dtype=torch.float32, device=device
            )

            coords_target_all = torch.cat(
                [coords.to(device) for coords in coords_list], dim=0
            )

            num_pred, coords_pred_all = model(x_batch)

            loss_num = mse_num(num_pred.float(), num_target)
            loss_coord = mse_coord(coords_pred_all, coords_target_all)

            total_loss = loss_num + lambda_coord * loss_coord

            optimizer.zero_grad()
            total_loss.backward()
            optimizer.step()

            total_loss_num += loss_num.item()
            total_loss_coord += loss_coord.item()

        print(f"[Coord Head] Epoch [{epoch+1}/{epochs}]  NumLoss: {total_loss_num:.6f}  CoordLoss: {total_loss_coord:.6f}")

但是在训练过程中计算Loss值时，仍然会遇上点数不对导致配对错误的问题，本篇Blog目前未能解决该问题，所以在本文中我所采用的解决方法是分段式训练模型，先将模型中预测激励点数量的模块部分训练到90%以上的准确率，然后再在这个基础上训练预测坐标部分的模型，这样就能够保证一定的模型预测准确率。

PS：本Blog中未解决的问题部分我会在暑期中尝试解决。

损失函数

在训练过程中，损失函数是尤为重要的，但是在这个网络架构中，模型的输出维度是动态的，在维度上无法很好的与数据的标签维度对应，因此无法直接对其进行求损失。结合本文的训练思路，我采用逐数量一一对应求Loss损失值。

具体意思就是，当模型输出维度与数据标签的维度不匹配时，我只需要根据较小那一部分的输出维度作为计算Loss损失值时的维度。因为模型的输出维度与数据标签维度不同时会存在两种情况，一是模型的输出维度小于数据标签维度（预测点的数量小于目标数量），二是模型的输出维度大于数据标签维度（预测点的数量大于目标数量）。所以为了保持维度上的统一以方便计算Loss损失值，我决定根据较小那一部分的输出维度作为计算Loss损失值时的维度。

所以我们将第二阶段的代码修改如下：

def train_coord_head(model, loader, optimizer, epochs=10, lambda_coord=1.0, freeze_number_head=False, coord_eps=0.5):
    model.train()

    for param in model.coord_fc.parameters():
        param.requires_grad = True

    if freeze_number_head:
        for param in model.number_fc.parameters():
            param.requires_grad = False

    mse_num = nn.MSELoss()
    mse_coord = nn.MSELoss()

    for epoch in range(epochs):
        total_loss_num, total_loss_coord = 0.0, 0.0
        correct_count_num = 0
        total_count_num = 0

        correct_count_coord = 0
        total_count_coord = 0

        for x_batch, coords_list in loader:
            x_batch = x_batch.to(device)

            num_target = torch.tensor(
                [coords.shape[0] for coords in coords_list],
                dtype=torch.float32, device=device
            )

            coords_target_all = torch.cat(
                [coords.to(device) for coords in coords_list], dim=0
            )

            num_pred, coords_pred_all = model(x_batch)

            # ===== 点数预测 =====
            loss_num = mse_num(num_pred.float(), num_target)

            pred_rounded = torch.round(num_pred).clamp(min=1, max=model.max_points)
            correct_count_num += (pred_rounded == num_target).sum().item()
            total_count_num += len(num_target)

            # ===== 坐标预测 =====
            pred_coords_split = []
            target_coords_split = []
            idx_pred = 0
            idx_target = 0
            for i in range(len(coords_list)):
                coords_target_i = coords_list[i].to(device)
                n_target = coords_target_i.shape[0]
                n_pred = int(pred_rounded[i].item())

                coords_pred_i = coords_pred_all[idx_pred: idx_pred + n_pred]
                idx_pred += n_pred
                idx_target += n_target

                # 取最小长度
                n_common = min(n_pred, n_target)
                if n_common > 0:
                    coords_pred_i = coords_pred_i[:n_common]
                    coords_target_i = coords_target_i[:n_common]

                    pred_coords_split.append(coords_pred_i)
                    target_coords_split.append(coords_target_i)

                    # ==== 准确率统计 ====
                    dist = torch.norm(coords_pred_i - coords_target_i, dim=1)
                    correct_count_coord += (dist < coord_eps).sum().item()
                    total_count_coord += n_common

            if pred_coords_split:
                coords_pred_used = torch.cat(pred_coords_split, dim=0)
                coords_target_used = torch.cat(target_coords_split, dim=0)
                loss_coord = mse_coord(coords_pred_used, coords_target_used)
            else:
                loss_coord = torch.tensor(0.0, device=device)

            total_loss = loss_num + lambda_coord * loss_coord

            optimizer.zero_grad()
            total_loss.backward()
            optimizer.step()

            total_loss_num += loss_num.item()
            total_loss_coord += loss_coord.item()

        acc_num = correct_count_num / total_count_num if total_count_num > 0 else 0.0
        acc_coord = correct_count_coord / total_count_coord if total_count_coord > 0 else 0.0

        print(f"[Coord Head] Epoch [{epoch+1}/{epochs}]  NumLoss: {total_loss_num:.6f}  CoordLoss: {total_loss_coord:.6f}  "
              f"NumAcc: {acc_num:.4f}  CoordAcc(@{coord_eps}): {acc_coord:.4f}")

可以看到，在改进后的代码中，我们在判断坐标预测的准确率的时候，设置了coord_eps参数为0.5，作为距离容差（距离容忍度的差值），用其作为距离容差计算得到的Coord Acc值则表示为预测点距离真实点的欧几里得距离小于0.5的预测点的比例。

模型的性能

由于在模型的训练过程中，采用图3的网络架构训练的效果不是很好，我们重新根据图3的基本思路在特征提取阶段添加了下采样的过程，以及在坐标预测分支中将线性层替换为残差MLP结构，具体代码如下：

class ResBlock(nn.Module):
    def __init__(self, channels: int):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(channels)

    def forward(self, x):
        identity = x
        out = F.relu(self.bn1(self.conv1(x)), inplace=True)
        out = self.bn2(self.conv2(out))
        return F.relu(out + identity, inplace=True)

class DownBlock(nn.Module):
    def __init__(self, in_channels, out_channels, dropout_rate=0.0, use_resblock=True):
        super().__init__()
        self.down = nn.Sequential(
            nn.Conv2d(in_channels, out_channels, kernel_size=4, stride=2, padding=1, bias=False),
            nn.BatchNorm2d(out_channels),
            nn.ReLU(inplace=True),
        )
        self.res = ResBlock(out_channels) if use_resblock else nn.Identity()
        self.dropout = nn.Dropout2d(dropout_rate) if dropout_rate > 0 else nn.Identity()

    def forward(self, x):
        x = self.down(x)
        x = self.res(x)
        x = self.dropout(x)
        return x

class ResidualCoordHead(nn.Module):
    def __init__(self, hidden_dim, max_points):
        super().__init__()
        self.max_points = max_points

        self.fc1 = nn.Linear(hidden_dim, hidden_dim)
        self.norm1 = nn.LayerNorm(hidden_dim)
        self.dropout1 = nn.Dropout(0.2)

        self.fc2 = nn.Linear(hidden_dim, hidden_dim // 2)
        self.norm2 = nn.LayerNorm(hidden_dim // 2)
        self.dropout2 = nn.Dropout(0.2)

        self.output_layer = nn.Linear(hidden_dim // 2, max_points * 2)
        self.shortcut = nn.Linear(hidden_dim, hidden_dim // 2)

    def forward(self, x):
        residual = self.shortcut(x)
        x = F.relu(self.norm1(self.fc1(x)))
        x = self.dropout1(x)
        x = F.relu(self.norm2(self.fc2(x)))
        x = self.dropout2(x)
        return self.output_layer(x + residual)

class CNN4TwoStage(nn.Module):
    def __init__(
        self,
        in_channels=1,
        base_channels=32,
        max_points=10,
        hidden_dim=128,
        num_down_blocks=6,
        dropout_rate=0.3,
        use_resblock=True,
    ):
        super().__init__()
        self.max_points = max_points

        # 构建 DownBlock 列表
        self.down_blocks = nn.ModuleList()
        channels = [in_channels] + [base_channels * (2 ** i) for i in range(num_down_blocks)]
        for i in range(num_down_blocks):
            self.down_blocks.append(DownBlock(
                in_channels=channels[i],
                out_channels=channels[i + 1],
                dropout_rate=dropout_rate,
                use_resblock=use_resblock
            ))

        self.global_pool = nn.AdaptiveAvgPool2d((1, 1))
        final_channel = channels[-1]

        self.fc_feature = nn.Sequential(
            nn.Flatten(),
            nn.Linear(final_channel, hidden_dim),
            nn.ReLU(inplace=True),
            nn.Dropout(dropout_rate),
        )

        self.number_fc = nn.Linear(hidden_dim, 1)
        self.coord_fc = ResidualCoordHead(hidden_dim, max_points)

    def forward(self, x):
        batch_size = x.size(0)
        device = x.device

        for block in self.down_blocks:
            x = block(x)

        x = self.global_pool(x)
        feat = self.fc_feature(x)

        num_pred = self.number_fc(feat).squeeze(-1)  # (B,)
        coord_pred = self.coord_fc(feat)  # (B, max_points * 2)

        coords_list = []
        for i in range(batch_size):
            n_i = int(torch.round(num_pred[i]).clamp(1, self.max_points).item())
            coords_i = coord_pred[i, :n_i * 2].view(n_i, 2)
            coords_list.append(coords_i)

        coords_all = torch.cat(coords_list, dim=0) if coords_list else torch.empty(0, 2, device=device)
        return num_pred, coords_all

我大概做了一些简单的对比实验来测量模型的基本性能，再进行训练前，为了实验的可复现性，我设置了随机种子进行初始化模型参数，代码如下，记住这个函数需要放在导库后以及所有函数前，并且需要在所有函数前调用。

def set_seed(seed: int = 42):
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)  # if using multi-GPU
    torch.backends.cudnn.deterministic = True
    torch.backends.cudnn.benchmark = False

set_seed()

特别注意本文中只探究Idea的可实现性，而不做严谨的探讨（比如说还需要在验证集上进行测试等）

效果验证

如图4，我们能够看到阶段1中的MSE损失曲线和预测准确率曲线，均能看到loss都在下降，我们先进行横向对比，也就是固定隐藏层为128，然后对比在不同Block数量下的收敛情况；再固定Block数量为7，然后再对比不同隐藏层数量下的收敛情况。根据MSE损失图像，得到结论——在相同隐藏层下增加Block的数量会加快收敛速度和收敛效果；在相同Block数量下增加隐藏层维度同样也会加快收敛速度和收敛效果。当然，也可以看到准确率的曲线也符合MSE损失图像得到的结论。

如图5，由阶段2中的MSE损失图像依旧可以得到阶段1中的结论，同样证明了“在相同隐藏层下增加Block的数量会加快收敛速度和收敛效果；在相同Block数量下增加隐藏层维度同样也会加快收敛速度和收敛效果”。

如图6，我还探究了在设置的不同距离容差下，阶段2的准确率情况。我发现模型预测的精度大概是在0.1～0.2左右，因为模型在距离容差为0.5，0.4，0.3的情况下都能够保持90%以上高准确率的效果，在距离容差为0.2时，模型的准确率下降明显，尽管同样有接近80%的准确率，但是当容差为0.1时，模型的准确率断崖式下降，虽然使用7-256（Block数量为7，隐藏层维度为256）的模型的准确率比其他模型准确率要高得多，但是准确率仍然很低（低于35%）。所以能够知道模型的坐标预测精度为0.1～0.2左右，总体上预测效果不错。

性能指标

接下来我会以数值的形式来研究模型的效果，但是具体数值表示的意义就不再赘述。

Blocks	Hidden Dim	Num MSE	Coord MSE	Num Acc	Num Time	Coord Time	Param
4	128	0.714229	0.418045	84.00%	499	807	2.2M
5	128	0.316047	0.296123	94.67%	513	814	8.8M
6	128	0.243847	0.113514	97.67%	537	822	34.8M
7	128	0.250646	0.103853	97.00%	542	889	138.9M
7	256	0.159434	0.070690	99.33%	579	904	139.3M

模型7-256在各方面均远优于模型4-128，其中Num MSE下降了约78.87%，Coord MSE下降了约83.25%，Num Acc提高了约15.33%，但是时间开销成本分别只增加了16%和12%。

Blocks	Hidden Dim	Coord(0.5)	Coord(0.4)	Coord(0.3)	Coord(0.2)	Coord(0.1)
4	128	81.65%	53.01%	29.52%	14.09%	4.24%
5	128	90.39%	73.82%	49.58%	25.25%	7.59%
6	128	99.83%	96.95%	86.80%	62.75%	23.74%
7	128	100%	97.97%	88.18%	66.78%	25.50%
7	256	100%	99.66%	95.81%	77.76%	36.29%

如上表示，在确定精度的时，模型7-256依旧要在各方面远优于模型4-128，当距离容差分别为0.5，0.4，0.3，0.2，0.1时，预测的准确率（精准度）分别提高了18%，46%，66%，63%，32%。

小结

总体上来看，本文的级联神经网络的Idea是可行的，但是在具体的实验过程中还是会有一些不严谨的地方存在，例如生成数据集的时候采用的随机坐标设置，这个也算上另外一种数据污染吧，还有在实验中没有设置验证集和测试集，当然还有很多其他的问题，在此就不再赘述，最后再次强调本文只是用于验证Idea的可实现性，并非是一篇严谨的论文或者技术分享帖，感谢理解。