Industrial Benchmark Control Task

Task Introduction

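The Industrial Benchmark (IB) is a reinforcement learning benchmark environment designed to simulate characteristics found in various industrial control tasks, such as wind or gas turbines and chemical reactors. It covers many problems common in real-world industrial settings: high-dimensional continuous state and action spaces, delayed rewards, complex noise patterns, and multiple reactive objectives with high stochasticity. We also augment the original Industrial Benchmark environment by adding two dimensions of the system state to the observation space, which are used to compute the immediate reward at each step. Because IB is itself a high-dimensional and highly stochastic environment, no noise is added to the action data when sampling from it.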

Item	Description
Action Space	Continuous(3,)
Observation	Shape (180,)

Action Space

The action space consists of continuous 3-dimensional vectors. For details, see http://polixir.ai/research/neorl.

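Observation Space

The state is a 180-dimensional vector. The observation at each time step is actually a 6-dimensional vector; the dataset automatically concatenates the data of the previous 29 frames, so the current observation has dimension $180 = 6 \times 30$. For details, see http://polixir.ai/research/neorl.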
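
The frame concatenation described above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the helper `stack_frames`, the dummy frame values, and in particular the newest-first ordering are hypothetical, since the text does not say in which order the 30 frames are laid out. (The reward function later in this document reads indices 4 and 5 of `next_obs`, which would correspond to the latest frame under this layout, but that is not confirmed here.)

```python
from collections import deque
from typing import Deque, List

FRAME_DIM = 6    # per-step observation size stated in the text
NUM_FRAMES = 30  # the current frame plus the previous 29

def stack_frames(history: Deque[List[float]]) -> List[float]:
    """Flatten the frames in `history` into a single observation vector."""
    assert len(history) == NUM_FRAMES
    return [value for frame in history for value in frame]

# Hypothetical data: the frame at step t holds six copies of the value t.
history: Deque[List[float]] = deque(maxlen=NUM_FRAMES)
for t in range(NUM_FRAMES):
    # Assumption: the newest frame is placed at the front of the vector.
    history.appendleft([float(t)] * FRAME_DIM)

obs = stack_frames(history)
print(len(obs))         # 180
print(obs[:FRAME_DIM])  # the most recent frame under the newest-first assumption
```

With a `deque` of fixed `maxlen`, appending a new frame automatically drops the oldest one, so the stacked vector always stays at 180 entries.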

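Task Goal

The goal is to keep the indicators of the Industrial Benchmark task close to their target values. For details, see the reward function of the Industrial Benchmark task.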

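Training the Model

REVIVE is a tool driven by historical data. As described in the tutorial section of the documentation, applying REVIVE to the IB task involves the following steps:

Collect historical decision data: collect the historical decision data of the IB task.
Build the decision flow graph and array data:
- Build the decision flow graph and the array data from the business scenario and the collected historical data.
- The decision flow graph describes the interaction logic of the business data and is stored in a .yaml file.
- The array data holds the data of the nodes defined in the decision flow graph and is stored in a .npz or .h5 file.
Define the reward function:
- To obtain a better control policy, a reward function must be defined according to the task goal.
- The reward function defines the optimization objective of the policy and guides the control policy to keep the IB indicators close to their target values.
Start model training:
- Once the decision flow graph, training data, and reward function are defined, REVIVE can be used to train the virtual-environment model and the policy model.
Online testing:
- Deploy the policy model trained by REVIVE for online testing.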

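Preparing the Data

We use the IB dataset and reward function from NeoRL to build the training task. For details, see http://polixir.ai/research/neorl.

Defining the Decision Flow Graph

The complete training process for the IB task involves loading heterogeneous decision flow graphs: one graph is used to train the virtual environment and a different one to train the policy.

The following is the .yaml file used when training the virtual environment: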

yaml

metadata:
    columns:
    - obs_0:
        dim: obs
        type: continuous
    - obs_1:
        dim: obs
        type: continuous
    ...
    - obs_179:
        dim: obs
        type: continuous

    - obs_0:
        dim: current_next_obs
        type: continuous
    - obs_1:
        dim: current_next_obs
        type: continuous
    ...
    - obs_5:
        dim: current_next_obs
        type: continuous

    - action_0:
        dim: action
        type: continuous
    - action_1:
        dim: action
        type: continuous
    - action_2:
        dim: action
        type: continuous

    graph:
        #action:
        #- obs
        current_next_obs:
        - obs
        - action
        next_obs:
        - obs
        - current_next_obs

    expert_functions:
        next_obs:
            'node_function' : 'expert_function.next_obs'

The following is the .yaml file used when training the policy:

yaml

metadata:
    columns:
    - obs_0:
        dim: obs
        type: continuous
    - obs_1:
        dim: obs
        type: continuous
    ...
    - obs_179:
        dim: obs
        type: continuous

    - obs_0:
        dim: current_next_obs
        type: continuous
    - obs_1:
        dim: current_next_obs
        type: continuous
    ...
    - obs_5:
        dim: current_next_obs
        type: continuous

    - action_0:
        dim: action
        type: continuous
    - action_1:
        dim: action
        type: continuous
    - action_2:
        dim: action
        type: continuous

    graph:
        action:
        - obs
        current_next_obs:
        - obs
        - action
        next_obs:
        - obs
        - current_next_obs

    expert_functions:
        next_obs:
            'node_function' : 'expert_function.next_obs'

    #nodes:
    #  action:
    #      step_input: True
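
To make the difference between the two graphs concrete, the sketch below copies their `graph` sections into Python dictionaries and inspects them with the standard-library `graphlib` module (Python 3.9+). It is only a reading aid, not REVIVE's own loader: the variable names and the `leaf_nodes` helper are hypothetical, and how REVIVE processes these files internally is not shown in this document.

```python
from graphlib import TopologicalSorter

# Each node maps to its list of input nodes, copied from the two graphs above.
env_graph = {
    "current_next_obs": ["obs", "action"],
    "next_obs": ["obs", "current_next_obs"],
}
# The policy graph additionally defines `action` as a function of `obs`.
policy_graph = {"action": ["obs"], **env_graph}

def leaf_nodes(graph):
    """Nodes that are only used as inputs and never computed from other nodes."""
    inputs = {name for parents in graph.values() for name in parents}
    return inputs - set(graph)

env_order = list(TopologicalSorter(env_graph).static_order())
policy_order = list(TopologicalSorter(policy_graph).static_order())

print(sorted(leaf_nodes(env_graph)))     # ['action', 'obs']
print(sorted(leaf_nodes(policy_graph)))  # ['obs']
print(policy_order)  # ['obs', 'action', 'current_next_obs', 'next_obs']
```

In the environment graph `action` has no inputs of its own, so it appears as a leaf next to `obs`; in the policy graph it is computed from `obs` before `current_next_obs` and `next_obs`.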

Defining the Reward Function

Here we define the reward function for the policy node in the IB task:

python

import torch
from typing import Dict

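def get_reward(data : Dict[str, torch.Tensor]) -> torch.Tensor:
    """Reward for the IB policy node: the negative weighted sum of fatigue and consumption."""
    obs = data["obs"]
    next_obs = data["next_obs"]

    # Promote a single (unbatched) sample to a batch of size 1.
    single_reward = False
    if len(obs.shape) == 1:
        single_reward = True
        obs = obs.reshape(1, -1)
    if len(next_obs.shape) == 1:
        next_obs = next_obs.reshape(1, -1)

    CRF = 3.0  # weight on fatigue
    CRC = 1.0  # weight on consumption
    fatigue = next_obs[:, 4]
    consumption = next_obs[:, 5]

    cost = CRF * fatigue + CRC * consumption
    reward = -cost

    # A single sample yields a Python float; a batch yields a tensor of shape (N, 1).
    if single_reward:
        reward = reward[0].item()
    else:
        reward = reward.reshape(-1, 1)

    return reward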
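
Written as a formula, the reward computed by the code above is the negative weighted cost, with $f$ denoting the fatigue entry `next_obs[:, 4]` and $c$ the consumption entry `next_obs[:, 5]` (the symbols $f$, $c$, and $r$ are introduced here only for illustration):

```latex
$$
r = -\left(\mathrm{CRF}\cdot f + \mathrm{CRC}\cdot c\right) = -\left(3.0\,f + 1.0\,c\right)
$$
```

For example, with hypothetical values $f = 2$ and $c = 10$, the reward is $r = -(3.0 \cdot 2 + 1.0 \cdot 10) = -16$. Fatigue is weighted three times as heavily as consumption, so reducing fatigue by one unit is worth as much as reducing consumption by three.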

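Training the Control Policy

REVIVE provides the data and code needed for training; see the REVIVE source repository for details. After installing REVIVE, switch to the examples/task/IB directory and run the Bash commands below to train the virtual-environment model and the policy model. During training, you can open the log directory with tensorboard at any time to monitor progress. Once REVIVE has finished training both models, the saved models ( .pkl or .onnx) can be found in the log folder (logs/<run_id>).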

bash

# Train the environment model
python train.py \
    -df data/ib.npz \
    -cf data/ib_env.yaml \
    -rf data/ib_reward.py \
    -rcf data/config.json \
    -vm tune \
    -pm None \
    --run_id revive

# Train the policy model
python train.py \
    -df data/ib.npz \
    -cf data/ib_policy.yaml \
    -rf data/ib_reward.py \
    -rcf data/config.json \
    -vm None \
    -pm tune \
    --run_id revive

Testing the Model

After training, the provided Jupyter notebook script can be used to test the performance of the trained policy. For details, see the jupyter notebook.
