2022 AINNCS Benchmarks
We will discuss the 2022 ARCH-COMP AINNCS benchmarks here. We plan to largely reuse the 2021 benchmarks, most of which were already used in the 2020 edition.
2021 benchmarks:
https://github.com/verivital/ARCH-COMP2021
2021 report "ARCH-COMP21 Category Report: Artificial Intelligence and Neural Network Control Systems (AINNCS) for Continuous and Hybrid Systems Plants":
https://easychair.org/publications/paper/Jq4h
2021 repeatability:
https://gitlab.com/goranf/ARCH-COMP/-/tree/master/2021/AINNCS
Dear all,
I would suggest standardizing the result reporting for better comparability in the competition.
I feel like that was missing last year.
For example, if a benchmark can be falsified, some picked a falsifying setting in advance (e.g. a subset of the initial set or a specific simulation) and let the tool run with that setting.
But that's not the tool picking the settings, so the results can be misleading.
For a tool to verify or falsify a given benchmark, it would need to, e.g.:
- Do some simulations (using random/educated guesses)
- Check if any of these violate the specifications
- If not, compute an overapproximation of the reachable set
- Check if the set violates the specifications.
Thus, the tool output could look like:
- BENCHMARK: Adaptive Cruise Controller (ACC)
- Time to compute random Simulations: 0.91241
- Time to check Violation in Simulations: 0.002291
- Time to compute Reachable Set: 1.7505
- Time to check Verification: 0.028102
- Result: VERIFIED
- Total Time: 2.6933
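As a rough illustration of this workflow, a minimal Python sketch could look as follows; the callables simulate, violates_spec, compute_reachable_set, and set_satisfies_spec are hypothetical stand-ins, not any tool's actual API:

```python
# Minimal sketch of the proposed reporting workflow. The four callables are
# hypothetical stand-ins for whatever a participating tool actually provides.
import time

def run_benchmark(name, simulate, violates_spec, compute_reachable_set, set_satisfies_spec):
    report = {"BENCHMARK": name}

    t0 = time.perf_counter()
    simulations = simulate()  # random/educated-guess simulation runs
    report["Time to compute random Simulations"] = time.perf_counter() - t0

    t0 = time.perf_counter()
    falsified = any(violates_spec(run) for run in simulations)
    report["Time to check Violation in Simulations"] = time.perf_counter() - t0

    if falsified:
        report["Result"] = "FALSIFIED"
    else:
        t0 = time.perf_counter()
        reach_set = compute_reachable_set()  # overapproximation of the reachable set
        report["Time to compute Reachable Set"] = time.perf_counter() - t0

        t0 = time.perf_counter()
        verified = set_satisfies_spec(reach_set)
        report["Time to check Verification"] = time.perf_counter() - t0
        report["Result"] = "VERIFIED" if verified else "UNKNOWN"

    report["Total Time"] = sum(v for k, v in report.items() if k.startswith("Time to"))
    return report
```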
Hi all,
I am proposing two new benchmarks, a 6D attitude control benchmark and a 12D quadrotor benchmark. The model description and neural network controllers in .onnx format are attached.
Update 15 June 2022: Bugs in the .onnx files have been fixed. Please download the newest version.
When do we consider these benchmarks as verified/violated?
With all benchmarks from last year, we had a final time step,
whereas these benchmarks are about showing stability.
A violation criterion could be that a simulation run is still outside some goal region after x seconds,
a verification criterion could be that the reachable set at a time step is contained in the reachable set of a previous time step and lies within the goal region (see the sketch below).
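For concreteness, a minimal sketch of these two criteria, assuming for illustration that reachable sets and the goal region are axis-aligned boxes given as per-dimension (lo, hi) bounds (actual tools use richer set representations):

```python
# Sketch of the suggested verification/violation criteria. Sets are assumed to be
# axis-aligned boxes, i.e. lists of per-dimension (lo, hi) bounds.

def box_contained_in(inner, outer):
    return all(lo_o <= lo_i and hi_i <= hi_o
               for (lo_i, hi_i), (lo_o, hi_o) in zip(inner, outer))

def point_in_box(x, box):
    return all(lo <= xi <= hi for xi, (lo, hi) in zip(x, box))

def verified_at_step(reach_sets, k, goal):
    # reach set at step k lies in the goal region and inside some earlier reach set
    return box_contained_in(reach_sets[k], goal) and any(
        box_contained_in(reach_sets[k], reach_sets[j]) for j in range(k))

def violated(trajectory, goal, t_limit):
    # a simulated state is still outside the goal region after t_limit seconds;
    # trajectory is a list of (time, state) pairs
    return any(t > t_limit and not point_in_box(x, goal) for t, x in trajectory)
```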
The suggested verification criterion might become an issue when the set oscillates around some desired value.
E.g. for the attitude control benchmark, I have attached some example plots. The angular velocity keeps oscillating around 0.
Proposing a new benchmark environment from the AFRL Safe Autonomy Team.
This environment is a simplified 2D satellite docking problem where a point-mass satellite with thrusters in the +/- x/y directions attempts to navigate to the origin while obeying a distance-dependent velocity limit.
Included are the neural network model in ONNX format, the MATLAB system dynamics, a specification txt, and a detailed description PDF.
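As a rough illustration only, the velocity limit could be checked pointwise as below; the constants are placeholders, and the exact values are defined in the attached specification txt:

```python
import math

# Illustrative pointwise check of the distance-dependent velocity limit.
# V0 and K are placeholder constants; see the specification txt for the real values.
V0 = 0.2   # assumed velocity limit at the origin
K = 0.002  # assumed growth of the limit with distance

def docking_state_safe(x, y, vx, vy):
    distance = math.hypot(x, y)   # distance to the origin (docking point)
    speed = math.hypot(vx, vy)
    return speed <= V0 + K * distance
```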
Looking forward to getting community feedback!
Thanks for the proposals. I have some comments and questions:
@toladner#1: I agree with you that the falsification cases are treated rather arbitrarily. Still, I am hesitant to enforce such a rule this year. One reason is that not many tools participated last time; conceptually, what you propose (running random simulations) is simple, but it still creates another entry burden. The second reason is that we are now already close to the submission deadline. Next year we can have an early discussion about how to treat those cases. If you think that the current comparison does not make sense, I would be fine with dropping the falsification benchmarks for this year, given that (I guess) most tools are interested in verification anyway.
@kidlovedog#2: I agree with @toladner#3 that we would need a clarification about the specifications:
1) For the attitude controller you do not provide any specification. For instance, the models last year all specified a time horizon and either a safety property or a "guaranteed-reachability" property.
2) For the quadrotor, can you provide a time horizon for those tools that cannot check unbounded time?
3) For the quadrotor, I think there is a typo in the document you sent. Please compare the equation for x9' with the corresponding line in your code.
@uravaioli#4: Can you provide a time horizon for those tools that cannot check unbounded time?
@schillic
To be honest, we don't have an idea of what a reasonable time horizon would be and were hoping for some feedback from the community.
Let's start with 40 timesteps.
Thanks for the comment.
I have updated the document; please see the attachment.
Update 2 July 2022:
@kidlovedog: Thanks! There is still a typo in the equation (- instead of +).
As for results, we can easily verify the first model (attitude control) in 1.6 seconds. The second model (quadrotor) is tough even from a very small initial region. In simulations the system looks fine and stable, though.
Sorry, but where is the typo?
Having the same trouble as @schillic with the quadrotor benchmark.
Similarly, the spacecraft docking benchmark seems to be hard as well for our approach.
@uravaioli#4: Could you also provide a network without the pre-/postprocessing included plus a description of what these operations look like? The ONNX parser we have to use does not support matrix multiplication.
@schillic
I can work on providing this; however, the core network itself is based on matrix multiplication to implement the neural network layers. What set of ONNX operations for implementing neural networks do you support?
@uravaioli
To be clear, I meant matrix-matrix multiplication, which I thought comes from the preprocessing map P followed by the first weight matrix W. I imagined that you can instead just drop P and replace W by the result of P * W. If you also need this feature in the inner layers and cannot just write the resulting matrix instead, then we will probably not support this model, and I suggest not spending time on an alternative. We probably only support standard network architectures where a layer has the form layer(x) = act(x * W + b). But this is fine; not every tool has to support every model.
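For illustration, folding the preprocessing into the first layer under the x * W + b convention could look like this NumPy sketch (all shapes and values are placeholders):

```python
import numpy as np

# Fold a linear preprocessing map P into the first-layer weights, assuming the
# convention layer(x) = act(x @ W + b). Shapes and values are placeholders.
rng = np.random.default_rng(0)
P = rng.normal(size=(6, 6))    # preprocessing map (placeholder)
W = rng.normal(size=(6, 32))   # first-layer weight matrix (placeholder)
b = rng.normal(size=32)        # first-layer bias (placeholder)

W_folded = P @ W               # drop P and use W_folded in the first layer instead

x = rng.normal(size=6)
assert np.allclose((x @ P) @ W + b, x @ W_folded + b)
```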
@schillic
Thank you for the clarification. The uploaded model should have only vector-matrix multiplications; the only difference from a standard neural network model in the preprocessing is removing b (implicit 0) and removing act (implicit linear activation). The input to the preprocessing is a vector and the output should be a vector.
Is your parser throwing an error when attempting to parse the model?
Exactly, I get this error:
+ Error: Error while loading node ("model/preprocess_norm_scale/MatMul:0",) = MatMul("input_1", "model/preprocess_norm_scale/MatMul/ReadVariableOp:0")
+ @ ONNX ~/.julia/packages/ONNX/VJnjv/src/load.jl:66
ERROR: LoadError: MatMul with arrays of 1 and 2 is not implemented yet
Maybe it would help if you insert a zero b.
@uravaioli
I'm also a bit puzzled by the aircraft benchmark. I'm unsure if the network is loaded correctly.
I have attached an image of some simulation runs.
Do they look correct?
In general, I think it would be nice to have an image of some simulation runs from the benchmark proposer for a quick sanity check: whether the network is loaded correctly, whether any normalization is missing, etc.
Hi everyone,
We have created the new GitHub repo for this year's benchmarks: archcomp2022
For the docking benchmark, I have reconverted the model to ONNX, including bias vectors in the pre- and post-processing, following this function: load_docking_controller.m. The converted model can be found here: bias_model.onnx.
Hi @manzand, thanks for the efforts. I sent a few fixes.
About the bias_model.onnx for the spacecraft-docking model: Unfortunately, the parser we use also cannot deal with "Sub" operations. But the .mat file is helpful, thanks for that! I noticed that the output layer is 4-dimensional, but I guess it should be 2-dimensional:
W = [1 0 0 0; 0 0 1 0; 0 0 0 0; 0 0 0 0]
b = [0, 0, 0, 0]
act = Id
I guess it is safe to just remove the last two output dimensions, which are just 0 anyway.
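A quick sanity check of that claim, as a NumPy sketch assuming the postprocessing is applied in the column-vector convention y = W * x + b of the .mat file:

```python
import numpy as np

# Check that dropping the last two output dimensions is harmless, assuming the
# postprocessing layer is applied as y = W @ x + b (column-vector convention).
W = np.array([[1, 0, 0, 0],
              [0, 0, 1, 0],
              [0, 0, 0, 0],
              [0, 0, 0, 0]], dtype=float)
b = np.zeros(4)

x = np.random.default_rng(1).normal(size=4)  # arbitrary 4D network output
y_full = W @ x + b                           # last two entries are always 0
y_trim = W[:2] @ x + b[:2]                   # reduced 2D output

assert np.allclose(y_full[:2], y_trim) and np.allclose(y_full[2:], 0.0)
```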
As for results, simulations suggest that the model is safe, but our tool cannot prove that except for a very small initial set.
Hi Christian,
Thanks for the feedback. I used MATLAB to export this ONNX network using the exportONNXNetwork function. Here, we can specify different OpsetVersions, which correspond to different ways of saving the network. Depending on the version used (default = 8), it may work with other parsers/importers. Do you know if another version may work with your parser? I tested the saving and loading from and into MATLAB using all possible versions, and nothing seemed to change when using MATLAB to load it, but this of course will change when using different parsers (PyTorch/TensorFlow/Julia...).
As far as results go, I have not tested it yet. If this benchmark becomes an issue for all participants, we can change the specifications (e.g., a smaller initial set) for this or future iterations. We also do not need to include it in this year's report, but we should all try to verify it to at least get a sense of what challenges this benchmark presents and what makes sense to consider in the future.
@manzand
I agree. We should include the benchmark in the report even if it is an issue for all participants. This shows what current tools cannot (yet) solve and we might see progress in the future. For future iterations, having a simpler version of this benchmark might be helpful.