2022 AINNCS Benchmarks

ttj
We will discuss the 2022 ARCH-COMP AINNCS benchmarks here. We plan to largely reuse the 2021 benchmarks, most of which were also used in the 2020 edition.

2021 benchmarks:

https://github.com/verivital/ARCH-COMP2021

2021 report "ARCH-COMP21 Category Report: Artificial Intelligence and Neural Network Control Systems (AINNCS) for Continuous and Hybrid Systems Plants":

https://easychair.org/publications/paper/Jq4h

2021 repeatability:

https://gitlab.com/goranf/ARCH-COMP/-/tree/master/2021/AINNCS

toladner
Standardize Result Reporting

Dear all,

I suggest standardizing the result reporting for better comparability in the competition; this was missing from last year's edition.

For example, if a benchmark can be falsified, some participants picked a falsifying setting in advance (e.g., a subset of the initial set or a specific simulation) and let the tool run with that setting. But then it is not the tool that picks the settings, so the results can be misleading.

For a tool to verify or falsify a given benchmark, it would need to, e.g.,

  • run some simulations (using random/educated guesses),
  • check whether any of these violate the specification,
  • if not, compute an overapproximation of the reachable set,
  • check whether that set violates the specification.

Thus, the tool output could look like:

  • BENCHMARK: Adaptive Cruise Controller (ACC)
  • Time to compute random Simulations: 0.91241
  • Time to check Violation in Simulations: 0.002291
  • Time to compute Reachable Set: 1.7505
  • Time to check Verification: 0.028102
  • Result: VERIFIED
  • Total Time: 2.6933
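
To make the proposal concrete, a minimal Python sketch of such a two-phase procedure could look as follows; simulate(), violates(), reach_set(), and the benchmark object are purely illustrative stand-ins for whatever the respective tool provides:

    import time

    # simulate(), violates(), and reach_set() stand in for tool-specific hooks;
    # `benchmark` is assumed to carry a name and a specification.
    def analyze(benchmark, n_sims=100):
        times = {}

        # Phase 1: try to falsify via simulations (random/educated guesses).
        t0 = time.perf_counter()
        sims = [simulate(benchmark) for _ in range(n_sims)]
        times["compute random Simulations"] = time.perf_counter() - t0

        t0 = time.perf_counter()
        falsified = any(violates(s, benchmark.spec) for s in sims)
        times["check Violation in Simulations"] = time.perf_counter() - t0

        result = "VIOLATED"
        if not falsified:
            # Phase 2: try to verify via an overapproximated reachable set.
            t0 = time.perf_counter()
            reach = reach_set(benchmark)
            times["compute Reachable Set"] = time.perf_counter() - t0

            t0 = time.perf_counter()
            result = "VERIFIED" if not violates(reach, benchmark.spec) else "UNKNOWN"
            times["check Verification"] = time.perf_counter() - t0

        print("BENCHMARK:", benchmark.name)
        for step, t in times.items():
            print(f"Time to {step}: {t:.4g}")
        print("Result:", result)
        print(f"Total Time: {sum(times.values()):.4g}")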

kidlovedog
New Benchmarks

Hi all,

I am proposing two new benchmarks: a 6D attitude-control benchmark and a 12D quadrotor benchmark. The model descriptions and neural-network controllers in .onnx format are attached.

Update 15 June 2022: Bugs in the .onnx files have been fixed. Please download the newest version.

Attachment: ARCH-COMP2022.zip (ZIP archive, 210.69 KB)

toladner
Stopping Criterion

When do we consider these benchmarks verified/violated? All benchmarks from last year had a final time step, whereas these benchmarks are about showing stability.

A violation criterion could be that a simulation run is still outside some goal region after x seconds; a verification criterion could be that the reachable set of a time step is a subset of that of a previous time step and lies within the goal region.

The suggested verification criterion might become an issue when the set oscillates around some desired value. E.g., for the attitude-control benchmark, I have attached some example plots: the angular velocity keeps oscillating around 0.
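
For box overapproximations, the suggested verification criterion could be checked roughly like this (a minimal Python sketch; the box representation as (lo, hi) pairs is an assumption, not tied to any particular tool):

    import numpy as np

    def contains(outer, inner):
        """True if box inner = (lo, hi) lies inside box outer = (lo, hi)."""
        return bool(np.all(outer[0] <= inner[0]) and np.all(inner[1] <= outer[1]))

    def verified(boxes, goal):
        """boxes[k] overapproximates the reachable set at time step k.
        Verified once some box lies inside the goal region and inside the
        box of an earlier time step (i.e., the sets have stopped growing)."""
        return any(
            contains(goal, box) and any(contains(boxes[j], box) for j in range(k))
            for k, box in enumerate(boxes)
        )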

Attachments: 08_attitudeControl_01.png (PNG image, 26.18 KB), 08_attitudeControl_02.png (PNG image, 26.54 KB)

uravaioli
New Benchmark - 2D Spacecraft Docking

Proposing a new benchmark environment from the AFRL Safe Autonomy Team.

This environment is a simplified 2D satellite-docking problem in which a point-mass satellite with thrusters in the +/- x/y directions attempts to navigate to the origin while obeying a distance-dependent velocity limit.

Included are the neural-network model in ONNX format, the MATLAB system dynamics, a specification .txt, and a detailed description PDF.
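
Purely as an illustration, checking a distance-dependent velocity limit along a simulated trajectory could look like the Python sketch below; the constants C0 and C1 are placeholders, not the values from the attached specification:

    import numpy as np

    C0, C1 = 0.2, 2.0  # placeholder constants; see the attached spec for the real ones

    def trajectory_safe(states):
        """states: array with rows [x, y, vx, vy] along a simulation run."""
        pos, vel = states[:, :2], states[:, 2:]
        speed = np.linalg.norm(vel, axis=1)
        limit = C0 + C1 * np.linalg.norm(pos, axis=1)
        return bool(np.all(speed <= limit))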

Looking forward to getting community feedback!

Attachment: afrl_docking_benchmark_arch_comp_2022.zip (ZIP archive, 690.95 KB)

schillic
Comments on proposals

Thanks for the proposals. I have some comments and questions:


@toladner#1: I agree with you that the falsification cases are treated rather arbitrarily. Still, I am hesitant to enforce such a rule this year. One reason is that not many tools participated last time, and while what you propose (running random simulations) is conceptually simple, it still creates another entry burden. The second reason is that we are already close to the submission deadline. Next year we can have an early discussion about how to treat those cases. If you think that the current comparison does not make sense, I would be fine with dropping the falsification benchmarks for this year, given that (I guess) most tools are interested in verification anyway.


@kidlovedog#2: I agree with @toladner#3 that we would need a clarification about the specifications:

1) For the attitude controller, you do not provide any specification. For instance, the models last year all specified a time horizon and either a safety property or a "guaranteed-reachability" property.
2) For the quadrotor, can you provide a time horizon for those tools that cannot check unbounded time?
3) For the quadrotor, I think there is a typo in the document you sent. Please compare x9' with the corresponding line in your code.


@uravaioli#4: Can you provide a time horizon for those tools that cannot check unbounded time?

uravaioli
Time Horizon for Spacecraft Docking Benchmark

@schillic

To be honest, we don't know what a reasonable time horizon would be and were hoping for some feedback from the community.

Let's start with 40 time steps.

kidlovedog
New Update on Two Benchmarks

Thanks for the comment.

I have updated the document; please see the attachment.

Update 2 July 2022:

Attachment: ARCH-COMP2022.zip (ZIP archive, 213.24 KB)

schillic
Update on Two Benchmarks

@kidlovedog: Thanks! There is still a typo in the equation (- instead of +).

As for results, we can easily verify the first model (attitude control) in 1.6 seconds. The second model (quadrotor) is tough even from a very small initial region. In simulations the system looks fine and stable, though.

toladner
Typo

Sorry, but where is the typo?

Having the same trouble as @schillic with the quadrotor benchmark. The spacecraft-docking benchmark also seems hard for our approach.

schillic
Typo

x9' is supposed to be the sum of two expressions, but it is the difference in the PDF. See here for the implementation this is based on.

schillic
Preprocessing in 2D Spacecraft Docking

@uravaioli#4: Could you also provide a network without the pre-/postprocessing included plus a description of what these operations look like? The ONNX parser we have to use does not support matrix multiplication.

uravaioli
Matrix Multiplication Alternative?

@schillic

I can work on providing this; however, the core network itself relies on matrix multiplication to implement the neural-network layers. Which set of ONNX operations for implementing neural networks do you support?

schillic
Matrix multiplication

@uravaioli

To be clear, I meant matrix-matrix multiplication, which I thought comes from the preprocessing map P followed by the first weight matrix W. I imagined that you could instead just drop P and replace W by the result of P * W. If you also need this feature in the inner layers and cannot just write the resulting matrix instead, then we will probably not support this model, and I suggest not spending time on an alternative. We probably only support standard network architectures where a layer has the form layer(x) = act(x * W + b). But this is fine; not every tool has to support every model.
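
For what it's worth, the identity behind this folding is easy to check numerically under the same row-vector convention (all matrices below are random stand-ins):

    import numpy as np

    rng = np.random.default_rng(0)
    P = rng.standard_normal((4, 4))   # preprocessing map (stand-in)
    W = rng.standard_normal((4, 8))   # first-layer weight matrix (stand-in)
    b = rng.standard_normal(8)        # first-layer bias (stand-in)

    x = rng.standard_normal((1, 4))
    # (x @ P) @ W + b == x @ (P @ W) + b, so P can be dropped once W is
    # replaced by P @ W:
    assert np.allclose((x @ P) @ W + b, x @ (P @ W) + b)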

uravaioli
Matrix Multiplication

@schillic

Thank you for the clarification. The uploaded model should contain only vector-matrix multiplications; the only difference from a standard neural-network layer is that the preprocessing omits b (implicitly 0) and act (implicitly a linear activation). The input to the preprocessing is a vector, and the output should be a vector.

Is your parser throwing an error when attempting to parse the model?

schillic
Matrix multiplication

Exactly, I get this error:

+ Error: Error while loading node ("model/preprocess_norm_scale/MatMul:0",) = MatMul("input_1", "model/preprocess_norm_scale/MatMul/ReadVariableOp:0")
+ @ ONNX ~/.julia/packages/ONNX/VJnjv/src/load.jl:66
ERROR: LoadError: MatMul with arrays of 1 and 2 is not implemented yet

Maybe it would help if you inserted a zero b.
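
A possible workaround, sketched with the Python onnx package (untested against this particular model; it assumes each affected MatMul reads its weight matrix from a stored initializer):

    import numpy as np
    import onnx
    from onnx import helper, numpy_helper

    model = onnx.load("model.onnx")  # path is illustrative
    graph = model.graph
    inits = {t.name: t for t in graph.initializer}

    new_nodes = []
    for node in graph.node:
        if node.op_type == "MatMul" and node.input[1] in inits:
            # Replace MatMul(x, W) by Gemm(x, W, b) with an explicit zero bias.
            W = numpy_helper.to_array(inits[node.input[1]])
            b_name = node.output[0] + "_zero_bias"
            graph.initializer.append(
                numpy_helper.from_array(np.zeros(W.shape[1], dtype=W.dtype), b_name))
            node = helper.make_node(
                "Gemm", [node.input[0], node.input[1], b_name], list(node.output))
        new_nodes.append(node)

    del graph.node[:]
    graph.node.extend(new_nodes)
    onnx.save(model, "model_with_bias.onnx")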

toladner
spacecraft docking benchmark

@uravaioli

I'm also a bit puzzled by the spacecraft-docking benchmark. I'm unsure whether the network is loaded correctly. I have attached an image of some simulation runs. Do they look correct?

In general, I think it would be nice to have an image of some simulation runs from the benchmark proposer for a quick sanity check: whether the network is loaded correctly, whether any normalization is missing, etc.

Attachment: spacecraft.png (PNG image, 19.6 KB)

manzand
Benchmark Repository

Hi everyone,

We have created a new GitHub repo for this year's benchmarks: archcomp2022

For the docking benchmark, I have reconverted the model to ONNX, adding bias vectors to the pre- and post-processing, following this function: load_docking_controller.m. The converted model can be found here: bias_model.onnx.

schillic
Spacecraft docking

Hi @manzand, thanks for the efforts. I sent a few fixes.

About the bias_model.onnx for the spacecraft-docking model: Unfortunately, the parser we use also cannot deal with "Sub" operations. But the .mat file is helpful, thanks for that! I noticed that the output layer is 4-dimensional, but I guess it should be 2-dimensional:

W = [1 0 0 0; 0 0 1 0; 0 0 0 0; 0 0 0 0]
b = [0, 0, 0, 0]
act = Id

I guess it is safe to just remove the last two output dimensions, which are just 0 anyway.
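
For instance, under the column-vector convention act(W * x + b) that the matrix above suggests, truncating the layer would amount to (a hypothetical sketch):

    import numpy as np

    W = np.array([[1., 0., 0., 0.],
                  [0., 0., 1., 0.],
                  [0., 0., 0., 0.],
                  [0., 0., 0., 0.]])
    b = np.zeros(4)

    W2, b2 = W[:2, :], b[:2]  # drop the two all-zero output rows
    x = np.random.rand(4)
    assert np.allclose((W @ x + b)[:2], W2 @ x + b2)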

As for results, simulations suggest that the model is safe, but our tool cannot prove that except for a very small initial set.

manzand
Spacecraft docking

Hi Christian,

Thanks for the feedback. I used MATLAB to export this ONNX network with the exportONNXNetwork function. There, we can specify different OpsetVersions, which correspond to different ways of saving the network. Depending on the version used (default = 8), it may work with other parsers/importers. Do you know whether another version may work with your parser? I tested saving and loading from and into MATLAB using all possible versions, and nothing seemed to change when loading with MATLAB, but this will of course differ across parsers (PyTorch/TensorFlow/Julia...).
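
For reference, one can quickly inspect which opset an exported file declares using the Python onnx package (file name as in the repo):

    import onnx

    model = onnx.load("bias_model.onnx")
    for imp in model.opset_import:
        print(imp.domain or "ai.onnx", imp.version)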

As far as results go, I have not tested it yet. If this benchmark becomes an issue for all participants, we can change the specification (e.g., a smaller initial set) for this or future iterations. We also do not need to include it in this year's report, but we should all try to verify it, to at least get a sense of what challenges this benchmark presents and what makes sense to consider in the future.

toladner
Difficult Benchmarks

@manzand

I agree. We should include the benchmark in the report even if it is an issue for all participants. This shows what current tools cannot (yet) solve and we might see progress in the future. For future iterations, having a simpler version of this benchmark might be helpful.
