Automatic evaluation of benchmarks
In this group we discuss how to organize the automatic evaluation of benchmarks. This allows us to run all benchmarks on the same machine so that we can perform the repeatability evaluation and the collection of computation times in one go.
Right now docker_batch.py runs specific commands for each tool; for example, for SpaceEx it is hardcoded that you have:
string_cmd = 'docker run -v ' + string_path_host + '/SpaceEx/:/root/ -w /root spaceex -g ' + example[t][i][0] + '/' + example[t][i][1] + '.cfg -m ' + example[t][i][0] + '/' + example[t][i][2] + '.xml' + ' -vl'
for hylaa you have:
string_cmd = 'docker run -w \"' + example[t][i][0] + '\" hylaa python3 ' + example[t][i][0] + example[t][i][1]
There's also hardcoded logic for recording the runtimes. I think the proposed CSV file format avoids the need for tool-specific parsing of the output, but it does not solve the problem of custom run commands. In Taylor's script there is the comment: "some tools provide a script to run everything at once whereas some tools provide a command line call to run each"
I think the easiest solution would be to have a single script with a fixed name in each docker container that runs all the measurements and produces the .csv result file. For example, we would require a shell script called "./measure_all" that produces "results.csv". The general script would then run this command and copy the file from the docker container to the host: "docker cp <containerId>:/file/path/within/container /host/path/target". For example, it would copy "results.csv" to "cora_results.csv" on the host.
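To make the idea concrete, here is a minimal sketch of what the generic runner could look like under this convention (the image names, the /root working directory inside the container, and the host paths are just placeholders, nothing we have agreed on yet):

# Sketch of a generic runner; assumes every tool image ships "./measure_all"
# in its working directory (here assumed to be /root) and writes "results.csv" there.
import subprocess

def run_measurements(tool_image, host_dir):
    container = tool_image + '_run'
    # run the fixed-name measurement script inside the tool's container
    subprocess.run(['docker', 'run', '--name', container, tool_image, './measure_all'], check=True)
    # copy the CSV out of the (now stopped) container, renaming it per tool on the host
    subprocess.run(['docker', 'cp', container + ':/root/results.csv',
                    host_dir + '/' + tool_image + '_results.csv'], check=True)
    subprocess.run(['docker', 'rm', container], check=True)

# e.g. run_measurements('cora', string_path_host) would produce cora_results.csv on the host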
Also, since it's a CSV file, why don't we use commas as the delimiter instead of semicolons? I think it would be more consistent to keep the tool name in the first column rather than on a row by itself, so that every row has the same layout. I also think it is strange to use one delimiter between the benchmark category and the benchmark instance and a different one for the rest of the columns... why not use the same one? So I propose that your example becomes:
CORA,ISSF01,SU01,1,201.33\n
CORA,ISSC01,SS01,1,50.54\n
Finally, what's the intended meaning of "verified"? Does this mean safe/unsafe or success / tool failed?
Independent of Stan's comments (which I support), do you want us to create a single .csv file for each benchmark family (like ISS and Building) or should we rather put everything in one big file? From docker_batch.py it seems to me that you want to load all results into a single dictionary, so I guess one file should be good. Can the script deal with both single and multiple files?
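If we settle on the comma-delimited layout, the distinction should not matter much for the collecting script: it can glob for any number of result files and merge them into one dictionary. A rough sketch (the file-name pattern and the dictionary keys are my own assumptions):

# Sketch: merge one or many result files into a single dictionary, assuming the
# comma-delimited layout proposed above: tool, benchmark, instance, verified, time.
import csv, glob

def load_results(pattern='*_results.csv'):
    results = {}
    for path in glob.glob(pattern):
        with open(path) as f:
            for row in csv.reader(f):
                if len(row) < 5:
                    continue  # skip blank or malformed lines
                tool, benchmark, instance, verified, time = [c.strip() for c in row[:5]]
                results[(tool, benchmark, instance)] = (int(verified), float(time))
    return results

# works the same whether there is one big file or one file per benchmark family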
Finally, what's the intended meaning of "verified"? Does this mean safe/unsafe or success / tool failed?
Indeed, this is unclear. Do we need to report all instances that were not verified with a 0 or can we just not report them? Is there a difference?
In the NLN group we introduced a score (e.g., the width of a given variable at the final time) to quantify the quality of the approximation when the properties are satisfied. This allows us to better frame the performance of the tool.
Additionally, a tool should only provide a score and a timing result if it satisfies the properties. Since failing to verify removes any lower bound on the accuracy used for the computation, it does not make much sense to supply those values.
Summarizing, we don't see a use for the verified field, but an optional score field would be needed. Alternatively, we could duplicate the entries for a given benchmark, with one class for the metric and another for the computation time.
Here is a naming proposal for the NLN category, using additional field(s) to report the metric score as suggested by Luca:
Tool, PRDE20, I, runtime, final volume
Tool, PRDE20, P, runtime, final volume
Tool, PRDE20, IP, runtime, final volume
Tool, CVDP20, mu1, runtime
Tool, CVDP20, mu2, runtime
Tool, LALO20, W001, runtime, final width
Tool, LALO20, W005, runtime, final width
Tool, LALO20, W01, runtime, final width
Tool, QUAD20, delta01, runtime
Tool, QUAD20, delta04, runtime
Tool, QUAD20, delta08, runtime
Tool, LOVO20, , runtime, area, cntmax
Tool, SPRE20, , runtime
A verification tool can output three types of results: verified (property provably satisfied), falsified (property provably violated), and unknown. Currently, most tools in this competition only give "verified" or "unknown", but some do also falsify (e.g., in the PWC category). So I suggest using a third value (-1) for "falsified".
@Luca, I do think the verified field is useful. For example, a buggy tool could simply output the correct result by chance. The interest of falsifiable instances is to serve as a binary test. We supply two versions of the model: one that should give verified and one that should give falsified. Both should be given to the tool with all other parameters and the property being the same. If the test is set up properly, only a tool that correctly solves the problem should be able to always give the correct result, since it won't know which one of the two versions it is being given (ok, I'm assuming no malicious trickery like parsing the model to see which one it is).
Matthias, I saw you commented on the other things, but what about having a single fixed-name command to run the measurements and produce the CSV files (like ./measure_all)? We need to agree on the filenames for the script to work.
I agree with Matthias' suggestion, adding -1 for falsified and an instance-dependent score field.
For the score field to be at least somewhat uniform, I propose that we fix the direction: smaller is better. One could call it "precision measure" or "loss".
Tentative decision: <ToolName>, <benchmark ID>, <instance (possibly empty)>, <verified (-1/0/1)>, <comp. time>, <loss>
We request the ability to supply multiple loss measures.
By the way, since the loss field is optional,
<ToolName>, <benchmark ID>, <instance (possibly empty)>, <verified (-1/0/1)>, <comp. time>
should be a correct CSV entry with no need for a trailing comma.
I would add that, if a tool is not capable of producing a result for a particular benchmark/instance, then its CSV file should still contain a corresponding entry, but with verified=0 and an empty time. At least according to our group, if the specification cannot be met, then no timing or loss values are meaningful.
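To check that the tentative format (including the optional loss columns and the empty time) stays easy to consume, here is a small parsing sketch; the dictionary layout and the handling of multiple loss columns are my own assumptions, not part of the decision:

# Sketch of parsing the tentative format:
# <ToolName>, <benchmark ID>, <instance (possibly empty)>, <verified (-1/0/1)>, <comp. time>, <loss...>
import csv

def parse_results(path):
    entries = []
    with open(path) as f:
        for row in csv.reader(f):
            if len(row) < 4:
                continue  # skip blank or malformed lines
            fields = [c.strip() for c in row]
            tool, benchmark, instance, verified = fields[:4]
            time = fields[4] if len(fields) > 4 else ''
            losses = [float(x) for x in fields[5:] if x]  # zero or more loss columns
            entries.append({
                'tool': tool,
                'benchmark': benchmark,
                'instance': instance,           # may be empty
                'verified': int(verified),      # -1 falsified, 0 unknown/failed, 1 verified
                'time': float(time) if time else None,  # empty when the spec could not be met
                'losses': losses,
            })
    return entries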
Did we specify the output filename as well? Like results.csv? This can obviously be renamed by the high-level script that copies the results from the docker instance to the host system.
Dear all,
Taylor has kicked off the automatic evaluation of benchmarks with a nice script:
https://gitlab.com/goranf/ARCH-COMP/-/blob/master/2019/AFF/docker_batch.py
In order to avoid fetching information from the console (lines after line 167), we propose a simple CSV file structure:
<nameOfTool>\n
<string: BenchmarkID>;<bool: verified>;<double: computationTime>\n
<string: BenchmarkID>;<bool: verified>;<double: computationTime> \n
"End of "<nameOfTool> \n\n #this is just a sanity check
Example:
CORA\n
ISSF01ISU01; 1; 201.33 \n
ISSC01ISS01; 1; 50.54 \n
End of CORA
Please comment soon so that we have time to test this.