Automatic evaluation of benchmarks
In this group we discuss how to organize the automatic evaluation of benchmarks. This allows us to run all benchmarks on the same machine so that we can perform the repeatability evaluation and the collection of computation times in one go.
Right now docker_batch.py runs specific commands for each tool; for example, for SpaceEx it is hardcoded that you have:
string_cmd = 'docker run -v ' + string_path_host + '/SpaceEx/:/root/ -w /root spaceex -g ' + example[t][i][0] + '/' + example[t][i][1] + '.cfg -m ' + example[t][i][0] + '/' + example[t][i][2] + '.xml' + ' -vl'
for hylaa you have:
string_cmd = 'docker run -w \"' + example[t][i][0] + '\" hylaa python3 ' + example[t][i][0] + example[t][i][1]
There's also hardcoded logic for recording the runtimes. I think the proposed CSV file format avoids the need for tool-specific parsing of the output, but it does not solve the problem of custom run commands. In Taylor's script there is the comment: "some tools provide a script to run everything at once whereas some tools provide a command line call to run each"
I think the easiest solution would be to have a single script with a fixed name in each docker container that runs all the measurements and produces the .csv result file. For example, we would require a shell script called "./measure_all" that produces "results.csv". The general script would then run this command and copy the file from the docker container to the host: "docker cp <containerId>:/file/path/within/container /host/path/target". For example, it would copy "results.csv" to "cora_results.csv" on the host.
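To make the idea concrete, here is a minimal sketch of what the generic runner could look like under this convention (the image names, the /root working directory inside the container, and the host paths are just placeholders, nothing we have agreed on yet):

# Sketch of a generic runner; assumes every tool image ships "./measure_all"
# in its working directory (here assumed to be /root) and writes "results.csv" there.
import subprocess

def run_measurements(tool_image, host_dir):
    container = tool_image + '_run'
    # run the fixed-name measurement script inside the tool's container
    subprocess.run(['docker', 'run', '--name', container, tool_image, './measure_all'], check=True)
    # copy the CSV out of the (now stopped) container, renaming it per tool on the host
    subprocess.run(['docker', 'cp', container + ':/root/results.csv',
                    host_dir + '/' + tool_image + '_results.csv'], check=True)
    subprocess.run(['docker', 'rm', container], check=True)

# e.g. run_measurements('cora', string_path_host) would produce cora_results.csv on the host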
Also, since it's a CSV file, why don't we use commas as the delimiter instead of semicolons? I think it would be more consistent to keep the tool name in the first column rather than on a row by itself, so that every row has the same layout. I also think it is strange to use one delimiter between the benchmark category and the benchmark instance and a different one for the rest of the columns... why not use the same one? So I propose that your example becomes:
CORA,ISSF01,SU01,1,201.33\n
CORA,ISSC01,SS01,1,50.54\n
Finally, what's the intended meaning of "verified"? Does this mean safe/unsafe or success / tool failed?
Independent of Stan's comments (which I support), do you want us to create a single .csv file for each benchmark family (like ISS and Building) or should we rather put everything in one big file? From docker_batch.py it seems to me that you want to load all results into a single dictionary, so I guess one file should be good. Can the script deal with both single and multiple files?
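If we settle on the comma-delimited layout, the distinction should not matter much for the collecting script: it can glob for any number of result files and merge them into one dictionary. A rough sketch (the file-name pattern and the dictionary keys are my own assumptions):

# Sketch: merge one or many result files into a single dictionary, assuming the
# comma-delimited layout proposed above: tool, benchmark, instance, verified, time.
import csv, glob

def load_results(pattern='*_results.csv'):
    results = {}
    for path in glob.glob(pattern):
        with open(path) as f:
            for row in csv.reader(f):
                if len(row) < 5:
                    continue  # skip blank or malformed lines
                tool, benchmark, instance, verified, time = [c.strip() for c in row[:5]]
                results[(tool, benchmark, instance)] = (int(verified), float(time))
    return results

# works the same whether there is one big file or one file per benchmark family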
Finally, what's the intended meaning of "verified"? Does this mean safe/unsafe or success / tool failed?
Indeed, this is unclear. Do we need to report all instances that were not verified with a 0 or can we just not report them? Is there a difference?
In the NLN group we introduced a score (e.g., the width of a given variable at the final time) to quantify the quality of the approximation when the properties are satisfied. This allows us to better frame the performance of the tool.
Additionally, a tool should only provide a score and a timing result if it satisfies the properties. Since failing to verify removes any lower bound on the accuracy used for the computation, it does not make much sense to supply those values.
Summarizing, we don't see a use for the verified field, but an optional score field would be needed. Alternatively, we could duplicate the entries for a given benchmark, with one class for the metric and another for the computation time.
Here is a naming proposal for the NLN category, using additional field(s) to report the metric score as suggested by Luca:
Tool, PRDE20, I, runtime, final volume
Tool, PRDE20, P, runtime, final volume
Tool, PRDE20, IP, runtime, final volume
Tool, CVDP20, mu1, runtime
Tool, CVDP20, mu2, runtime
Tool, LALO20, W001, runtime, final width
Tool, LALO20, W005, runtime, final width
Tool, LALO20, W01, runtime, final width
Tool, QUAD20, delta01, runtime
Tool, QUAD20, delta04, runtime
Tool, QUAD20, delta08, runtime
Tool, LOVO20, , runtime, area, cntmax
Tool, SPRE20, , runtime
A verification tool can output three types of results: verified (property provably satisfied), falsified (property provably violated), and unknown. Currently, most tools in this competition only give "verified" or "unknown", but some do also falsify (e.g., in the PWC category). So I suggest using a third value (-1) for "falsified".
@Luca, I do think the verified field is useful. For example, a buggy tool could simply output the correct result by chance. The interest of falsifiable instances is to serve as a binary test. We supply two versions of the model: one that should give verified and one that should give falsified. Both should be given to the tool with all other parameters and the property being the same. If the test is set up properly, only a tool that correctly solves the problem should be able to always give the correct result, since it won't know which one of the two versions it is being given (ok, I'm assuming no malicious trickery like parsing the model to see which one it is).
Matthias, I saw you commented on the other things, but what about having a single fixed-name command to run the measurements and produce the CSV files (like ./measure_all)? We need to agree on the filenames for the script to work.
I agree with Matthias' suggestion, adding -1 for falsified and an instance-dependent score field.
For the score field to be at least somewhat uniform, I propose that we fix the direction: smaller is better. One could call it "precision measure" or "loss".
Tentative decision: <ToolName>, <benchmark ID>, <instance (possibly empty)>, <verified (-1/0/1)>, <comp. time>, <loss>
We request the ability to supply multiple loss measures.
By the way, since the loss field is optional,
<ToolName>, <benchmark ID>, <instance (possibly empty)>, <verified (-1/0/1)>, <comp. time>
should be a correct CSV entry with no need for a trailing comma.
I would add that, if a tool is not capable of producing a result for a particular benchmark/instance, then its CSV file should still contain a corresponding entry, but with verified=0 and an empty time. At least according to our group, if the specification cannot be met, then no timing or loss values are meaningful.
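To check that the tentative format (including the optional loss columns and the empty time) stays easy to consume, here is a small parsing sketch; the dictionary layout and the handling of multiple loss columns are my own assumptions, not part of the decision:

# Sketch of parsing the tentative format:
# <ToolName>, <benchmark ID>, <instance (possibly empty)>, <verified (-1/0/1)>, <comp. time>, <loss...>
import csv

def parse_results(path):
    entries = []
    with open(path) as f:
        for row in csv.reader(f):
            if len(row) < 4:
                continue  # skip blank or malformed lines
            fields = [c.strip() for c in row]
            tool, benchmark, instance, verified = fields[:4]
            time = fields[4] if len(fields) > 4 else ''
            losses = [float(x) for x in fields[5:] if x]  # zero or more loss columns
            entries.append({
                'tool': tool,
                'benchmark': benchmark,
                'instance': instance,           # may be empty
                'verified': int(verified),      # -1 falsified, 0 unknown/failed, 1 verified
                'time': float(time) if time else None,  # empty when the spec could not be met
                'losses': losses,
            })
    return entries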
Did we specify the output filename as well? Like results.csv? This can obviously be renamed by the high-level script that copies the results from the docker instance to the host system.
Dear all,
Taylor has kicked off the automatic evaluation of benchmarks with a nice script:
https://gitlab.com/goranf/ARCH-COMP/-/blob/master/2019/AFF/docker_batch.py
In order to avoid fetching information from the console (lines after line 167), we propose a simple CSV file structure:
<nameOfTool>\n
<string: BenchmarkID>;<bool: verified>;<double: computationTime>\n
<string: BenchmarkID>;<bool: verified>;<double: computationTime> \n
"End of "<nameOfTool> \n\n #this is just a sanity check
Example:
CORA\n
ISSF01ISU01; 1; 201.33 \n
ISSC01ISS01; 1; 50.54 \n
End of CORA
Please comment soon so that we have time to test this.