Error evaluating TS/Java #173

memray · 2023-07-23T06:49:20Z

Hi,

I'm using MultiPL-E to evaluate models on different languages (latest version, and make test passes). However, the scores on Java and TypeScript don't follow the trend I see on the other languages. I checked some outputs and noticed that many programs are marked SyntaxError even though the code is logically correct.
(1) Is it possible that something is wrong with the environment? For example, import org.javatuples.*; looks suspicious, since the code often runs fine after removing this import.
(2) Besides status, could we also log the complete error output from the compiler/executor, so we have a better idea of what went wrong?

Thank you!
Rui

Example of Java (HumanEval_28_concatenate):

    {
      "program": "import java.util.*;\nimport java.lang.reflect.*;\nimport org.javatuples.*;\nimport java.security.*;\nimport java.math.*;\nimport java.io.*;\nimport java.util.stream.*;\nclass Problem {\n    // Concatenate array list of strings into a single string\n    // >>> concatenate((new ArrayList<String>(Arrays.asList())))\n    // (\"\")\n    // >>> concatenate((new ArrayList<String>(Arrays.asList((String)\"a\", (String)\"b\", (String)\"c\"))))\n    // (\"abc\")\n    public static String concatenate(ArrayList<String> strings) {\n        StringBuilder sb = new StringBuilder();\n        for (String s : strings) {\n            sb.append(s);\n        }\n        return sb.toString();\n    }\n    public static void main(String[] args) {\n    assert(concatenate((new ArrayList<String>(Arrays.asList()))).equals((\"\")));\n    assert(concatenate((new ArrayList<String>(Arrays.asList((String)\"x\", (String)\"y\", (String)\"z\")))).equals((\"xyz\")));\n    assert(concatenate((new ArrayList<String>(Arrays.asList((String)\"x\", (String)\"y\", (String)\"z\", (String)\"w\", (String)\"k\")))).equals((\"xyzwk\")));\n    }\n\n}\n",
      "timestamp": 1690088869,
      "stdout": "",
      "stderr": "",
      "exit_code": -1,
      "status": "SyntaxError"
    },

Example of TS (HumanEval_0_has_close_elements):

    {
      "program": "//Check if in given array of numbers, are any two numbers closer to each other than\n// given threshold.\n// >>> has_close_elements([1.0, 2.0, 3.0], 0.5)\n// false\n// >>> has_close_elements([1.0, 2.8, 3.0, 4.0, 5.0, 2.0], 0.3)\n// true\nfunction has_close_elements(numbers: number[], threshold: number): boolean {\n  for (let i = 0; i < numbers.length - 1; i++) {\n    for (let j = i + 1; j < numbers.length; j++) {\n      if (Math.abs(numbers[i] - numbers[j]) < threshold) {\n        return true;\n      }\n    }\n  }\n  return false;\n}\n\ndeclare var require: any;\nconst assert = require('node:assert');\n\n\nfunction test() {\n  let candidate = has_close_elements;\n  assert.deepEqual(candidate([1.0, 2.0, 3.9, 4.0, 5.0, 2.2], 0.3),true);\n  assert.deepEqual(candidate([1.0, 2.0, 3.9, 4.0, 5.0, 2.2], 0.05),false);\n  assert.deepEqual(candidate([1.0, 2.0, 5.9, 4.0, 5.0], 0.95),true);\n  assert.deepEqual(candidate([1.0, 2.0, 5.9, 4.0, 5.0], 0.8),false);\n  assert.deepEqual(candidate([1.0, 2.0, 3.0, 4.0, 5.0, 2.0], 0.1),true);\n  assert.deepEqual(candidate([1.1, 2.2, 3.1, 4.1, 5.1], 1.0),true);\n  assert.deepEqual(candidate([1.1, 2.2, 3.1, 4.1, 5.1], 0.5),false);\n}\n\ntest();",
      "timestamp": 1690088212,
      "stdout": "",
      "stderr": "",
      "exit_code": -1,
      "status": "SyntaxError"
    },
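
Regarding question (2), one way to see the full compiler or runtime output for a single program is to run it by hand outside the harness. The following is a minimal sketch, not MultiPL-E's own code: run_and_capture and its returned keys are hypothetical names chosen here, and the demo command is a stand-in for whatever compiler or interpreter invocation applies to the saved program.

```python
import subprocess
import sys


def _as_text(data):
    # TimeoutExpired may carry bytes or None even when text=True was requested.
    if data is None:
        return ""
    return data.decode(errors="replace") if isinstance(data, bytes) else data


def run_and_capture(cmd, timeout_seconds=15):
    """Run cmd (a list of strings) and return its full output and exit code."""
    try:
        proc = subprocess.run(
            cmd, capture_output=True, text=True, timeout=timeout_seconds
        )
    except subprocess.TimeoutExpired as exc:
        # The process ran too long; keep whatever output was collected.
        return {"stdout": _as_text(exc.stdout), "stderr": _as_text(exc.stderr),
                "exit_code": None, "note": "timeout"}
    except OSError as exc:
        # The command could not be started at all (for example, not installed).
        return {"stdout": "", "stderr": "", "exit_code": None,
                "note": f"could not start: {exc}"}
    return {"stdout": proc.stdout, "stderr": proc.stderr,
            "exit_code": proc.returncode, "note": ""}


if __name__ == "__main__":
    # Stand-in command; replace with the real compiler or interpreter call.
    demo = [sys.executable, "-c",
            "import sys; print('out'); print('err', file=sys.stderr); sys.exit(3)"]
    print(run_and_capture(demo))
```

Keeping a separate note for "could not start" and "timeout" makes it possible to tell a harness or environment problem apart from a genuine error in the generated program.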

arjunguha (Maintainer) · 2023-07-23T19:45:36Z

I've seen this kind of failure before. The system does record stdout and stderr as the output above suggests. However, there are certain catastrophic cases where nothing is recorded, and you get:

    "stdout": "",
    "stderr": "",
    "exit_code": -1,
    "status": "SyntaxError"

First question -- are you using the MultiPL-E container?

memray (Author) · 2023-07-23T19:50:14Z

Yes. This is the command I used, on macOS:

    podman run --pids-limit -1 --rm --network none --volume $SAMPLE_DIR:/inputs:ro --volume $SAMPLE_DIR:/outputs:rw multipl-e-evaluation --dir /inputs --output-dir /outputs --recursive --max-workers 3

I also have the impression that the results changed (towards this bad output) after I pulled the latest Docker image. I'm not sure what happened.

arjunguha (Maintainer) · 2023-07-23T20:49:12Z

Yeah, this is an annoying type of error that has been hard to diagnose. It is usually obvious when it happens: there is an error, but nothing is recorded in the output files.

We have this script that looks for it in the files:

https://github.com/nuprl/MultiPL-E/blob/main/find_potential_faults.py
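
To illustrate the pattern being looked for, here is a minimal sketch that flags entries shaped like the ones quoted in this thread. It is not the linked script and may differ from it: the function names are hypothetical, and the check (exit_code of -1 with both stdout and stderr empty) is inferred only from the examples above.

```python
def is_silent_failure(result):
    """True when a run failed but left no output to diagnose it with."""
    return (
        result.get("exit_code") == -1
        and not result.get("stdout")
        and not result.get("stderr")
    )


def find_silent_failures(results):
    """Return the entries that look like harness faults rather than code errors."""
    return [r for r in results if is_silent_failure(r)]


if __name__ == "__main__":
    # Hand-written entries mirroring the fields shown in this thread.
    sample = [
        {"stdout": "", "stderr": "", "exit_code": -1, "status": "SyntaxError"},
        {"stdout": "", "stderr": "some compiler message", "exit_code": 1,
         "status": "SyntaxError"},
        {"stdout": "", "stderr": "", "exit_code": 0, "status": "OK"},
    ]
    for entry in find_silent_failures(sample):
        print("suspicious:", entry["status"], entry["exit_code"])
```

An entry with a non-empty stderr is left alone, since in that case the compiler or runtime did report something and the failure can be diagnosed from the output.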
