from dart_math.eval import *
math_evaluator = EvaluatorMathBatch()
WARNING 12-10 04:54:47 _custom_ops.py:14] Failed to import from vllm._C with ImportError('libcuda.so.1: cannot open shared object file: No such file or directory')
EvaluatorMath (strict_extract:bool=False, use_orig_eq_for_olympiadbench:bool=True, include_percentage:bool=True, rel_tol:float=1e-09, abs_tol:float=1e-08, percent_rel_tol:float=0.001, ascii_only:bool=True)
Evaluator for math problems, capable of extracting the answer segment from a complex response and processing various mathematical objects (e.g. fractions, symbolic expressions, matrices, vectors) and special text (e.g. bool values).
Parameter | Type | Default | Details |
---|---|---|---|
strict_extract | bool | False | |
use_orig_eq_for_olympiadbench | bool | True | Whether to use the original implementation of eq for OlympiadBench. For OlympiadBench, by default, we use the official implementation of eq by He et al. (2024), which utilizes the numerical error range information provided with the query, but keep our own extract_ans, because the official implementation fails to extract a non-negligible portion of answers, especially for base model ICL. You can set use_orig_eq_for_olympiadbench to False to use our implementation of eq for better consistency across benchmarks in our evaluation setting. |
include_percentage | bool | True | Whether to include percentage comparisons. |
rel_tol | float | 1e-09 | The relative tolerance for numerical comparisons. |
abs_tol | float | 1e-08 | The absolute tolerance for numerical comparisons. Necessary for precision issues. |
percent_rel_tol | float | 0.001 | The relative tolerance for percentage comparisons, i.e. between different surface forms of the same value (e.g. 99% vs. 0.99). |
ascii_only | bool | True | Whether to allow only ASCII characters. |
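To make `rel_tol` and `abs_tol` concrete, here is a minimal sketch of a tolerant numeric comparison. It is not the library's code: `num_close` is a hypothetical helper built on the standard library's `math.isclose`, and the assumption is only that the evaluator's tolerances behave in a similar way. It shows why an absolute tolerance matters when the reference value is zero, where a purely relative tolerance can never succeed.

```python
import math

REL_TOL = 1e-09  # default rel_tol from the table above
ABS_TOL = 1e-08  # default abs_tol from the table above


def num_close(a, b, rel_tol=REL_TOL, abs_tol=ABS_TOL):
    """Hypothetical helper (not part of dart_math): tolerant float equality."""
    return math.isclose(a, b, rel_tol=rel_tol, abs_tol=abs_tol)


# Exact float comparison fails because of rounding error.
exact = (0.1 + 0.2 == 0.3)

# The tolerant comparison accepts the tiny rounding difference.
tolerant = num_close(0.1 + 0.2, 0.3)

# sin(pi) is a tiny non-zero float; a relative tolerance alone cannot
# match it against 0.0, but the absolute tolerance can.
rel_only = math.isclose(math.sin(math.pi), 0.0, rel_tol=REL_TOL)
with_abs = num_close(math.sin(math.pi), 0.0)

print(exact, tolerant, rel_only, with_abs)
```

How the percentage comparison governed by `include_percentage` and `percent_rel_tol` is applied is not covered by this sketch.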
EvaluatorMath implements an elaborate evaluation pipeline for mathematical reasoning tasks.
EvaluatorMath can extract the final answer from a long response, for example:
# Answer around "answer"
math_evaluator.extract_ans(
"Both $1$ and $11$ divide $11,$ so $\\boxed{11}=2$, and since $1,$ $2,$ $4,$ $5,$ $10,$ and $20$ divide $20,$ then $\\boxed{20}=6$. The inner expression, $\\boxed{11}\\times\\boxed{20}=2\\times6=12$. Finally, $\\boxed{12}=6$ because $1,$ $2,$ $3,$ $4,$ $6,$ and $12$ divide $12.$\n\nTherefore, $6$ is our answer. Please note that we have not boxed the correct answer as we normally do, as that would be especially confusing for this problem."
)
'6'
# Use the last number by default
math_evaluator.extract_ans(
'First, we need to count the total number of letters in the word "CIRCLE". There are 6 letters.\n\nNext, we need to count the number of distinct letters. There are 6 distinct letters in the word "CIRCLE": C, I, R, L, E, and G.\n\nNow, let\'s consider the arrangements of the distinct letters. The number of ways to arrange n distinct items is n factorial (n!). So, we have 6! = 6 × 5 × 4 × 3 × 2 × 1 = 720 ways to arrange the distinct letters.\n\nHowever, the word "CIRCLE" has one letter that repeats (the letter \'C\' repeats twice). We have over-counted the number of distinct arrangements by including arrangements that are just rotations of each other (for example, "CIRCLE" and "LCIRCE" are considered different arrangements here, but they are the same word when read).\n\nTo correct for this, we divide the total number of arrangements by the number of ways to arrange the repeated letters. The number of ways to arrange 2 identical items is 2! = 2 × 1 = 2. So, we divide the total number of arrangements by 2 to get the correct number of distinct arrangements.\n\nTherefore, the number of ways to arrange the letters of the word "CIRCLE" is 720 ÷ 2 = 360.'
)
# More cases ...
'360'
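The "last number" fallback can be pictured with a small regular-expression sketch. This is an illustration only: `last_number` and `NUMBER_RE` are hypothetical names, and the actual `extract_ans` logic is more elaborate (for instance, it also looks for text around the word "answer", as the first example shows).

```python
import re

# Matches integers and decimals, with an optional leading minus sign.
NUMBER_RE = re.compile(r"-?\d+(?:\.\d+)?")


def last_number(resp):
    """Hypothetical fallback: return the last number in `resp`, or None."""
    matches = NUMBER_RE.findall(resp)
    return matches[-1] if matches else None


resp = "So we have 6! = 720 ways. Dividing by 2 gives 720 / 2 = 360."
print(last_number(resp))  # 360
```

The trailing period after `360` is not captured because the decimal part of the pattern requires at least one digit after the dot.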
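EvaluatorMath, based on regular expressions and SymPy symbolic calculation, is able to correctly process various mathematical objects (e.g. fractions, symbolic expressions, matrices, vectors) and special text (e.g. bool values).
More test cases: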
test_eq(math_evaluator.eq("251,7\\\\ \\noindent", "0"), False)
test_eq(math_evaluator.eq("3.54*10^{-7}", "3.54e-07"), True)
test_eq(math_evaluator.eq(r"\frac{1}{2}", "0.5"), True)
test_eq(math_evaluator.eq("1", "100"), False)
test_eq(math_evaluator.eq("100", "1"), False)
test_eq(math_evaluator.eq("3.04", "0.0304", False), True)
test_eq(math_evaluator.eq(["0.0304", 0.0304], "3.04"), True)
test_eq(math_evaluator.eq("x<-1", "x>3"), False)
test_eq(
    math_evaluator.eq("(-\\infty,0)\\cup(0,\\infty)", "(-\\infty,0)\\cup(0,\\infty)"),
    True,
)
test_eq(math_evaluator.eq("1+2,2+1", "2+1,1+2"), True)
test_eq(math_evaluator.eq(5, 5), True)
test_eq(math_evaluator.eq(0.1 + 0.2, 0.3), True) # `0.1 + 0.2 == 0.3` is `False`
test_eq(math_evaluator.eq("x + y", "y + x"), True)
test_eq(math_evaluator.eq("C", "C"), True)
test_eq(math_evaluator.eq("1,234", "1234"), True)
test_eq(math_evaluator.eq("12,34", "(12,34)"), True)
test_eq(math_evaluator.eq("\\$ 5", "5"), True)
test_eq(math_evaluator.eq("3 * \\sqrt{13}", "3\\sqrt{13}"), True)
test_eq(math_evaluator.eq("\\pi/2", "\\frac{\\pi}{2}"), True)
test_eq(math_evaluator.eq("(3,\\pi/2)", "(3,\\frac{\\pi}{2})"), True)
test_eq(math_evaluator.eq("23000", "\\$23{,}000"), True)
test_eq(
    math_evaluator.eq(r"\left(1,2\right)", r"\left(2,1\right)", compare_sets=True), True
)
test_eq(math_evaluator.eq("White", "white"), True)
test_eq(math_evaluator.eq("[0,3)", "[0,1]"), False)
test_eq(math_evaluator.eq("[0,1]", "[0,3)"), False)
test_eq(math_evaluator.eq("1001.5", "1001"), False)
test_eq(math_evaluator.eq("\\frac{2003}{2}", "1001"), False)
EvaluatorMathBatch (strict_extract:bool=False, use_orig_eq_for_olympiadbench:bool=True, include_percentage:bool=True, rel_tol:float=1e-09, abs_tol:float=1e-08, percent_rel_tol:float=0.001, ascii_only:bool=True, timeout:int=5)
Batch evaluator for math problems, capable of extracting the answer segment from a complex response and processing various mathematical objects (e.g. fractions, symbolic expressions, matrices, vectors) and special text (e.g. bool values).
Parameter | Type | Default | Details |
---|---|---|---|
strict_extract | bool | False | |
use_orig_eq_for_olympiadbench | bool | True | Whether to use the original implementation of eq for OlympiadBench. For OlympiadBench, by default, we use the official implementation of eq by He et al. (2024), which utilizes the numerical error range information provided with the query, but keep our own extract_ans, because the official implementation fails to extract a non-negligible portion of answers, especially for base model ICL. You can set use_orig_eq_for_olympiadbench to False to use our implementation of eq for better consistency across benchmarks in our evaluation setting. |
include_percentage | bool | True | Whether to include percentage comparisons. |
rel_tol | float | 1e-09 | The relative tolerance for numerical comparisons. |
abs_tol | float | 1e-08 | The absolute tolerance for numerical comparisons. Necessary for precision issues. |
percent_rel_tol | float | 0.001 | The relative tolerance for percentage comparisons. |
ascii_only | bool | True | Whether to allow only ASCII characters. |
timeout | int | 5 | The timeout for each evaluation in seconds. |
SymPy symbolic calculation can occasionally take an extremely long time to evaluate.
To address this, we implement EvaluatorMathBatch, which evaluates in batch with a per-evaluation timeout while remaining efficient (it is based on asyncio coroutines instead of the multiprocessing used in previous implementations).
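The general pattern of batch evaluation with a per-item timeout can be sketched with the standard `asyncio` module. This is a minimal illustration under stated assumptions, not the code of EvaluatorMathBatch: `eval_one`, `eval_batch`, and `slow_eq` are hypothetical names, and the blocking check is pushed to a worker thread with `asyncio.to_thread` (Python 3.9+), which may differ from what the library does.

```python
import asyncio
import time


async def eval_one(fn, args, timeout):
    """Run a blocking check in a worker thread; treat a timeout as failure."""
    try:
        return await asyncio.wait_for(asyncio.to_thread(fn, *args), timeout=timeout)
    except asyncio.TimeoutError:
        # The worker thread cannot be killed; we only stop waiting for it.
        return False


async def eval_batch(fn, arg_list, timeout=5):
    """Evaluate all items concurrently, each with its own timeout."""
    return await asyncio.gather(*(eval_one(fn, args, timeout) for args in arg_list))


def slow_eq(a, b, delay):
    time.sleep(delay)  # stand-in for an expensive symbolic comparison
    return a == b


# The second item takes longer than the timeout and is counted as False.
items = [(1, 1, 0.0), (2, 2, 0.6), (1, 2, 0.0)]
results = asyncio.run(eval_batch(slow_eq, items, timeout=0.2))
print(results)  # [True, False, False]
```

One caveat of this thread-based sketch is that a timed-out computation keeps running in the background until it finishes; the timeout only bounds how long the caller waits for each result.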
EvaluatorBase (strict_extract:bool=False)
Base class for evaluators.
EvaluatorBatchBase (strict_extract:bool=False, timeout:int=5)
Base class for batch evaluators, providing an additional method for batch evaluation.
Parameter | Type | Default | Details |
---|---|---|---|
strict_extract | bool | False | |
timeout | int | 5 | The timeout for each evaluation in seconds. |
\(\displaystyle \left(-\infty, 0\right) \cup \left(0, \infty\right)\)
\(\displaystyle \left[\begin{matrix}\sqrt{400 \cos^{2}{\left(\frac{9 \pi}{44} \right)}} & \frac{\pi}{4}\end{matrix}\right]\)
test_eq(math_evaluator.norm_pm("x\\pmy"), "x-y,x+y")
test_eq(math_evaluator.norm_pm("a\\mpb"), "a-b,a+b")
test_eq(math_evaluator.norm_pm("1\\pm\\sqrt{19}"), "1-\\sqrt{19},1+\\sqrt{19}")
test_eq(math_evaluator.norm_pm(r"\{1\pm\sqrt{5},-2\}"), "1-\\sqrt{5},1+\\sqrt{5},-2")
test_eq(
    math_evaluator.norm_pm("\\(\\frac{1\\pm\\sqrt{17}}{4}\\)"),
    "\\frac{1-\\sqrt{17}}{4},\\frac{1+\\sqrt{17}}{4}",
)
test_eq(
    math_evaluator.norm_pm(r"\frac{1\pm\sqrt{1-\frac{2}{\sqrt{3}}}}{1}"),
    "\\frac{1-\\sqrt{1-\\frac{2}{\\sqrt{3}}}}{1},\\frac{1+\\sqrt{1-\\frac{2}{\\sqrt{3}}}}{1}",
)
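The core idea behind these tests, expanding one `\pm` (or `\mp`) into a comma-separated pair with the minus variant first, can be sketched as follows. `expand_pm` is a hypothetical, simplified helper, not the library's `norm_pm`: it assumes a single expression with a single sign, and does not strip delimiters such as `\(...\)` or handle sets with several elements.

```python
import re

# Matches the LaTeX commands \pm and \mp.
PM_RE = re.compile(r"\\(?:pm|mp)")


def expand_pm(expr):
    """Hypothetical, simplified: expand the first plus-minus sign in `expr`."""
    if not PM_RE.search(expr):
        return expr
    minus = PM_RE.sub("-", expr, count=1)
    plus = PM_RE.sub("+", expr, count=1)
    return f"{minus},{plus}"


print(expand_pm("x\\pmy"))
print(expand_pm(r"\frac{1\pm\sqrt{17}}{4}"))
```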