We currently summarize the input / output distribution with min, max, mean, median, p90 and p99, but these statistics still don't fully convey the extreme cases. A histogram showing the request count in each bucket would be ideal.
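As a rough sketch of the per-bucket request count, assuming the per-request token counts are available as a plain list: the function name `bucket_counts`, the bucket edges, and the sample values below are all hypothetical and only meant to illustrate the idea.

```python
from bisect import bisect_right

def bucket_counts(token_counts, edges):
    """Count requests per half-open bucket [edges[i], edges[i+1]).

    `edges` must be sorted in ascending order. Values below the first
    edge or at/above the last edge are tallied separately as out of range.
    """
    counts = [0] * (len(edges) - 1)
    out_of_range = 0
    for n in token_counts:
        i = bisect_right(edges, n) - 1  # index of the bucket containing n
        if 0 <= i < len(counts):
            counts[i] += 1
        else:
            out_of_range += 1
    return counts, out_of_range

# Hypothetical input token counts and bucket edges, for illustration only.
input_tokens = [12, 80, 130, 130, 510, 2040, 7900, 9000]
edges = [0, 128, 512, 2048, 8192]

counts, out_of_range = bucket_counts(input_tokens, edges)
for lo, hi, c in zip(edges, edges[1:], counts):
    print(f"[{lo}, {hi}): {c} requests")
print(f"out of range: {out_of_range} requests")
```

Reporting the out-of-range count separately keeps the extreme requests visible instead of silently dropping them.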
Also, compute the epsilon against the actual dataset, since the actual input and output can differ slightly from the dataset, either because only a portion of the dataset is used or because the exact number of output tokens is not generated.
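One possible reading of the epsilon is the relative difference between the mean token count in the dataset and the mean actually observed during the run. The sketch below assumes that interpretation; the name `relative_epsilon` and the sample numbers are hypothetical.

```python
from statistics import mean

def relative_epsilon(dataset_tokens, observed_tokens):
    """Relative difference of the observed mean from the dataset mean.

    The two lists may have different lengths, e.g. when only a portion
    of the dataset was actually sent.
    """
    expected = mean(dataset_tokens)
    actual = mean(observed_tokens)
    return (actual - expected) / expected

# Hypothetical output token counts: dataset values vs. what was generated.
dataset_output = [100, 200, 300]
observed_output = [90, 200, 280]

eps = relative_epsilon(dataset_output, observed_output)
print(f"output epsilon: {eps:+.2%}")
```

The same function could be applied to input and output token counts separately, so each gets its own epsilon.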