UNIPEN Benchmark Overview


Benchmark  Description                                                         Approximate count     Unit
                                                                               in set train-r01-v07

1a         isolated digits                                                     16k                   char
1b         isolated upper case                                                 28k                   char
1c         isolated lower case                                                 61k                   char
1d         isolated symbols (punctuations etc.)                                17k                   char
2          isolated characters, mixed case                                     123k                  char
3          isolated characters in the context of words or texts                67k                   char
4          isolated printed words, not mixed with digits and symbols           -                     word
5          isolated printed words, full character set                          -                     word
6          isolated cursive or mixed-style words (without digits and symbols)  75k                   word
7          isolated words, any style, full character set                       85k                   word
8          text: (minimally two words of) free text, full character set        16k                   text

(Legend: k = 1000, - = count soon available)

Note that only Benchmark #8 is a realistic, application-oriented test, because the recognizer must also solve the word segmentation problem itself. No manual word segmentation is allowed when testing on Benchmark #8.


Lambert Schomaker, Jan. 1997, Dec. 1999