The first batch of the UNIPEN on-line handwritten database is a 'training set', containing enough data for you to extract your own local training and test sets. The actual development and test set is retained at the iUF for future benchmarking events.
As of summer 2011, the train_r01_v07 set is freely available via http://www.unipen.org/products.html. No guarantees can be given concerning the truth labels. In the past, Internet-based procedures were used to incrementally improve the labeling quality on the basis of corrections uploaded by database users. With sufficient interest, this approach can be continued in the coming years.
Referring to UNIPEN data usage:
http://unipen.nici.kun.nl/unipen-ref.html
Conditions of use of the iUF UNIPEN CDROM:
http://unipen.nici.kun.nl/cdroms/unipen-conditions-of-use.html
cat   nsegm   nfiles  description
1a    15953      634  isolated digits
1b    28069     1423  isolated upper case
1c    61351     2145  isolated lower case
1d    17286     1222  isolated symbols (punctuations etc.)
2    122628     2735  isolated characters, mixed case
3     67352     1949  isolated characters in the context of words or texts
4         0        0  isolated printed words, not mixed with digits and symbols
5         0        0  isolated printed words, full character set
6     75529     3298  isolated cursive or mixed-style words (without digits and symbols)
7     85213     3393  isolated words, any style, full character set
8     14544     4563  text: (minimally two words of) free text, full character set
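As a quick sanity check on the counts listed above, the following minimal Python sketch loads them into a list and computes column sums and the mean number of segments per file. The names CATEGORIES and summarize are hypothetical and not part of any UNIPEN tool. The source does not say whether categories share underlying data, so the totals are plain column sums and nothing more.

```python
# (category, nsegm, nfiles, description), copied from the listing above.
CATEGORIES = [
    ("1a", 15953, 634, "isolated digits"),
    ("1b", 28069, 1423, "isolated upper case"),
    ("1c", 61351, 2145, "isolated lower case"),
    ("1d", 17286, 1222, "isolated symbols (punctuations etc.)"),
    ("2", 122628, 2735, "isolated characters, mixed case"),
    ("3", 67352, 1949, "isolated characters in the context of words or texts"),
    ("4", 0, 0, "isolated printed words, not mixed with digits and symbols"),
    ("5", 0, 0, "isolated printed words, full character set"),
    ("6", 75529, 3298, "isolated cursive or mixed-style words"),
    ("7", 85213, 3393, "isolated words, any style, full character set"),
    ("8", 14544, 4563, "free text, full character set"),
]


def summarize(rows):
    """Return (sum of nsegm, sum of nfiles, mean segments per file by category)."""
    total_segments = sum(nsegm for _, nsegm, _, _ in rows)
    total_files = sum(nfiles for _, _, nfiles, _ in rows)
    # Empty categories (4 and 5) are skipped to avoid dividing by zero.
    per_file = {cat: nsegm / nfiles for cat, nsegm, nfiles, _ in rows if nfiles > 0}
    return total_segments, total_files, per_file


total_segments, total_files, per_file = summarize(CATEGORIES)
print(f"column sums: {total_segments} segments, {total_files} files")
for cat, ratio in per_file.items():
    print(f"category {cat}: {ratio:.1f} segments per file")
```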
For more information, consult the CDROM-README. For a description and examples of the UNIPEN format, see the iUF homepage (http://unipen.org).
The UNIPEN data set is difficult to recognize. Although the data format is standardized, the underlying data sources are highly variable in terms of (1) tablets, (2) drivers and (3) the signal type (e.g. equidistant in time, equidistant in space, or not equidistant at all). There are also labeling and segmentation problems. It is our hope that these will be reduced through the feedback that we obtain from you.
However, the basic philosophy of UNIPEN remains (as before) that the set should be representative, rather than academically cleaned according to rules and heuristics which cannot be understood post hoc. Therefore, feedback on truth values along the lines of "this 8 looks too sloppy, please remove it" will not be incorporated. The quality label "BAD" of the .SEGMENT entries takes care of this problem, and hints to that effect may or may not be followed.
By contrast, reports on blatant labeling and segmentation errors are highly welcome! Such reports can be sent to us via email, using an ASCII file that contains the .SEGMENT entries as you believe they should read, and that refers to the exact directory path of the file which has the problem.
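Because the signal type differs between sources, users often bring pen trajectories to a common sampling scheme before comparing them. The following is a minimal Python sketch of one such normalization, resampling a stroke so that points are equidistant in space along the pen path. It is not part of UNIPEN or its tools: the function name resample_equidistant, the representation of a stroke as a list of (x, y) tuples, and the choice of linear interpolation are all assumptions made for illustration.

```python
import math


def resample_equidistant(points, spacing):
    """Resample a polyline of (x, y) points at fixed arc-length steps.

    The first input point is always kept; the last input point is kept
    only if it happens to fall exactly on a multiple of `spacing`.
    """
    if spacing <= 0:
        raise ValueError("spacing must be positive")
    if not points:
        return []
    out = [points[0]]
    dist_to_next = spacing  # arc length left before the next output point
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        seg = math.hypot(x1 - x0, y1 - y0)
        pos = 0.0  # distance already travelled along this segment
        while seg - pos >= dist_to_next:
            pos += dist_to_next
            t = pos / seg  # seg > 0 here because dist_to_next > 0
            out.append((x0 + t * (x1 - x0), y0 + t * (y1 - y0)))
            dist_to_next = spacing
        # Carry the unused remainder of this segment over to the next one.
        dist_to_next -= seg - pos
    return out


# Hypothetical stroke: an L-shaped path of total length 7.
stroke = [(0.0, 0.0), (3.0, 0.0), (3.0, 4.0)]
resampled = resample_equidistant(stroke, 2.0)
print(resampled)
```

Linear interpolation keeps the sketch short; it ignores time stamps and pen pressure, so it is only suitable where the spatial shape of the trace is what matters.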