The UNIPEN collection release #1: train_r01_v07

The first batch of the UNIPEN on-line handwriting database is a 'training set', containing enough data for you to extract your own local training and test set configurations. The actual development and test sets are retained at the iUF for future benchmarking events.

As of summer 2011, the train_r01_v07 set is freely available via http://www.unipen.org/products.html. No guarantees can be given concerning the truth labels. In the past, internet-based solutions were used to improve the labeling quality incrementally, on the basis of corrections uploaded by database users. Given sufficient interest, this approach can be continued in the coming years.

Referring to UNIPEN data usage:
http://unipen.nici.kun.nl/unipen-ref.html
Conditions of use of the iUF UNIPEN CDROM:
http://unipen.nici.kun.nl/cdroms/unipen-conditions-of-use.html

What do you get?

The data on the CDROM comprise the so-called train_r01_v07 dataset collected by NIST. The UNIPEN files in this release are organized in the categories listed below; category 1 is split into four subsets, 1a to 1d. For each category, the number of .SEGMENT entries (nsegm) and the number of files (nfiles) are given:

 cat   nsegm  nfiles  description
  1a   15953     634  isolated digits
  1b   28069    1423  isolated upper case
  1c   61351    2145  isolated lower case
  1d   17286    1222  isolated symbols (punctuation etc.)
   2  122628    2735  isolated characters, mixed case
   3   67352    1949  isolated characters in the context of words or texts
   4       0       0  isolated printed words, not mixed with digits and symbols
   5       0       0  isolated printed words, full character set
   6   75529    3298  isolated cursive or mixed-style words (without digits and symbols)
   7   85213    3393  isolated words, any style, full character set
   8   14544    4563  text: (minimally two words of) free text, full character set
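
As a quick sanity check on a local copy, the counts in the table can be put into a small script and compared with what you find on disk. The sketch below only re-adds the numbers listed above; the names COUNTS, totals and empty_categories are illustrative and not part of any UNIPEN tool. The sums are plain row totals: this document does not say whether the categories overlap, so they should not be read as the number of unique segments.

```python
# Counts copied from the category table: (category, nsegm, nfiles).
COUNTS = [
    ("1a", 15953, 634),
    ("1b", 28069, 1423),
    ("1c", 61351, 2145),
    ("1d", 17286, 1222),
    ("2", 122628, 2735),
    ("3", 67352, 1949),
    ("4", 0, 0),
    ("5", 0, 0),
    ("6", 75529, 3298),
    ("7", 85213, 3393),
    ("8", 14544, 4563),
]


def totals(rows):
    """Return (sum of nsegm, sum of nfiles) over all table rows."""
    return sum(row[1] for row in rows), sum(row[2] for row in rows)


def empty_categories(rows):
    """Return the categories that have neither segments nor files."""
    return [cat for cat, nsegm, nfiles in rows if nsegm == 0 and nfiles == 0]


if __name__ == "__main__":
    nsegm, nfiles = totals(COUNTS)
    print("rows:", len(COUNTS))
    print("total nsegm:", nsegm)
    print("total nfiles:", nfiles)
    print("empty categories:", ", ".join(empty_categories(COUNTS)))
```

Categories 4 and 5 are empty in this release, so a recognizer evaluated on isolated printed words has to draw its material from other categories.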

For more information, see the CDROM-README. For a description and examples of the UNIPEN format, see the iUF homepage (http://unipen.org).

Note

The UNIPEN data set is difficult to recognize. Although the data format is standardized, the underlying data sources vary widely in terms of (1) tablets, (2) drivers and (3) signal type (e.g. sampled equidistantly in time, equidistantly in space, or not equidistantly at all).

There are labeling and segmentation problems. We hope that these will be reduced through the feedback we obtain from you. However, the basic philosophy of UNIPEN remains (as before) that the set should be representative, rather than academically cleaned according to rules and heuristics that cannot be understood post hoc. Therefore, feedback on truth values of the kind "this 8 looks too sloppy, please remove it" will not be incorporated. The quality label "BAD" of the .SEGMENT entries takes care of this problem, and hints to that effect may or may not be followed. In contrast, reports on blatant labeling and segmentation errors are highly welcome! Such reports can be sent to us via email, as an ASCII file containing the .SEGMENT entries as you believe they should read, together with the exact directory path of the file that has the problem.
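
Before reporting a problem, it can help to list the .SEGMENT entries of the file in question and see how many already carry the quality label "BAD". The following is a minimal sketch, not an official UNIPEN parser. It assumes that each entry sits on one line in the form `.SEGMENT <hierarchy> <delineation> <quality> "<label>"`; this field order is an assumption and should be checked against the UNIPEN format description. The function names and the sample lines are made up for illustration.

```python
from collections import Counter


def parse_segment(line):
    """Parse one line as a .SEGMENT entry, or return None.

    ASSUMPTION: the fields are, in order, hierarchy, delineation,
    quality and a quoted label. Verify this against the UNIPEN
    format description before relying on it.
    """
    fields = line.strip().split(None, 4)
    if len(fields) != 5 or fields[0] != ".SEGMENT":
        return None
    _, hierarchy, delineation, quality, label = fields
    # Remove one pair of surrounding double quotes from the label, if present.
    if len(label) >= 2 and label[0] == '"' and label[-1] == '"':
        label = label[1:-1]
    return {
        "hierarchy": hierarchy,
        "delineation": delineation,
        "quality": quality,
        "label": label,
    }


def quality_counts(lines):
    """Count .SEGMENT entries per quality label; other lines are skipped."""
    counts = Counter()
    for line in lines:
        segment = parse_segment(line)
        if segment is not None:
            counts[segment["quality"]] += 1
    return counts


if __name__ == "__main__":
    # Hypothetical sample lines, not taken from the actual data set.
    sample = [
        '.SEGMENT CHARACTER 0-1 GOOD "8"',
        '.SEGMENT CHARACTER 2 BAD "8"',
        ".PEN_DOWN",
    ]
    for quality, n in sorted(quality_counts(sample).items()):
        print(quality, n)
```

Entries already marked "BAD" fall under the sloppiness case described above; a report is most useful for entries whose label or delineation is plainly wrong.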