.COMMENT ####################################
         #   UNIPEN 1.0 FORMAT DEFINITION   #
         ####################################

# ----- Copyright (c) 1994, Isabelle Guyon, AT&T Bell Laboratories ------  #
#                                                                          #
# DISCLAIMER:                                                              #
#                                                                          #
# USER SHALL BE FREE TO USE AND COPY THIS SOFTWARE FREE OF CHARGE OR       #
# FURTHER OBLIGATION.                                                      #
#                                                                          #
# THIS SOFTWARE IS NOT OF PRODUCT QUALITY AND MAY HAVE ERRORS OR           #
# DEFECTS.                                                                 #
#                                                                          #
# PROVIDER GIVES NO EXPRESS OR IMPLIED WARRANTY OF ANY KIND AND ANY        #
# IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR PURPOSE ARE        #
# DISCLAIMED.                                                              #
#                                                                          #
# PROVIDER SHALL NOT BE LIABLE FOR ANY DIRECT, INDIRECT, SPECIAL,          #
# INCIDENTAL OR CONSEQUENTIAL DAMAGES ARISING OUT OF ANY USE OF THIS       #
# SOFTWARE.                                                                #
#                                                                          #

The format is self-defined from 3 basic keywords: .COMMENT, .RESERVE and .KEYWORD.

A - DATA TYPES
    ----------

.RESERVE [N]
    Integer or decimal number: digits, optionally separated by a dot
    (decimal point); may start with a sign; no commas allowed.
.RESERVE [S]
    String: any combination of keyboard ASCII symbols, excluding space,
    new-line and tab, and excluding words that start with a dot in the
    first column.
.RESERVE [F]
    Free text: a succession of strings separated by space, new-line and tab.
.RESERVE [R]
    Reserved string: a string which has a special meaning for the UNIPEN
    format, as defined in the reserved string glossary.
.RESERVE [L]
    Label: a string enclosed between double quotes which may contain
    spaces, new-lines or tabs, all counted as spaces; the escape character
    is backslash; inside a label, double quotes should be replaced by \",
    backslash by \\, tabs by \t and new-lines by \n.
.RESERVE [.]
    Repeat the last type until a new type is indicated.
.RESERVE [+]
    Repeat all preceding types any number of times.

.COMMENT B - KEYWORDS
             --------

.KEYWORD .KEYWORD [S] [R] [.] [F]
    Define a new keyword: keyword, argument types, documentation.
.KEYWORD .RESERVE [S] [F]
    Define a new reserved string: reserved string, documentation.
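.COMMENT The escaping rules given for the [L] data type can be sketched in
    Python as follows. This is an illustrative sketch, not part of the UNIPEN
    definition: the names encode_label and decode_label are hypothetical, and
    reading a backslash before any other character (as in the "\ " entry of
    the alphabet) as standing for that character is an assumption.

```python
def encode_label(text: str) -> str:
    """Quote a label, escaping the four characters named for [L]."""
    table = {'"': '\\"', "\\": "\\\\", "\t": "\\t", "\n": "\\n"}
    return '"' + "".join(table.get(ch, ch) for ch in text) + '"'


def decode_label(raw: str) -> str:
    """Turn a quoted [L] label back into the text it stands for."""
    if len(raw) < 2 or raw[0] != '"' or raw[-1] != '"':
        raise ValueError("a label must be enclosed in double quotes")
    escapes = {'"': '"', "\\": "\\", "t": "\t", "n": "\n"}
    body = raw[1:-1]
    out = []
    i = 0
    while i < len(body):
        ch = body[i]
        if ch == "\\":
            if i + 1 >= len(body):
                raise ValueError("dangling backslash at end of label")
            nxt = body[i + 1]
            # Assumption: an unlisted escape such as "\ " yields the character itself.
            out.append(escapes.get(nxt, nxt))
            i += 2
        elif ch == '"':
            raise ValueError("unescaped double quote inside label")
        else:
            # Raw new-lines and tabs inside a label all count as spaces.
            out.append(" " if ch in "\n\t" else ch)
            i += 1
    return "".join(out)
```

    A label that spans two lines in the file therefore decodes to two words
    separated by a single space, while an explicit \n escape is kept as a
    new-line.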
.KEYWORD .COMMENT [F]
    Comments for human reading, to be ignored by the machine parser.
.KEYWORD .INCLUDE [S]
    Name of file to be included as header (e.g. documentation or lexicon
    file).
    ** Do not put the PATH.
    ** No include file should contain another include file.

.COMMENT --------- Mandatory declarations ------------------------------------

.KEYWORD .VERSION [N]
    MANDATORY version number of the format (current version 1.0).
.KEYWORD .DATA_SOURCE [S]
    MANDATORY name of institution or person where the data came from.
.KEYWORD .DATA_ID [S]
    Name of this database.
.KEYWORD .COORD [R] [.]
    Declaration of the coordinates used in .PEN_DOWN and .PEN_UP
    components, a subset of: X, Y, T, P, Z, B, RHO, THETA, PHI,
    including at least X and Y.
.KEYWORD .HIERARCHY [S] [.]
    Declaration of segmentation hierarchy used by .SEGMENT.
    Examples of arguments may be: 0, 1, 2, ..., DOCUMENT, TEXT, PARAGRAPH,
    PAGE, LINE, SENTENCE, WORD, FORMULA, CHARACTER, STROKE, SHEET, GLYPH,
    DIACRITICAL, GESTURE, KEY, etc. This list is not exhaustive.
    A typical hierarchy is: .HIERARCHY SENTENCE WORD CHARACTER.

.COMMENT --------- Data documentation ----------------------------------------

.KEYWORD .DATA_CONTACT [F]
    Where to reach the person responsible for answering questions about
    the database.
.KEYWORD .DATA_INFO [F]
    Nature and structure of the data.
.KEYWORD .SETUP [F]
    Data collection recording conditions.
.KEYWORD .PAD [F]
    Data collection device.

.COMMENT --------- Alphabet -------------------------------------------------

.KEYWORD .ALPHABET [L] [.]
    Declaration of all characters used in data labels.
    In this version of the UNIPEN format, characters are restricted to
    English keyboard ASCII.
    "0" "1" "2" "3" "4" "5" "6" "7" "8" "9"
    "A" "B" "C" "D" "E" "F" "G" "H" "I" "J" "K" "L" "M"
    "N" "O" "P" "Q" "R" "S" "T" "U" "V" "W" "X" "Y" "Z"
    "a" "b" "c" "d" "e" "f" "g" "h" "i" "j" "k" "l" "m"
    "n" "o" "p" "q" "r" "s" "t" "u" "v" "w" "x" "y" "z"
    "~" "!"
"@" "#" "$" "%" "^" "&" "*" "(" ")" "-" "+" "=" "|" "\\" "/" "{" "}" "?" "[" "]" "\"" ":" "<" ">" "," "." ";" "'" "`" "_" "\ " A broader set of characters will be allowed in the next versions. .KEYWORD .ALPHABET_FREQ [N] [.] Natural frequencies of characters in the data (need not add up to one). .COMMENT --------- Lexicon --------------------------------------------------- .KEYWORD .LEXICON_SOURCE [S] Name of institution or person where the lexicon came from. .KEYWORD .LEXICON_ID [S] Name of the lexicon. .KEYWORD .LEXICON_CONTACT [F] Where to reach the person responsible to answer questions about the lexicon. .KEYWORD .LEXICON_INFO [F] Informations about the lexicon. .KEYWORD .LEXICON [L] [.] Representative set of class labels found in the database, generally at the word or character segmentation level. .KEYWORD .LEXICON_FREQ [N] [.] Frequencies of lexical entries defined by .LEXICON. Lexical frequencies characterize the distribution from which data samples were drawn at random. Therefore, the number of times a lexical entry appears in the database should be approximately proportional to the lexical frequencies. Normalizing such that all numbers add-up to one is not necessary. .COMMENT --------- Data layout ----------------------------------------------- .KEYWORD .X_DIM [N] Width of the bounding box, in pixels (using the resolution of the input device, not that of the display). .KEYWORD .Y_DIM [N] Height of the bounding box, in pixels. .KEYWORD .H_LINE [N] [.] Distance in pixels between the bottom of the bounding box and horizontal guidelines, such as a baseline. .KEYWORD .V_LINE [N] [.] Distance in pixels between the left edge of the bounding box and vertical guidelines. .COMMENT --------- Unit system --------------------------------------------- .KEYWORD .X_POINTS_PER_INCH [N] x resolution of the data collection device (1 inch ~ 2.5 cm). .KEYWORD .Y_POINTS_PER_INCH [N] y resolution of the data collection device. 
.KEYWORD .Z_POINTS_PER_INCH [N]
    z (altitude) resolution of the data collection device.
.KEYWORD .X_POINTS_PER_MM [N]
    x resolution of the data collection device (in SI units).
.KEYWORD .Y_POINTS_PER_MM [N]
    y resolution of the data collection device.
.KEYWORD .Z_POINTS_PER_MM [N]
    z (altitude) resolution of the data collection device.
.KEYWORD .POINTS_PER_GRAM [N]
    Pressure resolution of the data collection device.
.KEYWORD .POINTS_PER_SECOND [N]
    Sampling rate, MANDATORY if T not in .COORD.

.COMMENT ------- Pen trajectory ----------------------------------------------

.KEYWORD .PEN_DOWN [N] [.]
    Pen down component: repeated sequences of coordinates as defined by
    .COORD, pen touching the pad surface.
.KEYWORD .PEN_UP [N] [.]
    Pen up component: same as .PEN_DOWN, but with the pen not touching the
    pad surface.
.KEYWORD .DT [N]
    Elapsed time measured when pen coordinates are elided because the pen
    was immobile or out of proximity of the pad sensors.

.COMMENT -------- Data annotations -------------------------------------------

.KEYWORD .DATE [N] [N] [N]
    Date stamp: month, day, year.
.KEYWORD .STYLE [R]
    PRINTED, CURSIVE or MIXED.
.KEYWORD .WRITER_ID [S]
    MANDATORY unique writer identification.
.KEYWORD .COUNTRY [S]
    Country of origin.
.KEYWORD .HAND [R]
    Writer hand: L for left, R for right.
.KEYWORD .AGE [N]
    Writer age, in years.
.KEYWORD .SEX [R]
    Writer sex: M for male, F for female.
.KEYWORD .SKILL [R]
    Skill of writer, familiarity with input device: BAD, OK or GOOD.
.KEYWORD .WRITER_INFO [F]
    Misc. information about writer.
.KEYWORD .SEGMENT [S] [R] [R] [L]
    Type of segment, its delineation, quality and label.
    -> First argument: type of segment, such as the ones declared in
       .HIERARCHY (e.g. SENTENCE, WORD, CHARACTER, etc.).
    -> Second argument: segment delineation by a [A[:M]]-[B[:N]],[C]
       expression (see reserved string glossary).
       Components are numbered in order of appearance in the present data
       set, starting from zero.
       The component counter is reset to zero at the beginning of each new
       file and at each .START_SET.
       Empty pen streams (or "components") are NOT counted.
       If the segment delineation is unambiguous, the second argument may
       either be replaced by ? or, if ALL following arguments are omitted,
       be omitted.
    -> Third argument: quality level, BAD for illegible, OK for regular,
       GOOD for superior, ? for unknown. The quality may be omitted only if
       the fourth argument is also omitted.
    -> Fourth argument: label (sentence, word, character, etc.).
       The label may be omitted.
.KEYWORD .START_SET [S]
    Start a new set; the argument is the set name. The component counter is
    reset to zero and the lexicon is deleted. In the absence of .START_SET,
    the component counter is automatically reset to zero at the beginning
    of each file and the set name is the file name.
.KEYWORD .START_BOX [.]
    Erase all components from the previous data bounding box, and start a
    new one (useful for browsing). No argument. In the absence of
    .START_BOX, segmentation points of highest hierarchy level will be used.

.COMMENT -------- Recognizer documentation -----------------------------------

.KEYWORD .REC_SOURCE [S]
    MANDATORY name of institution or person where the recognizer came from.
.KEYWORD .REC_ID [S]
    MANDATORY recognizer name.
.KEYWORD .REC_CONTACT [F]
    Where to reach the person responsible for answering questions about
    the recognizer.
.KEYWORD .REC_INFO [F]
    Nature and structure of the recognizer, number of free parameters,
    number of training examples.
.KEYWORD .IMPLEMENT [F]
    Recognizer implementation, software and hardware.

.COMMENT -------- Recognizer declarations ------------------------------------

.KEYWORD .TRAINING_SET [S] [S] [S] [R] [+]
    Training data set.
    -> First argument: data source, from .DATA_SOURCE.
    -> Second argument: database name, from .DATA_ID.
    -> Third argument: data set name, from .START_SET or the file name.
    -> Fourth argument: segment of data delineated by a
       [A[:M]]-[B[:N]],[C] expression (see reserved string glossary).
    The four argument types may be repeated any number of times.
.KEYWORD .TEST_SET [S] [S] [S] [R] [+]
    MANDATORY test set. Always disjoint from the training set.
    Same argument types as for .TRAINING_SET.
.KEYWORD .ADAPT_SET [S] [S] [S] [R] [+]
    Writer adaptation set on which the recognizer was fine-tuned to perform
    best on a particular writer.
    Same argument types as for .TRAINING_SET.
.KEYWORD .LEXICON_SET [S] [S] [R] [+]
    Lexicon used by the recognizer.
    -> First argument: lexicon source, from .LEXICON_SOURCE.
    -> Second argument: lexicon name, from .LEXICON_ID.
    -> Third argument: a [N]-[N],[N] expression (see reserved string
       glossary) defining the subset of lexical entries used.
    The second argument may be omitted if only one lexicon is used.

.COMMENT -------- Recognition results ----------------------------------------

.KEYWORD .REC_TIME [R] [N]
    Delineation of a data segment, and recognition time.
    -> First argument: segment of data delineated by a
       [A[:M]]-[B[:N]],[C] expression (see reserved string glossary).
    -> Second argument: REAL recognition time (in seconds) it took with the
       implementation described in .IMPLEMENT.
.KEYWORD .REC_LABELS [S] [R] [R] [L] [.]
    Segment type, its delineation, its acceptance decision and its
    recognition labels.
    -> First argument: type of segment, such as the ones declared in
       .HIERARCHY (e.g. SENTENCE, WORD, CHARACTER, etc.).
    -> Second argument: segment delineation, as in .SEGMENT.
    -> Third argument: ACCEPT, REJECT or ? for unknown.
    -> Following arguments: labels (characters, words, sentences, etc.) in
       order of decreasing likelihood (best guess first).
       Labels may be omitted if the third argument is REJECT.
.KEYWORD .REC_SCORES [S] [R] [N] [N] [.]
    Segment type, its delineation, its acceptance score and its recognition
    scores.
    -> First argument: type of segment, such as the ones declared in
       .HIERARCHY (e.g.
       SENTENCE, WORD, CHARACTER, etc.).
    -> Second argument: segment delineation as in .SEGMENT.
    -> Third argument: acceptance score, high => ACCEPT, low => REJECT,
       0 if unknown.
    -> Following arguments: scores for the labels given in .REC_LABELS in
       order of decreasing values (highest score = best guess, first).

.COMMENT C - RESERVED STRING GLOSSARY
             ------------------------
         (data types are also reserved, see section A)

.RESERVE [N]-[N],[N]
    Compact representation of a list of numbers, used by .LEXICON_SET,
    .SEGMENT, .REC_TIME, .REC_LABELS, .REC_SCORES.
    The list: 2, 3, 4, 5, 15, 9, 50, 51, 52, 53, 54, 55 would be
    represented as: 2-5,15,9,50-55. Commas allow non-contiguous numbers
    (useful for segmentation of delayed strokes, i dots and t bars).
    NO SPACES are allowed in the notation.
.RESERVE [A[:M]]-[B[:N]],[C]
    More flexible representation to delineate segments of data by breaking
    components. A, B and C are component numbers, M and N are point numbers
    in the component. Both components and points are 0-based. M defaults to
    zero and N to the last point in the component. Thus the [N]-[N],[N]
    expressions are special cases of a [A[:M]]-[B[:N]],[C] expression.
    The example 1:40-3,5,6:0-6:12 delineates component 1 from point 40 to
    the end, all of components 2, 3 and 5, and component 6 from the
    beginning to point 12.
    NO SPACES are allowed in the notation.
.RESERVE X
    X position of the pen on the pad surface, in units of X given by
    .X_POINTS_PER_INCH.
.RESERVE Y
    Y position of the pen on the pad surface, in units of Y given by
    .Y_POINTS_PER_INCH.
.RESERVE T
    Time in MILLISECONDS.
.RESERVE P
    Pressure in units of P given by .POINTS_PER_GRAM. Use preferably a
    linearized pressure in 1000 units per gram of force and calibrate the
    zero as the threshold of pen reaching the pad surface. Negative
    pressures can account for remaining non-linearities and hysteresis.
.RESERVE Z
    Altitude above the pad surface, in units of Z given by
    .Z_POINTS_PER_INCH.
.RESERVE BUTTON
    Barrel button states: 0, 1, ...
.RESERVE RHO
    Rotational angle of the stylus, measured in degrees from some nominal
    orientation of the stylus (e. g. barrel button on top). The angle
    increases with clockwise rotation as seen from the rear end of the
    stylus.
.RESERVE THETA
    XY angle of the stylus, measured in degrees, increasing from the X axis
    in the counter-clockwise direction.
.RESERVE PHI
    Z angle of the stylus, increasing from the pad surface, in the positive
    Z direction.
.RESERVE L
    Left handed.
.RESERVE R
    Right handed.
.RESERVE M
    Male.
.RESERVE F
    Female.
.RESERVE BAD
    Unskilled writer, illegible writing.
.RESERVE OK
    Average quality writing, unambiguously legible.
.RESERVE GOOD
    Superior quality writing, most recognizers should get it.
.RESERVE ?
    Unknown.
.RESERVE PRINTED
    Printed handwriting style.
.RESERVE CURSIVE
    Cursive handwriting style.
.RESERVE MIXED
    Mixed printed and cursive handwriting style.
.RESERVE ACCEPT
    Accepted by recognizer.
.RESERVE REJECT
    Rejected by recognizer.
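
.COMMENT The delineation notation described in the glossary, for example
    1:40-3,5,6:0-6:12, can be parsed along the following lines. This is a
    minimal Python sketch, not part of the UNIPEN definition: the names
    parse_delineation and component_numbers are hypothetical. It assumes
    that A and B are always written out in a range, and that a lone item
    of the form C:M means component C from point M to its end; neither
    case is spelled out in the glossary.

```python
from typing import List, Optional, Tuple

# (first component, first point, last component, last point);
# a last point of None means "up to the last point of that component".
Span = Tuple[int, int, int, Optional[int]]


def _parse_endpoint(text: str) -> Tuple[int, Optional[int]]:
    """Split 'A' or 'A:M' into (component, point or None)."""
    comp, sep, point = text.partition(":")
    return int(comp), (int(point) if sep else None)


def parse_delineation(expr: str) -> List[Span]:
    """Parse a comma-separated delineation expression into spans."""
    if not expr or any(ch.isspace() for ch in expr):
        raise ValueError("delineation must be non-empty and contain no spaces")
    spans: List[Span] = []
    for item in expr.split(","):
        first, dash, last = item.partition("-")
        a, m = _parse_endpoint(first)
        if dash:
            b, n = _parse_endpoint(last)
        else:
            b, n = a, None  # single component (assumed: runs to its end)
        # The first point defaults to zero, the last one to the end.
        spans.append((a, 0 if m is None else m, b, n))
    return spans


def component_numbers(spans: List[Span]) -> List[int]:
    """List every component touched by the spans, in the order written."""
    numbers: List[int] = []
    for a, _m, b, _n in spans:
        numbers.extend(range(a, b + 1))
    return numbers
```

    With these definitions, parse_delineation("1:40-3,5,6:0-6:12") returns
    [(1, 40, 3, None), (5, 0, 5, None), (6, 0, 6, 12)], and the compact list
    2-5,15,9,50-55 expands to the twelve numbers given in the glossary.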