Using SYSTRAN Dictionary Manager’s (SDM’s) Import feature, you can open dictionaries created with a spreadsheet application, such as Microsoft Excel, or a common text editor. These dictionaries must be carefully formatted before they can be imported into SDM.
Microsoft Excel files
To import a dictionary created with Microsoft Excel, the file must consist of two worksheets named after the tabs in the User Dictionary (UD): Multilingual and Do Not Translate.
As with formatted text files, enter the column headings for the language and information columns of the UD exactly as you want them to appear in SDM.
Sample Excel Spreadsheet
After the Excel file is imported, it appears in SDM as shown below:
Formatted text files
A formatted text file for import into SDM consists of two parts, the header and the dictionary content:
- The header part of the dictionary is a sequence of lines starting with the “#” character and containing a header field followed by its value.
- The content part is a sequence of lines, with each line representing a dictionary entry whose fields are separated by tab characters.
The field types are defined in the header. Every line must have the same number of fields, even if some of them are empty.
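Because a missing tab changes the number of fields on a line, it can be useful to check a file before importing it. The following Python sketch is not part of SDM: the function name `find_ragged_lines` and the sample entries are illustrative assumptions. It reports every content line whose number of tab-separated fields differs from the expected count.

```python
def find_ragged_lines(content_lines, expected_fields):
    """Return (line_number, field_count) for each line with the wrong field count."""
    problems = []
    for number, line in enumerate(content_lines, start=1):
        # Strip only the line ending; a trailing tab marks an empty last field.
        fields = line.rstrip("\r\n").split("\t")
        if len(fields) != expected_fields:
            problems.append((number, len(fields)))
    return problems


# Illustrative entries with three expected fields (source, target, category).
lines = [
    "write cycle\tcycle d'écriture\tnoun",
    "write enable\tvalidation écriture\t",    # empty third field still counts
    "write protect\tprotection en écriture",  # third field is missing
]
problems = find_ragged_lines(lines, expected_fields=3)
print(problems)
```

An empty field is still delimited by its tab, so the second line passes the check while the third is reported.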
Required and optional fields for importing files into SDM
|Header||Description of Input|
|#AUTHOR=||Optional: contains the name of the creator of the dictionary.|
|#EMAIL=||Optional: contains the email address of the creator of the dictionary.|
|#COVERED DOMAINS=||Optional header: lists all domains configured in the dictionary.|
|#ENCODING=||Required: defines the encoding of the file. UTF-8 encoding is recommended.|
|#GENERAL DICTIONARY DOMAINS=||Optional header: lists the system domains associated with the dictionary.|
|#SUMMARY=||Required: the name of the UD file.|
The last two lines of the header section are required.
- The first line defines the dictionary type: #MULTI for a User Dictionary, #TM for a Translation Memory, or #NORM for a Normalization Dictionary. Within a User Dictionary, a #DNT line separates the multilingual entries from the DNT entries.
- The second line lists the columns of the content section. It is a list of codes separated by tab characters, as described in the following table.
Description of the different codes defining the content fields
|XX||Where XX is a 2-letter ISO 639 code in uppercase. This represents a language (see Appendix B. Language Pairs and ISO 639 Codes). The source language is always the first column, with target languages as the following columns.|
|XX_NO||For Normalization Dictionaries only. XX corresponds to the ISO 639 code for the source language. These columns represent the Normalized columns.|
|UPOS||User Part of Speech. This entry corresponds to the SDM Category column.|
|HEADWORD_XX||This column is generated when doing an export. It contains the headword of the corresponding XX field. During import, this column is ignored.|
|DOMAINS||Domains column. Domains are comma separated.|
|PROPOSAL STATUS||Status of the entry (automatically extracted entries have the candidate status).|
|COMMENT||Additional comment on the entry.|
|EXTRACTION CONFIDENCE||Applies to automatically extracted entries; confidence of the extraction on an ascending scale from 0 to 1.|
|PREVIOUS TRANSLATION||Applies to automatically extracted entries; the default SYSTRAN translation.|
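The header rules and column codes described above can be sketched in Python as follows. This is a minimal illustration, not SDM's import logic: the function names `parse_header` and `language_columns` are hypothetical, and the sketch assumes that the column line begins with "#", as it does in the sample files in this section.

```python
import re

# Dictionary type markers named in the header description.
TYPE_MARKERS = {
    "#MULTI": "User Dictionary",
    "#TM": "Translation Memory",
    "#NORM": "Normalization Dictionary",
}


def parse_header(lines):
    """Return (header_fields, dictionary_type, column_codes) for a file's lines."""
    fields, dict_type, columns = {}, None, None
    for raw in lines:
        line = raw.rstrip("\r\n")
        if dict_type is None:
            if line in TYPE_MARKERS:
                dict_type = TYPE_MARKERS[line]
            elif line.startswith("#") and "=" in line:
                key, _, value = line[1:].partition("=")
                fields[key] = value
        else:
            # Assumption: the line after the type marker lists the column codes.
            codes = line[1:] if line.startswith("#") else line
            columns = codes.split("\t")
            break
    return fields, dict_type, columns


def language_columns(columns):
    """Keep only the 2-letter uppercase codes; the first one is the source language."""
    return [code for code in columns if re.fullmatch(r"[A-Z]{2}", code)]


sample = [
    "#ENCODING=UTF-8",
    "#SUMMARY=Demo Computer",
    "#MULTI",
    "#EN\tFR\tDOMAINS\tUPOS",
]
fields, dict_type, columns = parse_header(sample)
missing = [name for name in ("ENCODING", "SUMMARY") if name not in fields]
print(dict_type, language_columns(columns), missing)
```

The `missing` list covers only the two header fields marked as required in the table; a stricter check could also verify that a type marker and a column line were found.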
Sample formatted text file
The following sample text file is formatted for importing as a User Dictionary into SDM. Note that <TAB> indicates the tab character.
#ENCODING=UTF-8
#AUTHOR=SYSTRAN
#[email protected]
#COVERED DOMAINS=Computers/Data Processing,Perso
#GENERAL DICTIONARY DOMAINS=Computers/Data Processing
#PRIORITY=1
#SUMMARY=Demo Computer
#MULTI
#EN<TAB>FR<TAB>NOTE<TAB>DOMAINS<TAB>PRIORITY<TAB>UPOS
write cycle<TAB>cycle d'écriture<TAB>Note<TAB>1<TAB>noun
write enable<TAB>validation écriture<TAB><TAB><TAB>noun
#DNT
#EN<TAB>NOTE<TAB>DOMAINS
Print 2000<TAB>It is a DNT<TAB>Perso
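In a User Dictionary file, the #DNT line divides the content into a multilingual part and a Do Not Translate part, each with its own column line. The Python sketch below shows one way to separate the two parts before processing them; the function name `split_at_dnt` is hypothetical, and the body lines are a shortened version of a User Dictionary with fewer columns.

```python
def split_at_dnt(body_lines):
    """Split the lines after the #MULTI marker into (multilingual, dnt) parts."""
    lines = [line.rstrip("\r\n") for line in body_lines]
    if "#DNT" in lines:
        index = lines.index("#DNT")
        return lines[:index], lines[index + 1:]
    # No #DNT marker: the file contains multilingual entries only.
    return lines, []


body = [
    "#EN\tFR\tUPOS",
    "write cycle\tcycle d'écriture\tnoun",
    "#DNT",
    "#EN\tNOTE\tDOMAINS",
    "Print 2000\tIt is a DNT\tPerso",
]
multilingual, dnt = split_at_dnt(body)
print(multilingual)
print(dnt)
```

Each part starts with its own column line, so the field count can be checked per part rather than across the whole file.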
The following sample text file is formatted for importing into SDM as a Translation Memory.
#AUTHOR=SYSTRAN
#[email protected]
#ENCODING=UTF-8
#SUMMARY=Demo
#TM
#EN<TAB>FR<TAB>DE
My name is Smith<TAB>Mon nom est Smith<TAB>Mein Name ist Smith
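A Translation Memory file in this layout can also be generated by a script. The following Python sketch is one possible approach under stated assumptions: `build_tm_text` is a hypothetical helper, not an SDM function, and it writes only the header fields shown in this section, with real tab characters in place of <TAB>.

```python
def build_tm_text(summary, languages, segments, author=None):
    """Return the text of a Translation Memory file for the given segments."""
    lines = []
    if author:
        lines.append(f"#AUTHOR={author}")
    lines.append("#ENCODING=UTF-8")
    lines.append(f"#SUMMARY={summary}")
    lines.append("#TM")
    # Column line: source language first, then the target languages.
    lines.append("#" + "\t".join(languages))
    for row in segments:
        # Every entry needs one field per column, even if a field is empty.
        if len(row) != len(languages):
            raise ValueError(f"expected {len(languages)} fields, got {len(row)}: {row!r}")
        lines.append("\t".join(row))
    return "\n".join(lines) + "\n"


text = build_tm_text(
    "Demo",
    ["EN", "FR", "DE"],
    [("My name is Smith", "Mon nom est Smith", "Mein Name ist Smith")],
    author="SYSTRAN",
)
print(text, end="")
```

When saving the result, open the output file with `encoding="utf-8"` so that the file content matches the #ENCODING header.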