Internationalisation in DITA (and how to deal with Japanese index terms)
Companies generate enormous amounts of data and technical documentation. Organising that information and accessing it efficiently is a major challenge for every business. Unfortunately, software suites such as Microsoft Office do not facilitate the management and structuring of large volumes of content.
It is possible to find order in the chaos using a markup language such as XML, which brings us neatly to the subject of this article – DITA.
What is DITA?
DITA (Darwin Information Typing Architecture) is an open standard for authoring and publishing modular documentation. It is the fastest-growing XML standard in the technical documentation industry. Companies are seeking to switch their Word documents and spreadsheets for more advanced alternatives that better support their processing needs. There are a significant number of available solutions, but many organisations are now turning to a DITA-based model.
The DITA architecture organises content into self-contained topics. These must cover a single subject and so should be relatively brief but must be long enough to make sense in isolation. DITA topics are like building blocks that can be organised into DITA maps. The maps can then be published into various output formats, including traditional PDF and HTML and even video. This way of working is referred to as structured writing. The additional benefits of the DITA standard are:
Content reuse:
DITA is designed to enable the reuse of existing content. Individual topics are the foundation of content reuse and can be recycled across and within projects. Content fragments found within those topics such as paragraphs, lists, notes and terms can be recycled by employing specific reference types such as conkeyrefs, conrefs or keyrefs.
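As a minimal sketch of these reference types, the fragment below reuses a warning note and a product name. The file name shared/warnings.dita, the IDs and the keys product-name and shared-warnings are hypothetical and would have to exist in the project and its DITA map.

```xml
<!-- Hypothetical setup: shared/warnings.dita contains a topic with id="warnings"
     holding a note with id="power-off". The keys "product-name" and
     "shared-warnings" are assumed to be defined in the DITA map. -->
<section>
  <title>Before you start</title>
  <!-- conref: reuse the element addressed as file#topic-id/element-id -->
  <note conref="shared/warnings.dita#warnings/power-off"/>
  <!-- keyref: the text is supplied by whatever the map binds to the key -->
  <p>Install <ph keyref="product-name"/> on a clean system.</p>
  <!-- conkeyref: a key stands in for the file, followed by the element id -->
  <note conkeyref="shared-warnings/power-off"/>
</section>
```

Because the key-based variants go through the map, the same topic can pull in different content simply by publishing it with a different map.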
Metadata:
The DITA standard includes extensive metadata options. Conditional text (text which should appear in certain renditions of content but not others) allows content to be published based on defined properties such as audience, platform or product.
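Conditional text could look roughly like the sketch below. The audience values and the wording are invented for illustration. The first fragment sits in a topic, the second is a DITAVAL filter file that a publishing engine can apply at build time.

```xml
<!-- Topic content: the audience values are illustrative assumptions -->
<body>
  <p>Restart the service after changing the settings.</p>
  <p audience="administrator">Administrators can restart the service remotely.</p>
  <p audience="enduser">Ask your administrator to restart the service.</p>
</body>
```

```xml
<!-- DITAVAL filter (hypothetical file): drop administrator-only content -->
<val>
  <prop att="audience" val="administrator" action="exclude"/>
</val>
```

Publishing with this filter would produce an end-user rendition. Swapping the excluded value would produce the administrator rendition from the same source.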
Specialisation:
DITA is an open standard, which enables organisations to extend the specifications of the architecture in accordance with their needs and wishes.
Version control & collaboration:
Authoring tools are generally used in combination with a content management system (CMS). Combining technologies facilitates the management of different versions of content. This approach also ensures that multiple teams can collaborate efficiently and exchange information within the same environment.
Internationalisation in DITA
DITA is great for content creation and is also localisation-friendly. The architecture offers features that simplify the translation process, as long as both content authors and localisation engineers take advantage of them. DITA supports three internationalisation attributes: translate, xml:lang, and dir.
The translate attribute is probably the most versatile of the three. It works at the word, sentence, paragraph or topic level and lets users set an attribute on an element to identify which sections of text should or should not be translated.
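A minimal sketch of the translate attribute follows. The product name and the command are made up for illustration.

```xml
<!-- Illustrative content: "AcmeSync" and the command are hypothetical -->
<p>Start <ph translate="no">AcmeSync</ph> and run the following command:</p>
<!-- Applied to a block element, the attribute protects the whole block -->
<codeblock translate="no">acmesync --init</codeblock>
```

A translation tool that honours the attribute can lock the marked segments so that linguists only see the text that needs translating.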
With xml:lang it is possible to identify the language of the corresponding element content and optionally filter out unnecessary content in the translation management system (TMS).
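For example, a topic can declare its main language on the root element and override it on an inline element. The topic id and the text are illustrative.

```xml
<!-- Illustrative topic: the id and wording are assumptions -->
<topic id="greeting-examples" xml:lang="en-GB">
  <title>Greeting examples</title>
  <body>
    <!-- The nested xml:lang overrides the topic-level language for this phrase -->
    <p>The Japanese greeting <ph xml:lang="ja-JP">こんにちは</ph> means "hello".</p>
  </body>
</topic>
```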
The dir attribute helps processors and publishing engines to render bidirectional text correctly.
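A short sketch of dir on an inline element is shown below. The attribute accepts the values ltr, rtl, lro and rlo, and the sentence itself is only an illustration.

```xml
<!-- Illustrative sentence mixing left-to-right and right-to-left text -->
<p xml:lang="en-GB">The Arabic word for "hello" is
  <ph xml:lang="ar" dir="rtl">مرحبا</ph>.</p>
```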
DITA also enables users to tag terminology. Using the term element, it is possible to identify specialised or technical terms, abbreviations, acronyms and other forms of jargon.
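Tagged terminology might look like this. The sentence is invented for illustration.

```xml
<!-- Illustrative sentence: the tagged terms can be extracted for a glossary or termbase -->
<p>A <term>content reference</term> pulls an element from another topic,
  whereas a <term>key reference</term> is resolved through the map.</p>
```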
Sorting index terms in Japanese
Index entries are handled differently in DITA than in traditional desktop publishing applications. The DITA standard features a dedicated <indexterm> element for terms that should be included in the index at the end of a publication. Index terms are very flexible and can be used freely inside topic bodies without being rendered in the output. However, restraint is advisable, as the free use of index terms can lead to difficulties in maintenance, processing and especially translation. It is best to position index terms in the prolog section of topics, outside the topic body.
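A sketch of index terms placed in the prolog is shown below. The topic id, title and terms are hypothetical. Nesting one <indexterm> inside another creates a secondary index entry.

```xml
<!-- Hypothetical topic: id, title and index terms are illustrative -->
<topic id="installing-the-pump" xml:lang="en-GB">
  <title>Installing the pump</title>
  <prolog>
    <metadata>
      <keywords>
        <!-- Primary entry "pump" with the secondary entry "installation" -->
        <indexterm>pump<indexterm>installation</indexterm></indexterm>
        <indexterm>maintenance</indexterm>
      </keywords>
    </metadata>
  </prolog>
  <body>
    <p>Mount the pump on a level surface before connecting the hoses.</p>
  </body>
</topic>
```

Keeping the terms together in the prolog means translators see them as one block, separate from the running text.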
An alphabetical index is then derived from all the index terms that the content author uses throughout the different topics in a project.
When translating DITA, sorting index entries can be cumbersome depending on the target language(s). Certain languages have specific ordering conventions and so require human intervention. In Japanese, words that are entirely or partially written in Kanji must be sorted in phonetic order, in accordance with their Hiragana/Katakana rendition. There is no reliable automated process for ordering written Kanji phonetically. The only way to sort Japanese index terms correctly is to store the phonetic counterparts with the written versions. These can then act as the source terms for sorting. To aid this process, the DITA element repository includes a dedicated <index-sort-as> element.
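A minimal sketch of Japanese index terms with their phonetic sort keys follows. The three terms are chosen purely for illustration. Each <index-sort-as> holds the Hiragana reading of the Kanji term it belongs to.

```xml
<!-- Illustrative terms: the Kanji is what appears in the index,
     the Hiragana in <index-sort-as> is what the processor sorts on -->
<keywords>
  <indexterm>索引<index-sort-as>さくいん</index-sort-as></indexterm>
  <indexterm>翻訳<index-sort-as>ほんやく</index-sort-as></indexterm>
  <indexterm>漢字<index-sort-as>かんじ</index-sort-as></indexterm>
</keywords>
```

A processor that honours the sort keys would list the entries in the order 漢字 (かんじ), 索引 (さくいん), 翻訳 (ほんやく), while still printing the Kanji forms in the index.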
Unfortunately, <index-sort-as> elements are usually not available in source DITA topics. This is because they are often considered to be localisation-specific features, particularly when Japanese is only one of many target languages. Hence, the elements must be added in the translation phase, which is challenging for engineers. Localisation engineers must ensure that the <index-sort-as> elements are added to the source DITA topics in the right places. They must then parse them correctly in the TMS and provide linguists with accurate guidelines.
Conclusion
The open DITA standard is the fastest-growing XML standard for modular documentation in the technical writing industry. The topic-based architecture enables content authors to save both money and time while creating, managing and publishing content. The DITA standard includes a number of features that help to streamline the localisation process. Nonetheless, obstacles can still arise when translating DITA, depending on the target languages.
In Japanese, for example, there are approximately 2,000 Kanji characters that cannot be used directly to sort index terms into phonetic order. Localisation engineers are required to tackle this challenge while avoiding any impact on the quality of the completed index.