
Pre-translation checklist


Many documents that are sent for translation are formatting nightmares.

There are two main sources of errors.  Writers who don't know how to use their word processing program are the biggest one.

For example, a user might create a “table” or column layout on a page by using lots of tab characters and spaces. If the text is translated as-is, the spacing will almost certainly be ruined. The user may also have broken apart sentences and text strings that should have remained together, and this will not be obvious while translating in a TM program. For these reasons, it's almost always worth spending some time before translation to convert these sections into a well-structured Word table.
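As a rough illustration of the idea (not a replacement for fixing the layout in Word), here is a minimal Python sketch that recovers rows and cells from a tab-and-space "table" saved as plain text. It assumes that a cell boundary is either a single tab or any run of two or more spaces and tabs; the function name and the sample text are made up for the example.

```python
import re

# Assumption: a cell boundary is a run of two or more spaces/tabs, or a single tab.
# A single space is treated as ordinary text inside a cell.
CELL_GAP = re.compile(r"[ \t]{2,}|\t")

def split_pseudo_table(text):
    """Split each non-empty line of a tab/space-aligned layout into a list of cells."""
    rows = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines between rows
        rows.append([cell.strip() for cell in CELL_GAP.split(line)])
    return rows

# Hypothetical sample: columns lined up with a mix of tabs and spaces.
sample = "Part no.\t\tDescription      Qty\nA-100\tPump housing\t\t  2\n"
for row in split_pseudo_table(sample):
    print(row)
```

The resulting rows still need to be checked by eye, because a line that was wrapped by hand in the original will show up as a separate row.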

The other main source of problems is PDF conversion. Many source documents come to us in PDF format. It's often best to use a program such as FineReader to convert them to a word-processing format, such as Microsoft Word. The conversion is a great time saver, but it also introduces errors.

Most documents are not plain text, but include formatting such as bold, italic, a table of contents or a page number. Déjà Vu (DV) handles this information by converting it to codes, displayed as {001}. Many of those codes are necessary and unavoidable. But users and PDF converters can each introduce “rogue codes”. Whether intended or not, those codes prevent TM programs from recognizing “exact matches”, or otherwise require more time to handle during translation. It's best to reduce those codes to a minimum. Some of the most common ways to do so are addressed in the list below.
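To get a feel for where the codes pile up, something like the following Python sketch could be run on segments exported as plain text. It assumes the codes appear in the export exactly as they are displayed, that is, as digits inside braces such as {001}; the function names are hypothetical and this is not a Déjà Vu feature.

```python
import re

# Assumption: every code is written as "{" + digits + "}", for example {001}.
CODE = re.compile(r"\{\d+\}")

def count_codes(segment):
    """Count the code placeholders in one segment."""
    return len(CODE.findall(segment))

def heaviest_segments(segments, limit=3):
    """Return (count, segment) pairs for the segments that contain the most codes."""
    counted = [(count_codes(s), s) for s in segments]
    counted = [pair for pair in counted if pair[0] > 0]  # ignore code-free segments
    return sorted(counted, key=lambda pair: pair[0], reverse=True)[:limit]
```

Segments near the top of that list are the first candidates for cleanup in the source document.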

Whatever the source, almost every document requires cleanup before I import it into my translation memory program. Here’s a partial checklist:

  1. Fix irregular character spacing and font sizes. I frequently select the entire document and set the character spacing (scale, spacing, and position) to the normal values. (However, this doesn't always reset the spacing of every character.)
  2. Look for misuse of multiple spaces and/or tab characters. I always search for all instances of two or more spaces, tabs, or any combination. That helps to find the following:
  3. Search for all sequences of two (sometimes three) paragraph marks. Users often force a new page by pressing the <Enter> key 20 times. Replace these instances with a manual page break.
  4. I usually set the paragraph spacing to insert one line before or after each paragraph, using a value of “1 line” instead of a value in pts (points). Delete most multiple paragraph marks.
  5. Make sure that headings and subheadings are set to “Keep with next”, so that they stay on the same page as the following text. The previous item helps a great deal with this.
  6. Just a preference really, but turn on “Widow/orphan control”.
  7. Search for spaces before these marks: . , ; : ? ! (will vary by language)
  8. Most of the time, I convert floating objects to inline shapes using Miri Orfek's macro, “Images out/in”. In my opinion they're easier to control that way. Also, too many images can make the import into DV take a long time.
  9. Convert manually created Tables of Contents to automatic versions.
  10. Handle Tab characters, as described here.
  11. I have found that it's best to use only straight quotation marks in my TM databases. This macro converts single and double quotation marks to straight characters (' ") instead of “smart” quotes (‘ ’ “ ”), and changes the Word settings that control them. After translation, the macro can change them back to “smart” quotes. (On my system, the macro is run with Left-Alt-'.)
  12. Abbyy FineReader has the habit of randomly applying Bold and sometimes Italic formatting to periods (full stops) and spaces.
  13. A PDF document may have a header or footer on each page. By default, FineReader and most similar programs will simply add these bits of text into the main body of the document. Instead, fix the Word document by creating a proper header or footer. Include the codes for page numbers and total pages.
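Several of the searches in this checklist (multiple spaces or tabs, repeated paragraph marks, spaces before punctuation) and the quote conversion can be tried out on a plain-text copy of the document. The Python sketch below is only an illustration under stated assumptions: each paragraph mark is a single newline in the text file, the punctuation set is the one listed above, and the check names and function names are invented here. It is not the Word macro mentioned in the list.

```python
import re

# Hypothetical check names; the patterns follow the searches described in the checklist.
CHECKS = {
    "multiple spaces/tabs": re.compile(r"[ \t]{2,}"),
    # Assumption: one paragraph mark equals one "\n" in the plain-text copy.
    "repeated paragraph marks": re.compile(r"\n{2,}"),
    # The punctuation set will vary by language.
    "space before punctuation": re.compile(r"[ \t]+[.,;:?!]"),
}

def find_problems(text):
    """Return (check name, offset, matched text) tuples, sorted by position."""
    hits = []
    for name, pattern in CHECKS.items():
        for match in pattern.finditer(text):
            hits.append((name, match.start(), match.group()))
    return sorted(hits, key=lambda hit: hit[1])

# Map the four "smart" quote characters to their straight equivalents.
QUOTE_MAP = str.maketrans({
    "\u2018": "'", "\u2019": "'",   # single curly quotes
    "\u201c": '"', "\u201d": '"',   # double curly quotes
})

def straighten_quotes(text):
    """Replace curly single and double quotation marks with straight ones."""
    return text.translate(QUOTE_MAP)

if __name__ == "__main__":
    sample = "Hello  world .\n\n\nNext \u201cpage\u201d"
    for name, offset, found in find_problems(sample):
        print(name, offset, repr(found))
    print(straighten_quotes(sample))
```

Converting straight quotes back to “smart” ones after translation is harder, because the direction of each mark depends on its context, so that step is deliberately left out of the sketch.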

After importing into Déjà Vu, if any other pattern of codes becomes apparent, I frequently fix the source document and reimport.


Updated: May 23, 2010
Back to Steven Marzuola's Tips
Home: www.techlanguage.com