GTxcel will extract article text from the PDFs that you supply to create an HTML version of your articles. The purpose of this process is to provide a responsive, easy-to-read version of your articles for your users, particularly those on mobile devices.
*Note: This process is largely automated. How the PDF is created, including layering and embedded fonts, will affect the outcome of the extraction. Also, the design needs of a highly stylized print layout tend to differ from those of a pure article-text layout. For these reasons, you will need to review the articles and correct any abnormalities or design elements that the automated extraction process did not render to suit your needs.
Article Extractions include:
- All images in an article, including their caption text.
- Tables, charts, graphs, and figures.
- The embedded text of the PDF you provide, extracted so that all text appears exactly as it was created in the PDF.
- (Optional) Full-page and fractional ads.
Keep the following items in mind when creating your PDFs, because how a PDF is created directly affects the quality of the article extraction output.
- Embedded Text and Fonts – All text should be embedded text, and all fonts should be embedded in the PDF. Text that is an image, such as a vector graphic or bitmap graphic, will not be extracted as text.
- Line breaks – Use line breaks to end a paragraph or to force a line to end at a desired point. If line breaks are not used, text that was intended to display separately may run together. Common cases where line breaks matter include lists and poems.
- Layers – If layers are used, the text layer should be the top layer.
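
To illustrate the line breaks item in the list above, here is a minimal Python sketch. It is not GTxcel's extraction code. `split_paragraphs` is a hypothetical helper, and it assumes that extracted text is divided only where an explicit line break exists in the source.

```python
from typing import List


def split_paragraphs(extracted_text: str) -> List[str]:
    """Hypothetical helper: split text at explicit line breaks, dropping blank lines."""
    return [line.strip() for line in extracted_text.splitlines() if line.strip()]


# A two-line poem where each line ends with an explicit line break.
with_breaks = "Roses are red,\nViolets are blue."

# The same poem where the lines only wrapped visually in the layout,
# so no line break exists in the underlying text.
without_breaks = "Roses are red, Violets are blue."

print(split_paragraphs(with_breaks))     # two separate lines
print(split_paragraphs(without_breaks))  # one run-together line
```

Under this assumption, the version without explicit line breaks comes out as a single string, which is the "run together" effect described in the list above.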