Note: The text extraction process gives publishers a head start on creating the article text view, but it is not precise. How the PDF is created, including its layering and embedded fonts, affects the outcome of the article extraction. Design needs also tend to differ between a highly stylized print layout and a pure article text layout. For these reasons, publishers will need to review the extracted articles and correct any abnormalities or design elements that the automated extraction process did not produce to suit their particular needs.
Article Extractions include:
- All images in an article, including their caption text
- Tables, charts, graphs, and figures
- The embedded text of the PDF you provide, extracted so that all text appears exactly as it was created in the PDF
Here are several items to keep in mind when creating your PDFs. How the PDF is created will directly affect the quality of the article extraction output.
- Embedded Text and Fonts – All text should be embedded text, and all fonts should be embedded in the PDF. Text that is part of an image, such as a vector or bitmap graphic, will not be extracted as text.
- Line breaks – Use line breaks to end a paragraph or to force the end of a line at a desired point. Without line breaks, text that was intended to display separately may run together. Common cases where line breaks matter include lists and poems.
- Layers – If layers are used, the text layer should be the top layer.
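To illustrate the line-break item above, here is a minimal Python sketch of why missing line breaks cause text to run together. It is an illustration only and does not represent how the extraction process itself works; the function name `split_into_lines` and the sample strings are hypothetical.

```python
def split_into_lines(extracted_text: str) -> list:
    """Split text into display lines at explicit line breaks.

    Hypothetical helper for illustration; not part of any extraction tool.
    """
    # str.splitlines() breaks on \n, \r\n, and other line boundaries.
    # Blank lines are dropped and surrounding whitespace is trimmed.
    return [line.strip() for line in extracted_text.splitlines() if line.strip()]


# A short list in which each item ends with a line break.
with_breaks = "Ingredients:\nFlour\nSugar\nButter"

# The same content without line breaks: nothing marks where an item ends.
without_breaks = "Ingredients: Flour Sugar Butter"

lines_with = split_into_lines(with_breaks)
lines_without = split_into_lines(without_breaks)

print(lines_with)     # four separate entries
print(lines_without)  # a single run-together entry
```

With line breaks, the heading and each item come out as separate entries. Without them, the whole list comes out as one line, which is the kind of result a publisher would need to correct by hand during review.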