You’re the lucky developer who got assigned the PDF report project? You probably quickly discovered it's more complicated than building normal web pages. So which HTML to PDF converter should you use? How do you compare them?
Some elements of creating PDFs are simple, while other elements are complex, tricky, slow, or even impossible in some HTML to PDF conversion libraries. But no worries, this article will show which document elements you should test first.
You have many options, from open-source HTML to PDF libraries to online HTML to PDF APIs to commercial libraries overseen by the inventor of CSS. But they all vary in their ability to create advanced PDFs and in their conversion speed.
This article will guide your HTML-to-PDF library comparison process, saving you days of trial and error. Changing converters later can be painful, as libraries often use different CSS and HTML to accomplish the same result. It’s better to make the right choice the first time!
The core problem is that the original HTML and CSS standards and web browsers focus on building a long, continuously scrolling web page—not making documents or books with many smaller pages. You’ll see this core difference between browser engines and PDF requirements pop up again and again throughout this article.
There are now some excellent proposed and draft standards for creating paged media with CSS (such as PDFs or presentations). Still, there’s not a lot of momentum within the browser development teams for implementing these standards.
Additionally, many PDF creation libraries are focused on creating printed materials and books, where conversion speed matters less than it does for web-based reports.
There are elements of PDF generation that are simple and hence pose no performance issues, but others are complex and problematic from a performance standpoint. Let's take a closer look at these tricky PDF generation elements. This way, you can be better prepared to address possible performance problems.
Most libraries allow for CSS-defined page breaks, but their ability to define page breaks automatically differs. Document elements such as tables, images, floating elements, and footnotes complicate where the page should break and how much whitespace should be left on the page. Some libraries will struggle with documents containing multiple pages and anything other than plain text.
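One way to compare automatic page breaking is to feed every candidate converter the same stress-test document. The sketch below is only an illustration: the function name, class name, and row counts are hypothetical, and the CSS uses the standard `break-before` and `break-inside` properties, which not every library honors.

```python
from html import escape

# CSS under test: force a new page before each chapter heading and ask the
# converter not to split table rows or figures across pages.
PAGE_BREAK_CSS = """
h2.chapter { break-before: page; }
tr, figure { break-inside: avoid; }
"""


def build_page_break_fixture(chapters: int = 3, rows: int = 200) -> str:
    """Return an HTML document long enough to force pagination."""
    body = []
    for c in range(1, chapters + 1):
        body.append(f'<h2 class="chapter">Chapter {c}</h2>')
        body.append("<table><thead><tr><th>Item</th><th>Value</th></tr></thead><tbody>")
        for r in range(1, rows + 1):
            body.append(f"<tr><td>{escape(f'Row {r}')}</td><td>{r * c}</td></tr>")
        body.append("</tbody></table>")
    return (
        "<!DOCTYPE html><html><head><meta charset='utf-8'>"
        f"<style>{PAGE_BREAK_CSS}</style></head><body>"
        + "".join(body)
        + "</body></html>"
    )


if __name__ == "__main__":
    # Write the fixture to disk so it can be passed to each converter in turn.
    with open("page_break_fixture.html", "w", encoding="utf-8") as fh:
        fh.write(build_page_break_fixture())
```

Run the resulting file through each library and check where the long tables break, and how long the conversion takes.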
You’re probably familiar with books and reports where the title page is styled differently from the content pages, or where left pages have different margins than right pages. Some PDFs use landscape layouts for tables and portrait layouts for text.
The CSS Paged Media specifications let you define different styles, layouts, and sizes through named pages. Unfortunately, browser engines have neglected to implement these specifications, so you cannot use them in many open-source HTML-to-PDF converters.
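To check whether a converter supports these rules, you can generate a small stylesheet that exercises each of them. The following is a minimal sketch: the helper `page_rule`, the page name `wide`, and the margin values are assumptions chosen for illustration, while `@page`, `:first`, `:left`, `:right`, `size`, and the `page` property come from the CSS Paged Media drafts.

```python
from typing import Mapping


def page_rule(selector: str, declarations: Mapping[str, str]) -> str:
    """Format one @page rule; an empty selector targets every page."""
    body = " ".join(f"{prop}: {value};" for prop, value in declarations.items())
    head = f"@page {selector}".strip()
    return f"{head} {{ {body} }}"


def named_pages_css() -> str:
    """Build a stylesheet with first/left/right pages and a landscape named page."""
    rules = [
        page_rule("", {"size": "A4 portrait", "margin": "20mm"}),
        page_rule(":first", {"margin-top": "60mm"}),          # title page
        page_rule(":left", {"margin-left": "30mm", "margin-right": "20mm"}),
        page_rule(":right", {"margin-left": "20mm", "margin-right": "30mm"}),
        page_rule("wide", {"size": "A4 landscape"}),          # named page
        ".wide-table { page: wide; }",                         # opt-in per element
    ]
    return "\n".join(rules)


if __name__ == "__main__":
    print(named_pages_css())
```

If a converter ignores the `wide` page, the table tagged with `.wide-table` will stay in portrait orientation, which is an easy failure to spot in the output.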
Accessible PDFs, or "tagged" PDFs, contain hidden tags that describe the document to screen readers and other assistive devices. They're similar to HTML's alt attributes. Only some PDF generators support tagging your content, and those that do offer differing levels of support for automatic tagging.
A small, simple table is supported by almost every converter. Problems can occur when your table spans multiple pages or columns, has captions, is floated, contains images or charts, uses row or column spans, or is incredibly long. Many engines—particularly those that are open-source and non-browser-based—are unable to handle these complex calculations or consume too much memory or time.
DocRaptor has advanced table support, including defining row and column spans through CSS and selecting when and where table captions appear.
Though CSS columns have been well supported by browsers for years, some libraries still struggle with them. They can be especially problematic when the columns contain figures and tables or cross multiple pages.
Floating content left or right is commonplace in web development, but some libraries struggle with correctly selecting the page break location when a float is involved.
Often, you may want to float an object to the top or bottom of a page or even to the inside or outside of a page spread. This is only possible in the most advanced rendering engines such as DocRaptor, Prince, or PDFreactor.
Almost all libraries offer basic support for raster images and most support vector (SVG) graphics. Some libraries will convert vector images to raster images, which destroys their ability to smoothly scale up or down. Other libraries may offer more advanced image quality options, letting you fine-tune the output size of your document.
Despite being exceedingly common in publishing, footnotes are one of the largest areas of differentiation between HTML to PDF libraries. Most conversion engines struggle to support even simple page-level footnotes, as this concept is completely unfamiliar to web browsers and standard CSS.
CSS-based footnote support was defined in the CSS Generated Content for Paged Media Module specification draft, but support is limited to more advanced HTML-to-PDF converters. These specifications allow you to use CSS to define how and where footnotes should be displayed.
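A quick footnote test might look like the sketch below. The class name `fn` and the helper functions are hypothetical; the `float: footnote` value, the `::footnote-call` and `::footnote-marker` pseudo-elements, and the `@footnote` area are taken from the CSS Generated Content for Paged Media draft, so expect them to work only in converters that implement that draft.

```python
from html import escape

# Footnote styling as described in the CSS GCPM draft. Converters without
# footnote support will typically leave the note inline in the paragraph.
FOOTNOTE_CSS = """
.fn { float: footnote; }
.fn::footnote-call { content: counter(footnote); vertical-align: super; font-size: 0.8em; }
.fn::footnote-marker { content: counter(footnote) ". "; }
@page { @footnote { border-top: 0.5pt solid black; padding-top: 4pt; } }
"""


def footnote(note: str) -> str:
    """Wrap note text in the element that the CSS moves to the page footer."""
    return f'<span class="fn">{escape(note)}</span>'


def build_footnote_fixture() -> str:
    """Return a one-paragraph HTML document containing a single footnote."""
    paragraph = (
        "<p>Quarterly revenue grew by 12%"
        + footnote("Unaudited figure, for illustration only.")
        + " compared with the prior quarter.</p>"
    )
    return (
        "<!DOCTYPE html><html><head><meta charset='utf-8'>"
        f"<style>{FOOTNOTE_CSS}</style></head><body>{paragraph}</body></html>"
    )


if __name__ == "__main__":
    print(build_footnote_fixture())
```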
Most libraries support basic page and chapter number counters. Some offer limited styling and placement options, while others allow you to insert the page counters wherever you want through CSS.
Some libraries also support more advanced versions of generated content, such as cross-references. These let you create links such as "See Page 38 for More Details", where the "38" is automatically generated.
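Page counters and cross-references can be tested together with a fixture like the one sketched here. The helper names and the `xref` class are hypothetical. The `counter(page)` and `counter(pages)` values, the `@bottom-center` margin box, and `target-counter()` come from the CSS paged media and generated content drafts, and support for them varies by converter.

```python
# A running "Page X of Y" footer plus an automatically generated page
# number appended to every cross-reference link.
COUNTER_CSS = """
@page {
  @bottom-center { content: "Page " counter(page) " of " counter(pages); }
}
a.xref::after { content: " (page " target-counter(attr(href), page) ")"; }
"""


def section(anchor_id: str, title: str, text: str) -> str:
    """Return a section whose heading can be the target of a cross-reference."""
    return f'<h2 id="{anchor_id}">{title}</h2><p>{text}</p>'


def xref(anchor_id: str, label: str) -> str:
    """Return a link that the CSS decorates with the target's page number."""
    return f'<a class="xref" href="#{anchor_id}">{label}</a>'


def build_counter_fixture() -> str:
    """Return an HTML document with one section and one reference to it."""
    body = (
        "<p>For more details, " + xref("pricing", "see the pricing section") + ".</p>"
        + section("pricing", "Pricing", "Details go here.")
    )
    return (
        "<!DOCTYPE html><html><head><meta charset='utf-8'>"
        f"<style>{COUNTER_CSS}</style></head><body>{body}</body></html>"
    )


if __name__ == "__main__":
    print(build_counter_fixture())
```

To make the test meaningful, pad the document so the target section lands on a later page than the link.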
When your document is intended to be printed, it's critical to match the colors to your specifications in your printer's desired format. Some libraries support only RGB documents, but most offer at least CMYK support. More advanced generators allow you to define a specific ICC profile for your document.
Some advanced libraries also allow you to define printer's marks through CSS, which is helpful for documents intended to be trimmed by the printer.
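As a rough illustration, the sketch below builds a print-oriented `@page` rule. The function name and default values are assumptions. The `marks` and `bleed` properties are defined in the CSS Paged Media drafts, but individual converters may ignore them or use vendor-prefixed equivalents, so check your library's documentation.

```python
from typing import Sequence

ALLOWED_MARKS = {"crop", "cross"}  # values defined for the CSS `marks` property


def print_page_css(
    size: str = "A4",
    bleed_mm: int = 3,
    marks: Sequence[str] = ("crop", "cross"),
) -> str:
    """Return an @page rule that requests a bleed area and printer's marks."""
    unknown = set(marks) - ALLOWED_MARKS
    if unknown:
        raise ValueError(f"unsupported marks: {sorted(unknown)}")
    marks_value = " ".join(marks) if marks else "none"
    return f"@page {{ size: {size}; bleed: {bleed_mm}mm; marks: {marks_value}; }}"


if __name__ == "__main__":
    print(print_page_css())
```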
Many watermarks are intended to be on every page, and this can be difficult to accomplish with browser-based engines. A workaround can normally be found, but watermarks will be easier in any commercial HTML-to-PDF generator or most non-browser-based open-source libraries.
We hope this guide has illuminated the areas of HTML to PDF conversion complexity. If none of these elements apply to your document, you can easily use any available library or API. But if your document contains these elements, we recommend that you be more selective and focus your initial implementation testing on these elements.
You may find that you need a more powerful HTML-to-PDF generator. That's the situation we found ourselves in over a decade ago and why we built the DocRaptor HTML-to-PDF API. It's helped thousands of developers and could help you too!