Hand-Writing EPUBs
Page last updated , CES
Intro
I've just finished converting Activist Study (ARAK) from PDF to EPUB. It took me 3 days of constant work, copying 160 pages paragraph by paragraph and formatting them carefully to match the original.
Calibre can rip the text out of a PDF in seconds, but it doesn't do it very nicely. In the best case, you will have to remove double spaces, tabs at the start of paragraphs, some line-break hyphens in the middle of words, and so on, but you can do this quite quickly with the find-and-replace tools in LibreOffice Writer and the Advanced Find and Replace plugin.
The EPUB generated by Calibre from the ARAK PDF was a garbage fire, however. Absolutely unreadable: paragraphs spliced in between each other; page headers, footers, and page numbers sprinkled into the body text; and the second half of the entire book bolded for some reason. So, using Sigil and VSCode, I had to sort through the text in the garbled EPUB and copy it into a nice, clean, new EPUB, in effect writing it from scratch.
This, by far, is the worst way to spend your time. It exhausted me. But I will teach you what I've learned, because it's nice to have EPUBs.
Also have a look at:
- How to Create EPUBs: a very useful (and pretty comprehensive) guide to using Calibre and LibreOffice to generate EPUBs. This is the best way to make EPUBs from PDFs; the rest of my document is about what you need to know when that doesn't work. Arguably, read How to Create EPUBs first.
- Installing the tools for making EPUBs
- R. Frobnitz, Converting PDF Books to EPUB with Calibre and LibreOffice
EPUB File Format
EPUBs are a bundle of HTML and CSS files, media assets (images, or even audio and video files), and metadata compressed into a .zip.
You can see inside an .epub file by renaming the file extension from .epub to .zip, and then unpacking that file. Make sure you have file extensions visible, otherwise you'll have to change the file extension by right-clicking on the file, going to Properties, and renaming it from there.
Inside are the HTML files, CSS files, and folders for assets and some of the metadata.
There is also the content.opf file, which is a manifest of all the files in the EPUB and also sets the order in which the files should be displayed, and toc.ncx, which is the contents page.
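For orientation, the piece of metadata an e-Reader looks at first is META-INF/container.xml, which simply points to the content.opf. A minimal sketch, assuming the OPF sits in a folder called OEBPS (Sigil's usual layout; Calibre-generated EPUBs often keep it at the top level instead, so the path here is only an example):

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!-- META-INF/container.xml: tells the e-Reader where content.opf lives -->
<container version="1.0" xmlns="urn:oasis:names:tc:opendocument:xmlns:container">
  <rootfiles>
    <!-- full-path is relative to the top of the unpacked EPUB; OEBPS/ is an assumed folder name -->
    <rootfile full-path="OEBPS/content.opf" media-type="application/oebps-package+xml"/>
  </rootfiles>
</container>
```

Sigil and Calibre both write this file for you, so it is something to recognise rather than something to type out.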
HTML, CSS
EPUBs are kind of like simple little websites. HTML (hypertext markup language) and CSS (cascading style sheets) are the files that make up websites. HTML contains all text and instructions for the arrangement of things on the page, and CSS is used to style them all.
Don't be put off by HTML and CSS. They're both messy looking, but very simple to use and understand once you get the hang of it.
I can't (read: won't) go into detail about them here. Read the W3Schools guides on both HTML and CSS if you'd like to learn how to use them. W3Schools has the best guides available for learning HTML and CSS, and also a variety of tutorials for programming languages.
All you really need to know is that when an e-Reader device opens an EPUB, it will load all the HTML files contained in the .epub file in the order specified in the <spine> part of the content.opf file. More on that later.
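To make the manifest and <spine> a bit more concrete, here is a rough sketch of a small content.opf. The file names, IDs, title, and identifier are all placeholders, and the exact metadata Sigil or Calibre writes will differ; the point is only how <manifest> lists the files and <spine> puts them in reading order.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<package xmlns="http://www.idpf.org/2007/opf" version="3.0" unique-identifier="BookId">
  <metadata xmlns:dc="http://purl.org/dc/elements/1.1/">
    <!-- placeholder values, not a real book -->
    <dc:identifier id="BookId">urn:uuid:00000000-0000-0000-0000-000000000000</dc:identifier>
    <dc:title>Example Book</dc:title>
    <dc:language>en</dc:language>
    <meta property="dcterms:modified">2024-01-01T00:00:00Z</meta>
  </metadata>
  <!-- manifest: every file in the EPUB gets one <item> with a unique id -->
  <manifest>
    <item id="nav" href="nav.xhtml" media-type="application/xhtml+xml" properties="nav"/>
    <item id="ncx" href="toc.ncx" media-type="application/x-dtbncx+xml"/>
    <item id="chap1" href="chap1.xhtml" media-type="application/xhtml+xml"/>
    <item id="chap2" href="chap2.xhtml" media-type="application/xhtml+xml"/>
    <item id="css" href="style.css" media-type="text/css"/>
  </manifest>
  <!-- spine: reading order, each itemref pointing back at a manifest id -->
  <spine toc="ncx">
    <itemref idref="chap1"/>
    <itemref idref="chap2"/>
  </spine>
</package>
```

Swapping the two <itemref> lines would make the e-Reader show chap2.xhtml before chap1.xhtml, without touching the files themselves.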
Opening and Editing EPUBs
Two programs: Calibre and Sigil, though you can open HTML files in almost any text editor.
Calibre is an e-book library and reader that can generate EPUBs from PDF files.
Sigil is a proper EPUB editor. You can open EPUBs and see and edit the HTML and CSS, re-order, add and remove pages, edit metadata, and auto-generate the table of contents, indexes, and nearly all the stuff you might ever want in an EPUB. When you're editing the HTML or CSS, you can also see a preview of how the EPUB will look.
e-Reader
The HTML files are usually split into the chapters or sections of the "book". One HTML file, in the <body> section, will contain all the text in the chapter organised by paragraph, its headings and subheadings, all neatly in order and easily readable by both you and machines.
When the e-Reader opens the chapter, it will display as much text as it can fit on the screen, and then break the page (end it). The user turns the page and the text continues. When writing HTML documents for an EPUB, you don't specify where the page will break, because you don't know where it will break. Users can set the text size and font, line width, etc., and they will all have different sized screens. It's the e-Reader that decides where the page breaks.
You do decide where the chapter ends, though, by starting a new HTML file. When the chapter ends, the e-Reader will display blank space below the last line, just like a real book does, so that the heading for the next chapter can begin at the top of the next page.
e-Readers apply a lot of their own CSS when an EPUB is read. (If you didn't learn about CSS on W3Schools, styling means margin sizes, font type, colour, and size, line width, centering and justification of text, and so on.) So to avoid things breaking or looking messy, it's best to use as little CSS as possible.
Structure
To learn how the files in an EPUB are structured, I opened the Calibre manual in Sigil and copied how it was laid out, cross-referencing it with the EPUB that Calibre had generated.
In short, the HTML files were very barebones. The <head> section held only the <title> and a link to the stylesheet (which I actually never really needed). The <body> section contained only headings, paragraphs, text formatting elements, and some spans for when I didn't want the margins above and below the text that a paragraph would have.
I used XHTML files, which were what Calibre generated, so I just copied the !DOCTYPE and the lines around about there into new XHTML files.
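XHTML is "HTML which conforms to XML standards". In practice that means stricter syntax: every tag has to be closed (so <hr/> and <br/>, not <hr> and <br>), tags have to be properly nested, and attribute values have to be quoted. The payoff is apparently better machine readability, which I guess is what you want for an e-Reader. Structurally it meant nothing for me except for some extra lines at the very beginning of the file.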
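As an illustration of how barebones such a file can be, here is a sketch of one chapter as XHTML. It assumes an EPUB 3 style file with the short <!DOCTYPE html>; the lines Calibre generates at the top of its files may look different (older EPUBs use a longer DOCTYPE), and the title, text, and style.css name are made up.

```xml
<?xml version="1.0" encoding="utf-8"?>
<!DOCTYPE html>
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
  <title>Chapter 1</title>
  <!-- stylesheet link is optional; style.css is a placeholder name -->
  <link rel="stylesheet" type="text/css" href="style.css"/>
</head>
<body>
  <h1>Chapter 1</h1>
  <p>First paragraph of the chapter.</p>
  <p>Second paragraph, with some <strong>bold</strong> text in it.</p>
</body>
</html>
```

Everything above <body> can be copied unchanged from chapter to chapter, apart from the <title>.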
Workflow
When I was writing out ARAK, I had the garbage fire EPUB that Calibre had generated open in Sigil, and copied the text paragraph by paragraph, checking it against the PDF open in Firefox, into nice, clean, new XHTML files in Visual Studio Code.
Visual Studio Code is free and probably the easiest thing to write HTML in. I'd recommend it over Sigil for the bulk of HTML editing, because VSCode has some features like autocompletion of annoying HTML syntax and easier tabbing than Sigil.
When I finished a chapter I imported the XHTML file made by VSCode into Sigil by right-clicking on the document tree on the left hand side and hitting "add external page". Sigil works on its own copy, so any changes made to that file in Sigil don't get written back to the original.
I could then view the EPUB every so often by saving it (Sigil saves ready-compiled EPUBs) and opening the file in Calibre. When I made changes to it in Sigil again, I had to delete the EPUB in Calibre and import the new one, because Calibre keeps its own copy of every EPUB in its library directory, the one you chose when you were installing it.
Accessibility
I don't know as much about accessibility as I really should, especially about HTML, so you may have to search for most of this information yourself.
In short, HTML encourages you to declare exactly what an element on a page is using its syntax, called tags, which you can find a complete reference to at W3Schools. The tags allow both screen readers and other machines to correctly interpret the page and its content, and making sure you format the HTML properly is important for readability and compatibility with e-Readers.
For example, there are lots of different tags that cause text to be displayed in italics, but they're all for different purposes: <cite> is for the name of a work or another text, <em> is emphasis and will cause screen readers to speak with added stress, and <i> is for anything that doesn't fit into any of the other italics tags.
In the end, it is a trial and error process. Some e-Readers will display an EPUB exactly how you wanted, and some will garble it. Testing is the most important process for accessibility. You should make sure that the EPUB is both readable on whatever e-Readers you have to hand, and also that it works fine with screen readers.
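A small made-up paragraph shows the three italics tags side by side. All three will usually look the same on screen; the difference is in what they tell a screen reader or other machine.

```xml
<!-- cite: title of a work; em: stressed emphasis; i: other conventional italics, e.g. a foreign phrase -->
<p>As <cite>An Example Book Title</cite> puts it, you should <em>never</em> skip
the testing step, even for an <i>ad hoc</i> conversion.</p>
```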
Creative License
EPUBs, and eBooks generally, are not just digitised books. That's what PDFs are. EPUBs fundamentally work differently, because they are written in HTML.
For example, you can't put footnotes at the bottom of the e-Reader screen on the page that they reference, as they would be in a paper book or PDF. I put them instead at the end of the chapter, below a horizontal rule <hr/>.
Generally, you cannot have footers, and you cannot leave whitespace in the middle of a page without specifically declaring the size of the space in px or em/rem. EPUBs are extremely basic, like an ancient website.
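As a sketch of what declaring the size of a gap can look like, the snippet below puts the space on the paragraph that follows it, using an inline style attribute. The 3em value is arbitrary, and how faithfully an e-Reader honours it is something to test rather than assume.

```xml
<p>Last paragraph before the gap.</p>
<!-- margin-top in em scales with the reader's chosen font size; 3em is an example value -->
<p style="margin-top: 3em">First paragraph after the gap.</p>
```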
In EPUBs you are able to hyperlink between different parts of the book. The table of contents usually has hyperlinks by default, but you can add them anywhere with <a>, for example, to link between footnotes and their citation marks. You would give the footnote an ID, like id="footnote1", and then have the citation mark wrapped in a link, like:
<a href="#footnote1"><sup>1</sup></a>
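Putting both halves together, a chapter with one footnote might look roughly like this. The id="ref1" on the citation mark and the "Back to text" link are additions for illustration (they let the reader jump back up), and the wording is invented.

```xml
<p>The original claim has been disputed.<a id="ref1" href="#footnote1"><sup>1</sup></a></p>

<!-- ...rest of the chapter... -->

<hr/>
<!-- the footnote carries the id that the citation mark links to -->
<p id="footnote1">1. Details of the dispute go here. <a href="#ref1">Back to text</a></p>
```

If the footnotes live in a different file from the citation mark, the link needs the file name in front of the ID, as in href="notes.xhtml#footnote1" (notes.xhtml being a hypothetical file name).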
You can also use a lot of the features of HTML meant primarily for websites, like <div>, but they'll get broken on some e-Readers. For example, you can use <div> and CSS float to split a page into two columns, and this works on the Calibre e-Reader, but not on the Android e-Readers I tried.
EPUB Metadata
The easy part is stuff like author, ISBN, etc. Follow best practices and put in as much info as you can; Sigil makes this easy.
The more difficult part is the table of contents and the manifest. Sigil can generate these for you, and it does it very well, but the table of contents especially can look very messy, as it defaults to adding every subheading into it.
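In the XHTML contents page (Sigil calls it nav.xhtml; toc.ncx is the older equivalent and uses its own XML tags instead) the table of contents is listed in the <nav> section. It will either accept just links <a> to pages one after another; or it will accept one (and only one) ordered list or unordered list <ol> or <ul> with any other lists tucked inside of the primary one as a list item <li>. Strictly, the EPUB 3 specification asks for a single <ol>, so that is the safest choice.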
An example contents page is something like this, not including the surrounding HTML that Sigil will generate for you:
<nav>
 <h1>Table of Contents</h1>
 <ol>
  <li><a href="chap1.html">Chapter 1</a></li>
  <li><a href="chap2.html">Chapter 2</a></li>
 </ol>
</nav>
- Where <h1> is the heading of the contents page;
- <ol> is an ordered list;
- and "chap1.html" is the name of the HTML file (which can be called anything, though it's good to give it a descriptive name) that you've put into Sigil.
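For reference, here is a sketch of the same contents page with the surrounding XHTML included and one nested list for a subheading. It assumes an EPUB 3 nav.xhtml; the epub:type="toc" attribute is what marks this <nav> as the table of contents, and the #section1 anchor is a hypothetical ID inside chap1.html. What Sigil generates will have more in it than this.

```xml
<?xml version="1.0" encoding="utf-8"?>
<!DOCTYPE html>
<html xmlns="http://www.w3.org/1999/xhtml" xmlns:epub="http://www.idpf.org/2007/ops">
<head>
  <title>Table of Contents</title>
</head>
<body>
  <nav epub:type="toc" id="toc">
    <h1>Table of Contents</h1>
    <ol>
      <li><a href="chap1.html">Chapter 1</a>
        <!-- nested list: subheadings go inside the <li> of their chapter -->
        <ol>
          <li><a href="chap1.html#section1">Section 1.1</a></li>
        </ol>
      </li>
      <li><a href="chap2.html">Chapter 2</a></li>
    </ol>
  </nav>
</body>
</html>
```

Deleting the inner <ol> is the quickest way to tidy a contents page that has picked up every subheading.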
It's useful to seek out live examples, like the Calibre manual. That's where I got the gist of how to lay out the table of contents.
