The Joys of Bookmaking in a Digital World
No, I don’t mean the kind of bookmaking that involves gambling! This post discusses techniques you can use to make or assemble your own ebooks that can be read on your Sony Reader, Amazon Kindle or other eBook reader you might own. You might be wondering why someone would want to do this. After all, there are many ebooks available for sale or downloadable from libraries (provided you have a Sony Reader) and many web sites that provide free ebooks for download or reading online.
But there are a number of reasons to do this. One is the cost factor. Recently I assembled a programming text for a somewhat obscure computer programming language. This book is a late model text that retails for about US$65. However, the author maintains a website for the book where he allows readers to read or download the chapters. His purpose in doing so, he states, is that he hopes the book will serve as a useful introduction to those who are curious, but who are not curious enough to shell out big bucks for a dead-tree book. Hey! It works for me. I’m not monetarily profiting from reading it, and I consider it a very useful introduction. He does go on to remind us that the book is available for purchase. Anyway, this book is over 500 pages long, and I am well into reading it on my Sony eBook Reader. I won’t disclose the book for legal reasons. But so far as I know, neither Amazon nor anyone else sells a digital version of this book. That’s a good reason in itself.
Another reason is that some of the free ebooks available from places like Project Gutenberg may not come in a format that your reader understands. Yet another is that there are many, many books available online as web pages. For example, http://www.freetechbooks.com/, http://2020ok.com/index.html, http://www.baen.com/library/ or a whole list of them found here: http://www.freeonlinereading.com/links.htm plus many more. These online books might already be in the correct format, or they might be hypertexts that some digital readers won’t understand. Many places do provide formats for almost every kind of reader, so it just depends on where you go. There are probably a lot more reasons too. But I’ll keep this short.
So, to start I’ll list the tools that I use to assemble these ebooks:
- Internet Explorer (recent versions preferred)
- Microsoft Word (however, a compatible alternative might work)
- A PDF virtual printer (Word 2007 and Word 2010 have these as part of their feature set. If you use an alternative for Word then this is probably required.)
- PDF editing tools – optional (The one I use only edits metadata properties such as the title and author and is freeware.)
And now to describe the process. This process calls for Internet Explorer because it has a “Save As…” format that most (any?) other browsers do not. But if there are some that do, then those may work as well. The “Web Archive, single file (*.mht)” format is the default when you perform a “Save As…” in Internet Explorer versions 7 and 8. In version 6, as I recall, you must change from the default format of “Web page, complete” to this *.mht format. It may also be possible to install a plug-in for other browsers to add this functionality, or to download a standalone program. I’ll leave it to you to explore those possibilities.
So, what’s the big deal with the *.mht format anyway? Firstly, it’s an archive format. It creates a single file, but the contents can be many files. Think of it as a Zip file, but without the compression, and you’ll get the idea. This is important with web pages because web pages can contain other parts, such as images, that are normally stored separately on the web server from the file that contains the text and formatting of the document. With this file format, those images are all in the same file as the text, formatting and layout, and this makes the job of constructing the ebook a lot simpler.
This file format has a rich history, by the way, and I have used it extensively in the past at my places of employment for other reasons, such as a software technique called round tripping. This versatile file format is a pure text file, but it can encode binary data, such as pictures. As such, files in this format are easy to edit and alter with tools that have little or no ability to handle binary files. This is the case, for example, with many Unix utilities or programming languages that don’t manipulate binary data very well. This format is well documented. I have read several of the RFCs on it, but you don’t need to understand anything more about it other than that it works. By contrast, the Microsoft Office binary file formats are not very well understood by most people and typically require a great deal of effort to modify without specific applications. Internet Explorer does not save as any of those binary formats.
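To make the "binary data as text" idea concrete, here is a minimal Python sketch. It assumes base64, which is one of the standard encodings that MIME defines for carrying binary content; the eight bytes below are just a stand-in for picture data and are not taken from any real file.

```python
import base64

# Eight arbitrary bytes standing in for image data; several are not printable text.
binary_data = bytes([0x89, 0x50, 0x4E, 0x47, 0x0D, 0x0A, 0x1A, 0x0A])

# Encode to base64: the result uses only ASCII letters, digits, '+', '/' and '='.
encoded = base64.b64encode(binary_data).decode("ascii")

# Decoding recovers the original bytes exactly.
decoded = base64.b64decode(encoded)

print(encoded)
print(decoded == binary_data)
```

Because the encoded form is ordinary text, any text-only tool can carry it around, and the original bytes come back unchanged on decoding.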
Incidentally, many of us use this file format on a daily basis. This format is the MIME encoded format that email uses. Have you ever dragged an email out of Windows Live Mail (the desktop version – not the web based one) onto your desktop or into a folder? It creates a file named with the subject of the email and a file extension of *.eml. (Note: You have to have file extensions set to show in order to see them!) This format *.eml is the same MIME format as the *.mht files. If you do this, and then open the file with a text editor, (using “Open With” or “Send to” for example) you can see the internal structure, the guts – so to speak, of an email. There is no binary data in there. You can see the text of the email, but you’ll likely see a bunch of other stuff that looks interesting, but you may not understand it. This stuff textually encodes the binary data and the structure of the document – in this case as an email. And it’s all text. This is key. Because of this, it satisfies a “lowest common denominator” scenario allowing emails to be sent over multiple diverse systems and remain intact. But it also allows emails to have attachments. And those are encoded within the single file.
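If you would rather poke at that structure with a script than with a text editor, Python's standard `email` package can read MIME messages. The sketch below is only an illustration: it builds a tiny stand-in archive in memory (one HTML part and one fake image part, both invented for the example) instead of opening a real *.mht or *.eml file, then lists the parts it finds.

```python
from email import message_from_string
from email.mime.image import MIMEImage
from email.mime.multipart import MIMEMultipart
from email.mime.text import MIMEText

# Build a tiny stand-in for a saved web archive: one HTML part plus one image part.
archive = MIMEMultipart("related")
archive["Subject"] = "Chapter 1"  # hypothetical title
archive.attach(MIMEText("<html><body><h1>Chapter 1</h1></body></html>", "html"))

# Placeholder bytes, not a real picture, so the subtype is given explicitly.
fake_image = bytes([0x89, 0x50, 0x4E, 0x47, 0x0D, 0x0A, 0x1A, 0x0A])
archive.attach(MIMEImage(fake_image, _subtype="png"))

# The serialized archive is plain text, even though it carries binary data.
archive_text = archive.as_string()


def list_parts(text):
    """Return (content type, transfer encoding) for every leaf part of a MIME message."""
    message = message_from_string(text)
    found = []
    for part in message.walk():
        if part.is_multipart():
            continue  # skip the container itself, keep only the real contents
        found.append((part.get_content_type(), part.get("Content-Transfer-Encoding")))
    return found


parts = list_parts(archive_text)
for content_type, encoding in parts:
    print(content_type, encoding)
```

To try it on a file you saved yourself, you would read the file's text and pass it to `list_parts` instead of `archive_text`. Real archives may use character sets or encodings this small example does not exercise.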
What else is important about this format? The next big reason is that the Microsoft Office suite of applications understands and can edit this *.mht file format. And because it does, we can easily get multiple web pages saved in this format into a single Word document – complete with all the pictures, diagrams and images. And it retains the original formatting and layout. If you try this with another application instead of Word, you may or may not be able to do this. Let us know your results.
So far, we have explained why we’re using Internet Explorer and Word. Next we’ll cover the PDF part. PDF files are a binary format that is, for the most part, a universally understood document encoding. Most people have Adobe Reader installed on their computer and can open these files and read them. Also, most eBook readers can load them and display them.
My personal eBook reader is a Sony variety. And while all eBook readers have their pros and cons, I chose this one for several reasons. One reason is that I can load documents – ebooks – directly from my PC to the reader. Many ebook readers can do this, but as was recently discovered, Amazon can remotely remove items from your Kindle device without your consent, as was done, ironically, in a “big brother” fashion with the “Nineteen Eighty-Four” ebook by George Orwell. This doesn’t happen on my device because it does not have the wireless WhisperNet connection. One final reason I chose Sony is that Sony is the only reader on the market that can display digitally encrypted PDF documents. And my online library, OverDrive, lends just such ebooks. The encryption allows a digital return date on borrowed books, which makes the ebook unavailable after the loan period expires. This satisfies publishers and authors, who would otherwise not allow libraries to lend digital ebooks, by only allowing a certain number to be checked out at a given time. The Kindle and all other brands currently are not able to achieve this. Oh, then there was the cost.
So, I chose the Sony eBook Reader, and will describe the process specifically for it. But, the process should be generic enough for other reader devices that have similar features. OK – enough with the disclaimers…
The process is rather simple. I would advise you to install the PDF virtual printer software regardless of what application you use to edit Word documents. You’d be surprised how much you’ll use it for other unanticipated things such as keeping a digital copy of a receipt for an online purchase. When I print them out – I lose them promptly. I PDF them, and I also lose them, but Windows has a wonderful search engine that finds them again fast for me. Anyway, I use the freeware Bullzip virtual printer, but have used others. There are many free choices out there.
This approach provides additional advantages beyond what Word provides, as it works with almost any application. For example, if you were to download a purely text document from Project Gutenberg, you might be able to load it directly to your device if your device understands text files. I imagine most do. But if not, you could just print it from Notepad, for example, directly to a PDF file on your computer. This new “printout” will be a PDF file. It works with any application that can (or has permission to) print – with the exception of applications capable of opening encrypted PDFs. So you cannot open an encrypted or protected borrowed library ebook and then print it out as a PDF from Adobe Reader or Adobe Digital Editions. That’s bad and illegal. And it won’t work. And if it isn’t encrypted, well, it’s already a PDF! So, you can save any single web page as a PDF just by printing it. But many eBooks come in parts. And combining those parts becomes the issue.
And that is where Word comes in handy. For example, the eBook I described at the beginning of this post about the Lisp programming language actually is thirty-three different web pages. Each one is a different chapter. I suspect that the author intentionally did this to prevent people from just downloading a single file and never buying the book. And that’s why I left out the title of the text. But since both Word and Internet Explorer understand the *.mht file format in common we can use Word to open the web page files we save from Internet Explorer.
We should note at this point that there are many ways to accomplish these various tasks. For example, you could cut and paste the contents of individual web pages into Word. Now, I’ve not had good luck with the formatting doing that, but you might have better luck or wish to manually control all the formatting regardless.
So here’s the process I use:
- For each web page of the online text – open it in Internet Explorer – this may entail clicking a “Next Page” link or something like that.
- Save the web page by choosing Tools > “Save As…” and give it a usable name if one does not already appear by default. Use the single file web archive format (*.mht) that has been discussed, and save all of the pages in the same directory.
- When you are done with that, then make a backup copy of that directory as later parts of the process can alter the original files.
You now have the data that your ebook will be built from. One other thing to note about this technique is that if the ebook has a “one web page per page of book” layout, then this process can become tedious. In those cases, if the book is several hundred pages long, another technique can be used to “mirror” the book on your local computer using other software and then stitch it all back into a single document. I plan to discuss that in a later post provided it isn’t a rocket science endeavor.
The next part of the process combines all those *.mht files you’ve already created. And again, be sure you backed up that directory. If you have to start over for some reason, you’ll avoid having to save all the original pages again. Backing up is very easy to do. In Windows, just right click on your folder and select Copy. Find where you want to save the backup, right click in a white area and select Paste. It’ll back up all the parts of your ebook.
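If you prefer to script the backup instead of using copy and paste, a minimal Python sketch might look like the following. The function name and the folder names are made up for illustration. The demonstration at the bottom runs on a throwaway temporary folder, so it does not touch any real files.

```python
import shutil
import tempfile
from pathlib import Path


def back_up_folder(source_dir, backup_dir):
    """Copy source_dir and everything in it to backup_dir, which must not exist yet."""
    source = Path(source_dir)
    backup = Path(backup_dir)
    if backup.exists():
        # Refuse to overwrite an earlier backup by accident.
        raise FileExistsError(f"Backup folder already exists: {backup}")
    shutil.copytree(source, backup)
    return backup


if __name__ == "__main__":
    # Demonstrate on a temporary folder that is deleted afterwards.
    with tempfile.TemporaryDirectory() as workspace:
        chapters = Path(workspace) / "my-ebook"  # hypothetical folder name
        chapters.mkdir()
        (chapters / "chapter-01.mht").write_text("placeholder", encoding="utf-8")
        backup = back_up_folder(chapters, Path(workspace) / "my-ebook-backup")
        print(sorted(p.name for p in backup.iterdir()))
```

The check for an existing backup folder is a design choice: it means running the script a second time fails loudly instead of silently mixing an old backup with a new one.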
Next open Microsoft Office Word or a compatible alternative. Whatever you use, it must be able to create master documents. Using the “Create a master document and subdocuments” process described on this Microsoft Office Online website located here:
we’ll create a document outline. This will be the master document. The process described by the link above is specifically for the 2003 version of Word, but the same instructions work for 2007. I named each top level of the outline the same as the chapter name. I then added each *.mht file as a subdocument to the outline. You’ll be prompted about editing the original files. If you made a backup, it won’t matter how you answer. I let it update the originals. Otherwise it will create its own copies. But this is up to you.
The process to “Create a master document and subdocuments” is straightforward, so I won’t detail it. You can alter the original subdocuments in Word at this point. I usually add page numbers in the footer, but you can also change the fonts, making them bigger or a different typeface. Use your imagination.
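When there are dozens of chapter files, it helps to have them listed in reading order before adding them as subdocuments. A plain alphabetical sort puts "chapter-10" ahead of "chapter-2", so the Python sketch below sorts by the numbers inside the names instead. The file names and the helper names are hypothetical. Your saved pages may be named quite differently.

```python
import re
from pathlib import Path


def natural_key(name):
    """Split a name into text and number chunks so 'chapter-2' sorts before 'chapter-10'."""
    chunks = re.split(r"(\d+)", name)
    # re.split with a capturing group puts the digit runs at the odd positions.
    return [int(chunk) if index % 2 else chunk.lower()
            for index, chunk in enumerate(chunks)]


def chapters_in_order(folder):
    """Return the *.mht files in a folder, sorted in natural (human) order."""
    return sorted(Path(folder).glob("*.mht"), key=lambda path: natural_key(path.name))


# Hypothetical file names, as they might come out of the "Save As..." step.
names = ["chapter-10.mht", "chapter-2.mht", "chapter-1.mht"]
ordered = sorted(names, key=natural_key)
print(ordered)  # ['chapter-1.mht', 'chapter-2.mht', 'chapter-10.mht']
```

Calling `chapters_in_order` on the folder of saved pages would give the same ordering for real files, which you can then follow when inserting subdocuments by hand.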
Next you’ll probably want to set the document properties. These are things like the title and author. These are inherited into the PDF document you’ll create, and they are used by your eBook reader to organize your ebooks. In Word 2007 these are set using the Office Button> Prepare>Properties. They can be set in the 2003 version as well, but I can’t document that. I don’t have a Word 2003 handy at the moment.
If you forget to set these, then you can use a property editor to fix the resulting PDF file later. I have used the freeware BeCyPDFMetaEdit for this purpose. Most PDF editors that can actually create or edit a PDF will cost you. If you have one, such as Adobe Acrobat, then you can use it to do a lot of this work instead of using Word. But it is very expensive. BeCyPDFMetaEdit does the job, and it’s freeware, but it doesn’t do much to edit the ebook content. Do that in Word before you make the PDF.
Once you have created your ebook in Word all that is left is to make it into a PDF. Word 2007 originally required you to download an add-in for this. But if you have the most recent service pack, then it’ll already be there. Just click the Office button>Save As>PDF or XPS. Fill in the name you want to give it, make sure you select the PDF format, and click Save. You will now have a PDF file of all those web pages, but all in one PDF file that is easy to read on most any reader device. If you’re using an earlier version of Office, then install the freeware Bullzip PDF virtual printer. The process is the same except that you now “Print” the document to a file location.
You can now load this document on your reader using whatever means are required to accomplish that task. With my eBook Reader, it is simply a matter of connecting the reader to the PC with a standard USB cable and dragging the new file to the folder on the device where I want it. With a Kindle, they might actually look at what you are doing. I don’t know, but I would expect that they might figure out that the book I just created is one they also sell. In such a case, they might prevent it from loading or delete it later. But I am speculating and don’t know that for a fact. There may actually be no such issues. But you need to decide these things before you purchase an ebook reader. None that I have seen are perfect, and many leave a lot to be desired. And I am not a big fan of Sony, but I made an exception in this case. Also, rumor has it that Apple will be releasing an iPad ebook reader sometime soon. We will have to wait and see.
In conclusion, a few things should be noted. Firstly, most, but not all, web pages can be saved as a *.mht file. Ironically, the pages I have the most trouble with are those on Microsoft web properties. For example, the link for master/subdocuments documentation above will not save as a *.mht using this technique. This page uses scripts and has an ActiveX control, which cause problems. Other times, a stray character outside the page’s basic character set will intentionally be used, and it’ll trip up the web archive encoding. These issues don’t stop me, but getting around that problem is another story for another post. You can use another format should you run into this problem. A full “Web page, complete” format is also available, but it isn’t as friendly. Each page will become two parts. One is the basic page sans pictures, images, diagrams, etc., and the other is a folder containing the other parts. If the first method fails, this one might work. In any case, I have never run into an online book that would not save in the *.mht format and required this other format. But saving in any format that both Internet Explorer and Word understand is key. It can even be plain text. Also, as I mentioned before, you can cut and paste the different parts into place from Internet Explorer into Word. This will likely require manual tweaking to look right in the end.
Secondly, you may be lucky and find the entire book in a single file. If you have an application that can open that file, you can print it out directly to a PDF file using a virtual PDF printer. Many ebooks are just single plain text files. In those cases, I would open them in Word or even WordPad, change the font to something more readable on your device, and then print out the PDF.
Thirdly, there are a good number of ebooks that this process will be arduous to apply to. These ebooks tend to display only a single page of the ebook at a time. They can be converted using this process, but it is tedious to deal with hundreds or thousands of pages. I am working on a technique to automate this for this type of ebook because, as you have probably surmised by now, there are some out there that I want myself.
And finally, many ebooks are hypertexts. These ebooks will have embedded links to other pages. This tends to happen mostly with the ebooks in the last category, i.e., the ones that have one page at a time delivery. These links can be modified in Word to allow your reader to go to the correct place in the document, but it requires manual tweaking to do this – at least at this time. I am also working on automating these linkages. In the end, I am hoping that pages like a table of contents will have working links to the correct parts of the document. Currently, you have to set this up by hand, and it is beyond the scope of this post to describe that process.
And one final idea before I close: regardless of where you intend to read the eBook text, if you make it a PDF then it is a portable document. PDF stands for Portable Document Format. It can be read on a PC or a Mac or Linux or maybe your mobile phone or your eBook reader or… These can be emailed or stored in your online storage such as your SkyDrive. I use this technique to share articles from online publications that require you to log in to read them. This technique can also be used to make a document version of an entire (static) website. You are only limited by your imagination. So spread the joy!