Why I Still Sometimes Print Papers, and How I Do It Better

There comes a time when you must read a long paper, book chapter, or other text. The task feels too important for a PC or phone screen, where distractions are always close. That is when I want the text on paper, but printing wastes paper, and e-readers are tricky.

TL;DR

If you’re dealing with an Image-Based PDF: use Scantailor. Prepare the images:

and then

If you’re dealing with a True-Digital PDF:

and then

Why would I print it, if it will take so much paper?

Back to the problem: you want to read a paper or book without wasting too much paper. Let’s run the numbers: 500 sheets of A4 paper cost around 13 PLN, which is 0.026 PLN per sheet, or 0.013 PLN per page when printing on both sides. Third-party toner (not made by the printer manufacturer) might cost around 0.0153 PLN per page. Together this gives $0.0153 + 0.013 \approx 0.03$ PLN per page (not per sheet).

Printing a 300-page A4 book will cost around $8$ PLN, which might be a little high for home use. Here comes the idea: print several PDF pages on one A4 page. The standard solution is to print two pages side by side on a landscape-oriented page. This would cut the cost to about 4 PLN.

I believe you can print 4 pages of a typical article on one A4 page (2x2, portrait) and keep it readable.
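To double-check the arithmetic, here is a small sketch of the estimate as a shell function. It uses the prices quoted above and assumes double-sided printing and a toner cost that scales with the number of printed sides. The name print_cost is made up for illustration.

```shell
#!/bin/sh
# Rough cost estimate in PLN, under the assumptions stated above.
# Usage: print_cost PDF_PAGES PAGES_PER_SIDE
print_cost() {
    LC_ALL=C awk -v pages="$1" -v nup="$2" 'BEGIN {
        paper_per_side = 13 / 500 / 2          # 500 sheets for 13 PLN, two sides per sheet
        toner_per_side = 0.0153                # third-party toner, per printed side
        sides = int((pages + nup - 1) / nup)   # printed sides needed, rounded up
        printf "%.1f\n", sides * (paper_per_side + toner_per_side)
    }'
}

print_cost 300 1   # prints 8.5
print_cost 300 2   # prints 4.2
print_cost 300 4   # prints 2.1
```

With these assumptions, going from one page per side to 2x2 brings a 300-page book from roughly 8.5 PLN down to roughly 2 PLN.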

The reason why you should do it is not limited to money issues. When you can see four pages simultaneously, alongside your pen notes and highlights, it might be easier to read, think about, and discuss them with people.

How to print it 2x2?

Assuming you

you can

pdfCropMargins works great on true digital PDFs created using e.g. MS Word or LaTeX. I like it more than pdfcrop, a tool shipped in the texlive-extra-utils apt package. You might want to create a wrapper like

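cat > ~/script/pdfcropmargins.sh <<'SH'
#!/bin/sh
# Run pdfCropMargins from its virtualenv, passing all arguments through.
exec your/python/venv/path/python -m pdfCropMargins "$@"
SH
chmod +x ~/script/pdfcropmargins.sh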

and invoke it like ~/script/pdfcropmargins.sh.

The values in --trim '-9.5mm -4mm -9.5mm -4mm' relate to the fact that cropping tools often leave some space, which we badly need, because we want to shrink a big page into a small one. --offset '5mm 0mm' sets the vertical space between sub-pages to 0 and the horizontal space to 5mm. See the cropping problem below for further details.
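The --trim and --offset flags have the form that pdfjam accepts, so here is a minimal sketch of the 2x2 step, assuming pdfjam is the n-up tool and the input has already been cropped. The function name nup_2x2 and the input file name are hypothetical; the trim and offset values are the ones quoted above.

```shell
#!/bin/sh
# Sketch only: assumes pdfjam (from TeX Live) does the n-up step.
# Usage: nup_2x2 INPUT.pdf [OUTPUT.pdf]
nup_2x2() {
    input=$1
    output=${2:-out2x2.pdf}
    # Four sub-pages per A4 portrait sheet, with the trim/offset values from the text.
    pdfjam --nup 2x2 --paper a4paper --no-landscape \
        --trim '-9.5mm -4mm -9.5mm -4mm' \
        --offset '5mm 0mm' \
        --outfile "$output" "$input"
}

# Example (hypothetical file name): nup_2x2 cropped.pdf
```

Keeping the values in one function makes it easy to tweak them per book and re-run until a test print looks right.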

The result often varies depending on the input file. For example, it might work great for the typical A5 paper size often used for novels:

piasecki before, piasecki after

The 2x2 page I pasted above looks fantastic when printed! Before I switched to E Ink e-book readers, I printed and read many books like that.

Unfortunately, the results might not be as good for certain page sizes, like here:

wolnelektury_first_page_in.png wolnelektury_first_page_out2x2.png

You might have to fold this for reading, but it will be ok.

I would also do this for Image-Based PDFs if I wanted to get things done fast and move on. Below, I present an approach that works better for Image-Based PDFs; it uses Scantailor.

The cropping problem

Cropping is the most important part of preparing papers for printing: once you have a well-defined region to print, and every book page is a separate PDF page (see below for what I mean), the rest is trivial.

pdfcrop works on digital PDFs, where things are simple: you estimate the region of interest by looking at the PDF’s TextBox. That said, I encourage you to look at those PDFs’ “artifacts” in the best free PDF-editing software on the market, LibreOffice Draw.

Image-based PDFs are harder in that respect. There is a solution for that problem which, I believe (I am not sure), uses computer-vision techniques such as thresholding out black points around the region of interest (ROI), itself obtained using some largest-cluster algorithm, or perhaps an OCR layer?

What is the region of interest? Of course, it is the text you want to read. But does this include footnotes, a header (with information about the chapter and such), or a page number, which can sit far below the text? The bigger the region of interest, the less readable the printed page will be (because everything on it gets smaller). There were times when I decided to crop out page numbers to make the text bigger, and then wrote the page numbers on the printed pages with a pen.
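To get a feel for how much the region of interest matters, the shrink factor can be estimated with a few lines of shell. This is a rough sketch that assumes an A4 portrait sheet (210 x 297 mm) split into an even grid and ignores gaps between sub-pages and printer margins. The name nup_scale and the 160 x 240 mm cropped size are made up for illustration.

```shell
#!/bin/sh
# Scale factor applied to a page when it is fitted into one cell of an n-up grid.
# Usage: nup_scale PAGE_WIDTH_MM PAGE_HEIGHT_MM [COLS] [ROWS]
nup_scale() {
    LC_ALL=C awk -v w="$1" -v h="$2" -v cols="${3:-2}" -v rows="${4:-2}" 'BEGIN {
        cell_w = 210 / cols        # width of one grid cell on A4 portrait
        cell_h = 297 / rows        # height of one grid cell
        sx = cell_w / w
        sy = cell_h / h
        printf "%.2f\n", (sx < sy ? sx : sy)   # the tighter dimension wins
    }'
}

nup_scale 210 297   # uncropped A4 page: prints 0.50
nup_scale 148 210   # A5 page:           prints 0.71
nup_scale 160 240   # cropped A4 text:   prints 0.62
```

An 11 pt font printed at 0.62 comes out at roughly 6.8 pt, against 5.5 pt for the uncropped A4 page, which is the difference between readable and not for me.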

Scanned page case

We often have to deal with scans like this:

raw_scanned_page.jpeg

For this, it is good to use software with a GUI to select the ROI. I chose Scantailor Advanced for this task. Scantailor is no longer in active development, but it works for my purposes. I encourage you to try some forks of the original repo. I installed Scantailor via Flatpak.

Work with Scantailor consists of 6 stages: Fix orientation, Split Pages, Deskew, Select Content, Margins, and Output. You can read about those elsewhere; I will focus on the parameters and options crucial to our problem, i.e., preparing a PDF for printing.

Page division works as expected:

nice_page_division.png

Content selection works well. If something goes wrong, you can adjust the boxes manually on any page you want.

It is usually a good idea to turn the Margins->Alignment->Match size with other pages option off.

Also, Scantailor provides an elegant way to control threshold parameters:

After you are done, you join the images into a PDF and OCR it. You can save the Scantailor project if you want. I sometimes have to work on the same book twice, for instance when I see after printing that the document does not meet my needs and I have already shut down my PC.
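The join-and-OCR step could look roughly like the sketch below. It assumes img2pdf for joining the images (my assumption, any image-to-PDF tool would do) and ocrmypdf for the text layer. The function name join_and_ocr and all file names are hypothetical.

```shell
#!/bin/sh
# Sketch only: img2pdf is an assumed choice for joining the page images.
# Usage: join_and_ocr LANG OUTPUT.pdf IMAGE...
join_and_ocr() {
    lang=$1
    output=$2
    shift 2
    joined="${output%.pdf}_joined.pdf"   # intermediate, image-only PDF
    # Wrap the page images in a single PDF, in the order given.
    img2pdf --output "$joined" "$@" || return 1
    # Add a searchable text layer; -l selects the Tesseract language.
    ocrmypdf -l "$lang" "$joined" "$output"
}

# Example (hypothetical paths): join_and_ocr pol book_ocr.pdf out/*.tif
```

The shell expands the glob in sorted order, so the pages stay in sequence as long as the image file names sort correctly.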

Sometimes you need to print the page to see if it is readable for you.

Kim_out.png

If you are satisfied with the result, out2x2.pdf, you can remove the temporary files.

rm -r images* doc*.pdf out_all.pdf out.pdf out_c.pdf dst.tif 2>/dev/null

ocrmypdf is a cool wrapper around Google's Tesseract. The -l pol option selects the language for Tesseract OCR (here, Polish).