eBookFormatter Version 2.0.0
----------------------------
Copyright 2005 - 2012
By Eric Shields

WARNING!
--------

First and foremost... In today's world this is a very necessary statement:  THE AUTHOR OF THIS CODE IS NOT RESPONSIBLE FOR THE MATERIAL UPON WHICH IT IS USED!

This is a free-time project I am pursuing in addition to full-time work.  As such, I do not have time to perform exhaustive testing of each release.  Do not expect perfect performance!  Some rare file setups may result in worse formatting than the original.  BE SURE TO BACK UP YOUR FILE(S) BEFORE USING THIS SOFTWARE, AS IT IS IMPOSSIBLE TO UNDO ANY CHANGES!

That said, if used properly, the program should not overwrite the original file, but I have to protect myself from stupid people now don't I?  Right, on to the real information!

What is eBookFormatter?
-----------------------

eBookFormatter is a program dedicated to fixing those pesky formatting issues present in a great many online text documents, eBooks in text or RTF format, files converted from PDFs, documents created by OCR (Optical Character Recognition, a popular method of retrieving text from PDF and image files), and anything similar in form.

eBookFormatter lets you convert RTF to plain text for editing (if you don't have an RTF editor, or just like plain text better), and provides many features for cleaning up the resulting text.  Perhaps you ran OCR on a PDF document, and the result was a text file with a line terminator (a newline, carriage return, or both) at the end of every line, preventing you from applying formatting such as justification.  eBookFormatter will strip them away, leaving line terminators only at the end of each paragraph.
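
To make the idea concrete, here is a minimal sketch in Python.  It is NOT the code eBookFormatter uses (the program is written in PHP and relies on its own paragraph detection).  The function name unwrap_paragraphs is made up for this example, and the sketch assumes the simplest possible case: paragraphs are already separated by blank lines.

```python
import re


def unwrap_paragraphs(text: str) -> str:
    """Join hard-wrapped lines, keeping one blank line between paragraphs.

    Assumption for this sketch only: a blank line marks a paragraph break.
    """
    # Normalize Windows (CRLF) and old Mac (CR) line terminators to LF.
    text = text.replace("\r\n", "\n").replace("\r", "\n")
    # Split on one or more blank (or whitespace-only) lines.
    paragraphs = re.split(r"\n\s*\n", text.strip())
    joined = []
    for paragraph in paragraphs:
        # Drop empty fragments and join the remaining lines with one space.
        lines = [line.strip() for line in paragraph.split("\n") if line.strip()]
        joined.append(" ".join(lines))
    return "\n\n".join(joined)


sample = "It was a dark\r\nand stormy night.\r\n\r\nThe rain fell\r\nin torrents."
print(unwrap_paragraphs(sample))
```

Text with a terminator on every line and no blank lines at all would need real paragraph detection, which this sketch does not attempt.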

eBookFormatter also includes a sophisticated method of detecting paragraphs, which is approximately 95% accurate in typical story documents, the ability to correct common errors produced by OCR, and much more.

Features
--------

* Sophisticated paragraph detection algorithm
* Correction of simple OCR related mistakes
	- Erratic ellipses (...)
	- Sentence spacing issues
* Separation of dialog segments based on quote locations
* Removal of extraneous blank lines
* Detection of numbered lists
* Removal of inter-paragraph newlines to allow for easy formatting (like justification)
* Removal of RTF formatting, resulting in plain text version
* Switch-based command line interface for easy option choosing
* Modularized features, allowing you complete control over what formatting is performed
* Built-in testSuite, so you can see for yourself that it's working correctly
* Many more to come!!

Installing eBookFormatter
-------------------------

eBookFormatter is written in PHP 5.  As such, it can be run on any platform (it is operating system and hardware independent), but it requires that you have PHP 5 or higher installed on your PC.  The installation is simple:  Visit www.php.net/downloads.php, select the appropriate file, and follow the installation instructions.

The Windows zip package need only be unzipped and the directory containing php.exe added to the PATH environment variable.  Please consult your operating system's help for how to update this variable.  This is not a requirement, however.  You can always explicitly give the full path to the text file, the php executable, or both.

Using eBookFormatter
--------------------

eBookFormatter is a command-line based program.

--- Usage ---

php eBookFormatter.php -i <inputFile> [-o <outputFile>] [-f <firstFunction>[,<secondFunction>[,<thirdFunction>[,<...>]]]] [-v] [-h]

--- Command line switch descriptions ---

-i <inputFile>                                               // Designate the input file or path
[-o <outputFile>]                                            // Designate the output file or path
[-f <firstFunction[,secondFunction[,thirdFunction[,...]]]>]  // Function(s) to perform.
                                                                PERFORMED IN THE ORDER LISTED!
[-v]                                                         // Do not add the formatter ID to the top of the file
[-h]                                                         // If 100 is specified as one of the
                                                                functions, this tells the program
                                                                to output the results in HTML format.  If
                                                                function 100 is not present, this flag is
                                                                ignored.

--- NOTES ---

If there are any spaces in your directory path, MAKE SURE TO SURROUND IT WITH QUOTES!  Otherwise it will be interpreted as separate arguments and the results will be completely unpredictable (it depends on what the arguments end up being).
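
To see why the quotes matter, the short Python sketch below uses the standard shlex module to split a command line the way a POSIX shell would (the Windows command prompt follows slightly different rules, but the effect on spaces is the same).  The path "My Books/story.txt" is a made-up example.

```python
import shlex

# Without quotes, the space in the path splits it into two arguments.
unquoted = shlex.split('php eBookFormatter.php -i My Books/story.txt')

# With quotes, the path arrives as a single argument.
quoted = shlex.split('php eBookFormatter.php -i "My Books/story.txt"')

print(unquoted)
print(quoted)
```

In the unquoted case the -i switch receives only "My", and "Books/story.txt" is left over as an argument with no switch.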

If the output switch is not present, the result will be printed to STDOUT.

Any arguments that do not have an associated switch will be ignored.

All functions (except 'rtf') assume that the file is in Plain Text format.

--- Function Descriptions ---

Function 'paragraphs': Separates the file into paragraphs, based on proper writing techniques.  This removes all original line terminators.

Function 'rtf': Attempts to remove all RTF formatting in a file.  Still in its alpha stages, this function may result in some loss of data due to the methods certain programs use to add the formatting.  The loss of data should be minimal (part of a line in uncommon conditions).

Function 'ellipses': Corrects many issues with ellipses, such as extra spaces or extra periods, converting them all into ...
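
As a rough illustration of this kind of cleanup, here is a Python sketch.  It is not the PHP code behind 'ellipses'.  The function name normalize_ellipses and the rule it applies (any run of three or more periods, with optional spaces or tabs between them, becomes exactly three periods) are assumptions made for this example.

```python
import re

# A period followed by at least two more periods, each optionally
# preceded by spaces or tabs:  ". . ."  "....."  ".. ."
ELLIPSIS_PATTERN = re.compile(r"\.(?:[ \t]*\.){2,}")


def normalize_ellipses(text: str) -> str:
    """Replace spaced-out or over-long ellipses with exactly three periods."""
    return ELLIPSIS_PATTERN.sub("...", text)


print(normalize_ellipses("Well . . . I suppose.....so"))
```

Note that this simple rule also shortens a deliberate four-period ellipsis (an ellipsis followed by a full stop) to three periods.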

Function 'spacing': Converts all instances of more than 2 spaces to 1 space and searches for places where it should be returned to 2 spaces.  Checks for extra spaces between punctuation marks and quotes. (For example . " would be converted to .")
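
The three steps described above might look something like the following Python sketch.  Again, this is not the program's PHP code.  The function name fix_spacing is made up, and the sketch assumes that every run of two or more spaces is collapsed, that two spaces belong after ., ! or ? when a capital letter or an opening quote follows, and that only straight double quotes (") are used.

```python
import re


def fix_spacing(text: str) -> str:
    """Collapse space runs, tighten closing quotes, restore sentence gaps."""
    # Step 1: collapse every run of two or more spaces to one space.
    text = re.sub(r" {2,}", " ", text)
    # Step 2: remove the space between punctuation and a closing quote,
    # so that  . "  becomes  ."  (the quote must be followed by
    # whitespace or the end of the text to count as a closing quote).
    text = re.sub(r'([.!?,]) "(?=\s|$)', r'\1"', text)
    # Step 3: put two spaces back after sentence-ending punctuation
    # (optionally followed by a closing quote) when the next character
    # is a capital letter or an opening quote.
    text = re.sub(r'([.!?]"?) (?=[A-Z"])', r"\1  ", text)
    return text


print(fix_spacing('She said, "Stop. "   He   stopped. Then he left.'))
```

A rule this simple will also widen the gap after abbreviations such as "Mr. Smith".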

Function 'split-conversations': Splits conversations that occur in one paragraph into separate paragraphs in cases where an ending quote is followed immediately by an opening quote.  (For example, ..."  "...)
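
A minimal Python sketch of the idea follows.  It is not the program's PHP code.  The function name split_conversations is made up, and the sketch assumes straight double quotes, a closing quote that directly follows ., !, ? or a comma, and a blank line as the paragraph separator.

```python
import re

# Spaces or tabs that sit between a closing quote (preceded by
# punctuation) and an opening quote (followed by a non-space character).
QUOTE_GAP = re.compile(r'(?<=[.!?,]")[ \t]+(?="\S)')


def split_conversations(text: str) -> str:
    """Start a new paragraph wherever one speaker's quote meets the next."""
    return QUOTE_GAP.sub("\n\n", text)


print(split_conversations('"Where are you going?" "Out," he said.'))
```

The gap after "Out," is left alone because the text that follows it is not an opening quote.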

Function 'fix-quotes': Attempts to locate all instances of a single quote (') that are not part of a contraction or colloquial speech, and replaces them with double quotes (").

Function 'testsuite': Runs the testSuite.  This will perform a series of tests, compare the results to expected results, and output a summary in either text or HTML format, based on whether the -h switch is active.

Known Issues
------------

There are a few things that the program is known to miss or alter incorrectly:
	- Lines with no punctuation that represent a standalone paragraph, such as titles or specially formatted text like a blockquote of a letter that a person in the story is reading
	- Lines consisting of only 1 character, usually used to denote a time lapse or a change of perspective
	- If a line in the unmodified text happens to end in a punctuation character or dash that is not meant to end a paragraph, a new paragraph will be created anyway.  This has little real impact unless the dash is being used to join the two halves of a word, but it is still incorrect.
	- Occasionally a larger than normal paragraph will be created because nothing was found to mark its proper end.

Support
-------

This is freeware.  As such, I am not in any way required to support the software.  However, bug reports, suggestions, and comments are welcome.  If I am able to do so, I will make additions, improvements, and fixes as time goes on.  Please make these reports using the appropriate interface (the GitHub or SourceForge trackers).  If this is not possible, then please email me.  See the end of this file for contact information.

Uninstalling eBookFormatter
---------------------------

As no installation is required, no uninstallation is required.  Simply delete the folder containing the eBookFormatter files (or just delete the files related to eBookFormatter).

License Agreement
-----------------

This software is released under the GNU GPL (General Public License) and is free to use, modify, and distribute so long as a complete copy of the license agreement is provided with the software.  See COPYING for the full GPL.

Donations
---------

To donate to this or any other of my projects, send a PayPal payment to shiermail-ebay@yahoo.com.  Please include in the notes which project you are supporting and any suggestions you have for what to add next.

All donations will be put towards removing obstacles preventing me from furthering my projects, i.e. having to work for food :P

Contact Information
-------------------

shiermail-website@yahoo.com
http://www.coldcandor.com
