LastZactionHero edited this page Nov 11, 2011 · 5 revisions

What is Linter?

Linter is an open-source Java library for scraping preview meta data from websites. It is designed for speed, reliability, and customization, all of which automated scraping applications such as crawlers require. It offers an alternative to expensive third-party services.

Linter provides basic preview data out of the box, and its plug-in architecture lets developers quickly extend its base functionality to capture arbitrary data from all websites or from specific ones.

Linter was designed for fast, reliable URL processing for Crowdspoke.com. See it in action: http://crowdspoke.com/topics/world-news

Features

  • Resolves shortened URLs to their destinations
  • Parses essential meta data out of the box, including page title, description, and preview image
  • Plug-in architecture allows users to quickly develop new parsers for very specific data or services
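
To illustrate the first feature, here is a minimal sketch of how a shortened URL can be followed to its destination. It is not Linter's own implementation, which this page does not show; the class name RedirectResolverSketch and the methods resolve and httpNextHop are hypothetical. The sketch assumes http/https URLs, issues HEAD requests with automatic redirects disabled, and stops on a loop or after a fixed number of hops.

```java
import java.io.IOException;
import java.net.HttpURLConnection;
import java.net.URI;
import java.util.HashSet;
import java.util.Optional;
import java.util.Set;
import java.util.function.Function;

// Hypothetical helper, not part of Linter's API.
final class RedirectResolverSketch {

    /**
     * Follows redirects one hop at a time. Stops when a URL has no further
     * redirect, when a URL repeats (redirect loop), or after maxHops hops.
     * nextHop returns the redirect target of a URL, or empty if there is none.
     */
    static String resolve(String startUrl, int maxHops,
                          Function<String, Optional<String>> nextHop) {
        Set<String> seen = new HashSet<>();
        String current = startUrl;
        for (int hop = 0; hop < maxHops && seen.add(current); hop++) {
            Optional<String> next = nextHop.apply(current);
            if (!next.isPresent()) {
                return current; // final destination reached
            }
            current = next.get();
        }
        return current; // loop detected or hop limit reached
    }

    /** Looks up one redirect hop over HTTP (assumes an http/https URL). */
    static Optional<String> httpNextHop(String url) {
        try {
            HttpURLConnection conn =
                    (HttpURLConnection) URI.create(url).toURL().openConnection();
            conn.setInstanceFollowRedirects(false); // inspect each hop ourselves
            conn.setRequestMethod("HEAD");
            conn.setConnectTimeout(5000);
            conn.setReadTimeout(5000);
            int status = conn.getResponseCode();
            String location = conn.getHeaderField("Location");
            conn.disconnect();
            if (status >= 300 && status < 400 && location != null) {
                // The Location header may be relative; resolve it against the current URL.
                return Optional.of(URI.create(url).resolve(location).toString());
            }
            return Optional.empty();
        } catch (IOException | IllegalArgumentException e) {
            return Optional.empty(); // treat unreachable or malformed URLs as final
        }
    }
}
```

Calling RedirectResolverSketch.resolve(url, 10, RedirectResolverSketch::httpNextHop) would then return the last URL reached. Passing the hop lookup as a function keeps the loop logic testable without network access.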

Building and Dependencies

Linter requires log4j (http://logging.apache.org/log4j/index.html) and the Jericho HTML parser (http://jericho.htmlparser.net/docs/index.html). Linter uses Apache Ivy and Ant for dependency management and building.

Sample Application

The Linter.java application provides a full demonstration of Linter. It parses preview meta data from every URL passed on the command line. Linter.jar is provided in the /samples directory.

java -jar Linter.jar <url1> <url2> <url3>

Example:

java -jar Linter.jar http://engt.co/qQFOC1

http://www.engadget.com/2011/10/18/android-4-0-ice-cream-sandwich-now-official/ {
  DEST URL: http://www.engadget.com/2011/10/18/android-4-0-ice-cream-sandwich-now-official/
  title: Android 4.0 Ice Cream Sandwich now official, includes revamped design, enhancements galore -Engadget
  meta_provider: linter
  description: Google has taken the stage in Hong Kong to make the next version of Android OS, nicknamed Ice Cream Sandwich, a thing of reality. Better
  fav_icon_url: http://www.blogsmithmedia.com/www.engadget.com/media/favicon.ico
  type: link
  provider_url: http://www.engadget.com
  preview_image_url: http://www.blogcdn.com/www.engadget.com/media/2011/10/ics.jpg
  provider_name: engadget.com
} in 2.0000 s

First Use

Linter is easy to include in new applications:

LintedPage lp = new LintedPage( "http://www.google.com" );
lp.process(); // Scrape preview meta data
System.out.println( lp.toDebugString() ); // Print out preview meta data
lp.getMetaData(); // Access preview meta data hash

Basic Meta Data Fields

Out of the box, Linter scrapes meta data for these fields:

  • Title (key: "title"): Page title
  • Description (key: "description"): Page description
  • Preview Image (key: "preview_image_url"): Best preview image on the page, taken from the page's meta data or, failing that, a best guess
  • Provider Name (key: "provider_name"): e.g. "Facebook"
  • Provider Url (key: "provider_url"): e.g. "http://www.facebook.com"
  • Favicon Url (key: "fav_icon_url")
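
The keys above can be read back from the meta data hash once a page has been processed. The sketch below assumes the hash behaves like a java.util.Map keyed by these strings; this page does not state the exact type returned by getMetaData(), so a plain Map stands in for it here. The class PreviewFieldsSketch and its methods are hypothetical.

```java
import java.util.Map;

// Hypothetical helper; a plain Map stands in for Linter's meta data hash.
final class PreviewFieldsSketch {

    // The basic keys listed in this section.
    static final String[] BASIC_KEYS = {
        "title", "description", "preview_image_url",
        "provider_name", "provider_url", "fav_icon_url"
    };

    /** Returns the trimmed value for key, or fallback if it is missing or blank. */
    static String field(Map<String, ?> meta, String key, String fallback) {
        Object value = meta.get(key);
        if (value == null) {
            return fallback;
        }
        String text = value.toString().trim();
        return text.isEmpty() ? fallback : text;
    }

    /** Formats every basic field as "key: value", one per line. */
    static String describe(Map<String, ?> meta) {
        StringBuilder sb = new StringBuilder();
        for (String key : BASIC_KEYS) {
            sb.append(key).append(": ")
              .append(field(meta, key, "(none)"))
              .append('\n');
        }
        return sb.toString();
    }
}
```

Not every page supplies every field, so a fallback value avoids null checks at each call site.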

Advanced Use

See the ServiceParsers Overview for detailed information on customization.

Long-Term Goals

  • Wide range of specific, up-to-date parsers
  • Support for output in oEmbed standard
  • Support for services that require log-ins (e.g. Facebook, paywalls)
  • Store cookies
  • Submit forms
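
As a rough illustration of the oEmbed goal, the sketch below maps three of the basic fields onto an oEmbed response of type "link". It is not existing Linter functionality; the class OEmbedSketch is hypothetical, and the meta data hash is again assumed to behave like a java.util.Map. The preview image is left out because oEmbed requires thumbnail_width and thumbnail_height alongside thumbnail_url, and those dimensions are not among the basic fields.

```java
import java.util.Map;

// Hypothetical helper, not part of Linter.
final class OEmbedSketch {

    /** Escapes quotes, backslashes, and control characters for a JSON string. */
    static String escape(String s) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (c == '"' || c == '\\') {
                sb.append('\\').append(c);
            } else if (c < 0x20) {
                sb.append(String.format("\\u%04x", (int) c));
            } else {
                sb.append(c);
            }
        }
        return sb.toString();
    }

    /** Builds an oEmbed "link" response, skipping fields that are absent. */
    static String toLinkJson(Map<String, ?> meta) {
        // "type" and "version" are the required oEmbed parameters.
        StringBuilder json = new StringBuilder("{\"type\":\"link\",\"version\":\"1.0\"");
        for (String key : new String[] {"title", "provider_name", "provider_url"}) {
            Object value = meta.get(key);
            if (value != null) {
                json.append(",\"").append(key).append("\":\"")
                    .append(escape(value.toString())).append('"');
            }
        }
        return json.append('}').toString();
    }
}
```

The key names title, provider_name, and provider_url already match the oEmbed parameter names, so no renaming is needed for these three fields.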
