Ready for our first proper Node.js script!

In a previous post, we learned about some tools that helped us create a script in node.js. It is now time to put this into practice by implementing a script that connects to a few online newspapers, searches in the news for specific keywords and returns those articles.

Our new script will need to accept the following parameters:

  • A file with the list of newspapers (one URL per line)
  • A file with a list of keywords (a keyword per line)

First, let’s create the following files: news_watcher.js and package.json. Remember to give news_watcher.js execute permission (chmod +x news_watcher.js). We will use three external modules, which need to be added to our package.json (see Part 1 for details).

The initial package.json should look like this in its empty state:

{
  "name": "news-logger",
  "version": "0.1.0",
  "description": "Access multiple newspapers and find news using specific keywords",
  "author": "Raul Martin"
}

Then, you need to add the dependencies as follows:

npm install cheerio --save
npm install request --save
npm install commander --save

As you can see, we will use the Cheerio, Request and Commander modules. You already know about Commander (see Part 1 if you don’t). We’ll use Request to fetch content from URLs; it hands the response to a callback function. Finally, Cheerio is a great library that builds a DOM from a string and lets you use a subset of the jQuery API on it. It is very useful for manipulating HTML and for web scraping.

Here’s what I came up with:

#!/usr/bin/env node
/*jshint node: true */
"use strict";

//get the external modules
var fs = require("fs"),
  request = require('request'),
  cheerio = require('cheerio'),
  program = require('commander');

//set the params options
program
  .version('0.1.0')
  .usage('--newspapers newspapers.txt --keywords keywords.txt')
  .option('-n, --newspapers <file>', 'Newspapers list separated by \'\\n\'')
  .option('-k, --keywords <file>', 'Keywords list separated by \'\\n\'')
  .parse(process.argv);

var newspapersFile = program.newspapers,
  keywordsFile = program.keywords;

if (!newspapersFile || !keywordsFile) {
  program.help();
}

//Helpers
var file2Array = function(fileName){
  return (
    fs.readFileSync(fileName, "utf8")
      .split('\n')
      .map(function(value) {
        return value.trim();
      })
      .filter(function(element){
        return !!element;
      })
  );
};

var getFullUrl = function(baseUrl, url) {
  url = url.trim();
  if (url.indexOf("http") !== 0) {
    url = baseUrl + url;
  }
  return url;
};

//The script

var duplicateControl = {},
  completed_request = 0,
  result = [],
  newspapers = file2Array(newspapersFile),
  keywords = file2Array(keywordsFile)
    .map(function(value) {
      return value.toLowerCase();
    });

var addNews = function(title, url, keyword) {
  if (!duplicateControl[url]) {
    duplicateControl[url] = true;
    result.push({
      'url': url,
      'title': title.trim(),
      'keyword': keyword
    });
  }
};

var processRequest = function(url, error, response, html) {
  if (!error && response.statusCode === 200) {
    var $ = cheerio.load(html);

    $('a').each(function() {
      var a = $(this),
        text = a.text().toLowerCase(),
        href = a.attr('href');

      if (!href) {
        return;
      }

      href = getFullUrl(url, href);

      //using every to stop after I match with a keyword
      keywords.every(function(keyword) {
        if (text.indexOf(keyword) !== -1) {
          addNews(a.text(), href, keyword);
          return false;
        }
        return true;
      });
    });
  }
  completed_request++;
  if (completed_request === newspapers.length) {
    console.log(result);
  }
};

newspapers.forEach(function(url){
  request(url, processRequest.bind(this, url));
});
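
One caveat: the getFullUrl helper above joins strings, so a relative link such as culture/tv or a protocol-relative one such as //cdn.example.com/x would not resolve correctly. Here is a minimal sketch of an alternative, assuming a Node.js version where the WHATWG URL class is available as a global (Node 10 and later). The name resolveLink is hypothetical and not part of the original script:

```javascript
"use strict";

// Hypothetical replacement for getFullUrl: resolves href against the
// newspaper URL with the built-in WHATWG URL class (global in Node 10+).
// Returns null for links that cannot be parsed or are not http(s).
var resolveLink = function(baseUrl, href) {
  var resolved;
  try {
    resolved = new URL(href.trim(), baseUrl);
  } catch (e) {
    return null; // unparsable href or base
  }
  if (resolved.protocol !== 'http:' && resolved.protocol !== 'https:') {
    return null; // skip mailto:, javascript:, tel: and similar links
  }
  return resolved.href;
};

console.log(resolveLink('http://www.irishtimes.com', '/culture/tv'));
// prints http://www.irishtimes.com/culture/tv
console.log(resolveLink('http://www.irishtimes.com/news/', 'world/europe'));
// prints http://www.irishtimes.com/news/world/europe
console.log(resolveLink('http://www.irishtimes.com', 'mailto:someone@example.com'));
// prints null
```

Because it returns null for unusable links, the caller would need to skip those entries before calling addNews.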

To be able to test this script, we can start with the following input files:

newspapers.txt

http://www.irishtimes.com 
http://www.irishexaminer.com 
http://www.irishmirror.ie

keywords.txt

week 
government
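
The file2Array helper trims each line and drops empty ones, so stray spaces or blank lines in these input files are harmless. To see those rules in isolation, here is a small sketch that applies the same logic to an in-memory string. The name lines2Array is hypothetical and only used for illustration:

```javascript
"use strict";

// Hypothetical variant of file2Array that works on a string instead of a
// file name, so the parsing rules can be tried without touching the disk.
var lines2Array = function(text) {
  return text
    .split(/\r?\n/) // one entry per line
    .map(function(line) { return line.trim(); })
    .filter(function(line) { return line.length > 0; });
};

console.log(lines2Array('week \n\n  government\r\n'));
// prints [ 'week', 'government' ]
```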

And finally, here is what happens when you run it!

./news_watcher.js

Usage: news_watcher --newspapers newspapers.txt --keywords keywords.txt

Options:

  -h, --help               output usage information
  -V, --version            output the version number
  -n, --newspapers <file>  Newspapers list separated by '\n'
  -k, --keywords <file>    Keywords list separated by '\n'

./news_watcher.js -n newspapers.txt -k keywords.txt

[ { url: 'http://www.irishtimes.com/culture/tv-radio-web/tv-preview-six-shows-to-watch-this-week-1.2379320', ... ]
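
Note that console.log(result) prints the array in Node's inspect format (unquoted keys, single quotes), which is not valid JSON. If the output is meant to be piped into another script, one option is to serialize it explicitly. The following is only a sketch: toJson and the sample entry are made up for illustration and do not come from a real run.

```javascript
"use strict";

// Hypothetical helper: serialize the result array as JSON so that a
// downstream script can read it back with JSON.parse.
var toJson = function(result) {
  return JSON.stringify(result, null, 2); // 2-space indentation for readability
};

// Made-up sample entry with the same shape the script builds in addNews.
var sample = [{
  url: 'http://example.com/news/1',
  title: 'Example headline about the government',
  keyword: 'government'
}];

var serialized = toJson(sample);
console.log(serialized);

// A consumer can restore the original structure.
var restored = JSON.parse(serialized);
console.log(restored[0].keyword);
// prints government
```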

Potential Improvements

  1. You can add a promise library, for example promised-io. Then you won’t need the hacky counter condition to know when to print the results.
  2. You could print the result directly in a human-readable form. I like printing it as an array so that I can reuse it as input to another node.js script (see Pipes section).
  3. The newspapers.txt file could hold a structure with each URL and a specific link selector, so that only the news section of each newspaper is scanned (less noise).
  4. You should consider error handling.
  5. You can add the Logentries logger in order to get your logs in your Logentries account (https://github.com/logentries/le_node).
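
The first and fourth points can be sketched together with native promises, which current Node.js versions provide without any extra library. This is a swap for the promise library suggested in the list, not what the original script uses. In this sketch fetchPage is a hypothetical function that returns a promise for a page's HTML. It is faked here so the example runs without network access:

```javascript
"use strict";

// Hypothetical stand-in for a real HTTP call: resolves with fake HTML,
// or rejects for one URL to simulate a newspaper that is down.
var fetchPage = function(url) {
  if (url === 'http://down.example.com') {
    return Promise.reject(new Error('connection refused'));
  }
  return Promise.resolve('<a href="/1">Government news from ' + url + '</a>');
};

// Fetch every newspaper; a failure is recorded instead of aborting the run.
var fetchAll = function(urls, fetcher) {
  return Promise.all(urls.map(function(url) {
    return fetcher(url).then(
      function(html) { return { url: url, html: html, error: null }; },
      function(err) { return { url: url, html: null, error: err.message }; }
    );
  }));
};

// The callback runs once, after every request has settled: no counter needed.
var pending = fetchAll(['http://a.example.com', 'http://down.example.com'], fetchPage);
pending.then(function(pages) {
  pages.forEach(function(page) {
    console.log(page.url, page.error ? 'failed: ' + page.error : 'ok');
  });
});
```

In the real script, the fake fetchPage would be replaced by a wrapper around the HTTP client, and the Cheerio parsing would run inside the final then callback.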

Conclusion

If you need a scripting language to run from your command line and you feel comfortable with Node.js, I think I have given you a really interesting option to create, run and even share your scripts. Practically speaking, I use this a lot to create quick tests and benchmarking scripts, as I know I can leverage JavaScript capabilities fast and bring complex algorithms to my shell.

The other great aspect is being able to use all those external modules listed on the official npm page.

Here is a list of the ones that I like and use regularly:

Finally, if you ever end up doing katas to improve your programming skills, this is a really nice way to get going fast! (Don’t forget your unit tests.)

