Node.js Web Scraping (Step-By-Step Tutorial)

Oxylabs

Hello everybody, I am Daniel, and today we'll be looking at a real-life scenario of how to scrape with Node.js. But before we start, if you're curious to find out more about similar topics or simply anything data gathering, do subscribe and check out our channel.

Now, let's take a quick look at what Node really is. In simple terms, Node.js is a server-side platform built around JavaScript that provides the building blocks for scalable, event-driven applications. This is useful to developers for a multitude of reasons. Some core ones are that it is easy to learn, as hopefully this video will illustrate, and that it keeps things simple. A prime example of its simplicity is that it requires fewer files and less code for backend tasks than many other languages.

So, when it comes to scraping, Node is often the primary choice simply because of how beneficial it is, whether that is its ability to scale, its notable support community, or its extensive customizability. The speed at which you can collect data from many different websites with Node is often unparalleled, and the cost savings from its light resource usage are significant as well.

Now that we have briefly looked at what Node.js is and why it's so popular for scraping, it's time for a practical tutorial that walks through all the steps you'd go through if you were setting up Node.js scraping yourself.

Firstly, only two pieces of software are needed: Node.js, which comes with npm (the package manager), and any code editor. Secondly, you should know that Node.js is a runtime environment. This simply means that JavaScript code, which typically runs in a browser, can run without a browser. When it comes to operating systems, Node.js is available for Windows, macOS, and Linux, and it can be downloaded from the official download page.

Before you write any code to scrape the web using Node.js, create a folder where the project files will be stored. These files will contain all the code required for web scraping. Once the folder is created, navigate to it and run the initialization command, npm init. This will create a package.json file in the directory. This file will contain information about the packages that are installed in this folder.

The next step is to install the Node.js packages. For Node.js web scraping, we need certain packages, also known as libraries. These libraries are prepackaged code that can be reused. To install any package, simply run npm install followed by the package name; for example, to install the package axios, run npm install axios in your terminal. The other packages used in this tutorial are installed the same way (the full set of setup commands is sketched a little further below). Running npm install downloads the packages to the node_modules directory and updates the package.json file.

Continuing, it should be mentioned that almost every web scraping project using Node.js or JavaScript involves three basic steps: sending the HTTP request; parsing the HTTP response and extracting the desired data; and saving the data in some persistent storage, e.g. a file or a database.

The following sections will demonstrate how Axios can be used to send HTTP requests, how Cheerio can parse the response and extract the specific information that is needed, and, finally, how the extracted data can be saved to CSV using json2csv. The first step of web scraping with JavaScript is to find a library that can send HTTP requests and return the response.
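Here is that setup sketch. The exact commands shown on screen are not part of this transcript, so this is an assumption based on the steps described above and the packages used later in the tutorial (axios, cheerio, json2csv); the folder name is purely illustrative:

    mkdir bookstore-scraper && cd bookstore-scraper   # hypothetical project folder
    npm init -y                                       # creates package.json with defaults
    npm install axios cheerio json2csv                # downloads packages into node_modules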
Even though the request and request-promise packages have been quite popular in the past, they are now seen as somewhat outdated, although you'll probably still find many examples of old code using them. With millions of downloads every day, Axios is a good alternative. It fully supports Promise syntax as well as async-await syntax.

Another benefit of Node.js is that it works well with the useful package Cheerio. This package is valuable because it converts the raw HTML captured by Axios into something that can be queried using a jQuery-like syntax. JavaScript developers are usually familiar with jQuery, which makes Cheerio a very good choice for extracting information from HTML.

One of the most common web scraping scenarios is scraping e-commerce stores. A good place to start is the fictional book store at http://books.toscrape.com/. This site is very much like a real store, except that it is fictional and built to let you practice web scraping.

Before beginning JavaScript web scraping, selectors must be created. The purpose of a selector is to identify the specific element to be queried.

Begin by opening the URL. The simplest way to create a selector is to right-click the h1 tag in the Developer Tools, point to Copy, and then click Copy Selector. This produces a long, very specific selector. That selector is valid and works well. The only problem is that this method creates a brittle selector: because it is so specific, any change in the layout would require you to update it, which sometimes makes the code difficult to maintain.

Nevertheless, after spending some time with the page, it becomes clear that there is only one h1 tag on the page. This makes it very easy to use a much shorter selector: simply h1.

Alternatively, a third-party tool like the Selector Gadget extension for Chrome can be used to create selectors very quickly. This is a useful tool for web scraping in JavaScript. Note that while it works most of the time, there will be cases where it does not, so understanding how CSS selectors work is always a good idea. W3Schools has a good CSS Reference page.

Now you should look at scraping the genre. The first part is to define the constants that will hold a reference to Axios and Cheerio. The address of the page that is being scraped is saved in the variable URL for readability.

Axios has a method get() that sends an HTTP GET request. Note that this is an asynchronous method and thus needs the await prefix. If there is a need to pass additional headers, for example User-Agent, they can be sent as the second parameter. This particular site does not need any special headers, which makes it easier to learn.

The response has a few attributes such as headers, data, etc. The HTML that we want is in the data attribute. This HTML can be loaded into an object that can be queried using the cheerio.load() method.

Cheerio's load() method returns a reference to the document, which can be stored in a constant. The constant can have any name; to make our web scraping code look and feel more like jQuery, the dollar sign $ can be used as the name. Finding a specific element within the document is then as easy as writing $(""); in this particular case, it would be $("h1").

The text() method is very handy when writing web scraping code with JavaScript, as it returns the text inside any element. That text can be extracted and saved in a local variable. Finally, console.log() will simply print the variable's value to the console.
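The actual code shown on screen is not included in this transcript, so here is a minimal sketch of the steps just described. The Mystery category URL and the variable names (url, genre) are illustrative assumptions:

    // genre.js (core logic, before error handling is added)
    const axios = require('axios');
    const cheerio = require('cheerio');

    // page being scraped; the Mystery category is used here as an example
    const url = 'http://books.toscrape.com/catalogue/category/books/mystery_3/index.html';

    async function getGenre() {
        // get() is asynchronous, so await is required
        const response = await axios.get(url);
        // extra headers such as User-Agent could be passed as a second parameter:
        // await axios.get(url, { headers: { 'User-Agent': 'my-scraper' } });

        // the HTML lives in response.data; load it into a queryable object
        const $ = cheerio.load(response.data);

        // there is only one h1 on the page, so this selector is enough
        const genre = $('h1').text();
        console.log(genre);
    }

    getGenre();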
To handle errors, the code will be surrounded by a try-catch block. Note that it is good practice to use console.error for errors and console.log for other messages.

With error handling added, put the complete code together and save it as genre.js in the folder created earlier, where the command npm init was run. The final step is to run it using Node.js: open the terminal and run node genre.js. The output of this code is going to be the genre name. Congratulations! This was the first program that uses JavaScript and Node.js for web scraping. Time to do more complex things!

Let's try scraping listings. Here is the same page with the book listing of the Mystery genre. The first step is to analyze the page and understand the HTML structure: load this page in Chrome, press F12, and examine the elements. Each book is wrapped in its own tag, and a loop can be run to extract individual book details. If the HTML is parsed with Cheerio, the jQuery-style each() function can be used to run that loop. Start by extracting the titles of all the books.

The extracted details need to be saved somewhere from inside the loop, and the best idea is to store these values in an array. In fact, other attributes of the books can be extracted and stored as JSON objects in that array.

Putting the complete code together, create a new file, paste the code in, and save it as books.js in the project folder. Run it with node books.js from the terminal; this should print the array of books to the console. The only limitation of this JavaScript code is that it scrapes only one page. The next section covers how pagination can be handled.

Listings are usually spread over multiple pages. While every site may have its own way of paginating, the most common one is a next button on every page. The exception is the last page, which does not have a next page link.

The pagination logic for these situations is rather simple. Start by creating a selector for the next page link. If the selector returns a value, take its href attribute value and call the getBooks function recursively with this new URL. Add these lines immediately after the books.each() loop.

Note that the href returned above is a relative URL. To convert it into an absolute URL, the simplest way is to concatenate a fixed part to it. This fixed part of the URL is stored in the baseUrl variable.

Once the scraper reaches the last page, the Next button will not be there and the recursive calls will stop. At this point, the array will have book information from all the pages. The final step of web scraping with Node.js is to save the data.

Interestingly, while web scraping with JavaScript is fairly easy, saving the data to a CSV file is even easier. It can be done with two packages, fs and json2csv. The file system is represented by the package fs, which is built in; json2csv needs to be installed with npm install json2csv.

After the installation, create a constant that will store a reference to this package's Parser. Access to the file system is needed to write the file to disk, so initialize the fs package as well. Then find the line in the code where the array with all the scraped data is available, and insert the lines of code that create the CSV file, as sketched below.
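Those CSV-writing lines are not reproduced in the transcript, so here is a minimal sketch, assuming the scraped objects are collected in an array named books (a small sample array is included so the snippet runs on its own):

    // saving an array of objects to CSV with json2csv and the built-in fs module
    const fs = require('fs');
    const { Parser } = require('json2csv');

    // stand-in for the array of book objects built inside the scraping loop
    const books = [{ title: 'Sample Book', price: '£10.00' }];

    // convert the array to a CSV string and write it to disk
    const parser = new Parser();
    const csv = parser.parse(books);
    fs.writeFileSync('./books.csv', csv);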

Here is the complete script put together.
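The full script shown in the video is not reproduced in this transcript, so the following is a rough reconstruction based on the steps above. Names taken from the tutorial (getBooks, baseUrl, books.js, books.csv) are kept; the CSS selectors (.product_pod, h3 a, .price_color, .next > a) and the exact fields extracted are assumptions about the book store's markup:

    // books.js: listing loop, pagination, and CSV export in one script
    const axios = require('axios');
    const cheerio = require('cheerio');
    const fs = require('fs');
    const { Parser } = require('json2csv');

    // fixed part of the URL, used to turn relative "next" links into absolute ones
    const baseUrl = 'http://books.toscrape.com/catalogue/category/books/mystery_3/';
    const books = [];

    async function getBooks(url) {
        try {
            const response = await axios.get(url);
            const $ = cheerio.load(response.data);

            // loop over every book card on the page (selector assumed from the site)
            $('.product_pod').each(function () {
                const title = $(this).find('h3 a').attr('title');
                const price = $(this).find('.price_color').text();
                books.push({ title, price });
            });

            // pagination: if a next link exists, follow it recursively
            const nextHref = $('.next > a').attr('href');
            if (nextHref) {
                await getBooks(baseUrl + nextHref);
            } else {
                // last page reached: save everything collected so far to CSV
                const parser = new Parser();
                const csv = parser.parse(books);
                fs.writeFileSync('./books.csv', csv);
                console.log('Saved ' + books.length + ' books to books.csv');
            }
        } catch (error) {
            console.error(error);
        }
    }

    getBooks(baseUrl + 'index.html');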

This can be saved as a .js file in the Node.js project folder. Run the file from the terminal with the node command, and data from all the pages will end up in a new file, books.csv, which contains all the desired data. It can be viewed with any spreadsheet program, such as Microsoft Excel.

This whole exercise of web scraping using JavaScript and Node.js can be broken down into three steps: sending the request, parsing and querying the response, and saving the data. There are many packages available for each of these steps; in this tutorial we focused on Axios, Cheerio, and json2csv for these primary tasks.

If you have any questions about scraping with Node.js or would like to know more about related topics, feel free to contact us at [email protected] or leave a comment below. If you appreciate our content, hit that like button and share this video on your social media!

Thank you for tuning in, this was Oxylabs, and we hope to see you next time!
