NodeJS Tutorial: Parse HTML From URL

This guide covers two things: getting the raw HTML from a URL, and then parsing that HTML. By the end, you will know how to parse HTML from any URL using NodeJS.

The Demo Project We Will Work On

We are going to make a NodeJS script that gives all the details about the ratings and reviews stats of a particular product on Amazon.

To do this, we will use the following section on the Amazon website.

When the user hovers over the star rating section, the Amazon website makes an AJAX request to fetch the reviews and ratings. The HTML we are interested in shows up in a pop-up.

Step 1: Identify The URL

We are going to use Firefox (or Chrome) developer tools to help us zero in on the AJAX request being made to fetch the above ratings and reviews HTML data.

Below is a video where I show the process of finding the URL.

Once you have the URL, you can get Firefox to generate the request code for you. All you have to do is right-click on the request in the Network tab and then hit “Copy As Fetch”.

Image: Copying the request from the Network tab as fetch.

When you paste what you have copied into VSCode or Sublime, you will get something like this:

await fetch(
    "https://www.amazon.in/gp/customer-reviews/widgets/average-customer-review/popover/ref=dpx_acr_pop_?contextId=dpx&asin=B07WDKLDRX",
    {
        "credentials": "include",

        "headers": {
            "User-Agent": "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:105.0) Gecko/20100101 Firefox/105.0",
            "Accept": "text/html,*/*",
            "Accept-Language": "en-US,en;q=0.5",
            "X-Requested-With": "XMLHttpRequest",
            "Sec-Fetch-Dest": "empty",
            "Sec-Fetch-Mode": "cors",
            "Sec-Fetch-Site": "same-origin"
        },

        "referrer": "https://www.amazon.in/iQOO-128GB-Storage-Snapdragon%C2%AE-FlashCharge/dp/B07WDKLDRX?pf_rd_r=60R957QGQR6ES7TNWJV7&pf_rd_p=ac6104ed-2682-452c-9db1-7d93b5cfd333&pd_rd_r=7c01d484-489f-4b29-b723-21bf661e2624&pd_rd_w=TbBjB&pd_rd_wg=RkAi5&ref_=pd_gw_unk",

        "method": "GET",

        "mode": "cors"
    }
);

Besides giving you ready-made code, the other good thing about this technique for finding the URL is that you get all the headers, like User-Agent, for free. These make your request look like a real request from a real browser even though it is being generated programmatically.
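The copied URL carries the product in its “asin” query parameter, so the same request can probably be reused for other products by swapping that one value. Below is a minimal sketch of that idea. The helper name buildReviewPopoverUrl is made up for this example, and it assumes the endpoint behaves the same way for other ASINs, which is not verified here.

```javascript
// The URL exactly as copied from the browser in this step
const REVIEW_POPOVER_URL =
    "https://www.amazon.in/gp/customer-reviews/widgets/average-customer-review/popover/ref=dpx_acr_pop_?contextId=dpx&asin=B07WDKLDRX";

// Hypothetical helper: build the pop-up URL for a given product.
// Assumption: only the "asin" parameter needs to change between products.
function buildReviewPopoverUrl(asin) {
    const url = new URL(REVIEW_POPOVER_URL);
    // Replace only the "asin" query parameter; everything else stays as copied
    url.searchParams.set("asin", asin);
    return url.toString();
}

console.log(buildReviewPopoverUrl("B07WDKLDRX"));
```

Using the built-in URL class instead of string concatenation means the ASIN gets encoded properly if it ever contains unexpected characters.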

Step 2: Fetching The HTML Data From The URL

To do this, we are going to use the node-fetch NPM package.

This package lets us use the code copied above directly in a NodeJS script. Combining the previous step with this NPM package, we get the HTML easily.

Below is a script that brings in the node-fetch package and then makes a fetch request. After the fetch request, it stores the response HTML into a variable which we will use in the next step.

import fetch from 'node-fetch';

let response = await fetch("https://www.amazon.in/gp/customer-reviews/widgets/average-customer-review/popover/ref=dpx_acr_pop_?contextId=dpx&asin=B07WDKLDRX", 
{
    "credentials": "include",

    "headers": {
        "User-Agent": "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:105.0) Gecko/20100101 Firefox/105.0",
        "Accept": "text/html,*/*",
        "Accept-Language": "en-US,en;q=0.5",
        "X-Requested-With": "XMLHttpRequest",
        "Sec-Fetch-Dest": "empty",
        "Sec-Fetch-Mode": "cors",
        "Sec-Fetch-Site": "same-origin"
    },

    "referrer": "https://www.amazon.in/iQOO-128GB-Storage-Snapdragon%C2%AE-FlashCharge/dp/B07WDKLDRX?pf_rd_r=60R957QGQR6ES7TNWJV7&pf_rd_p=ac6104ed-2682-452c-9db1-7d93b5cfd333&pd_rd_r=7c01d484-489f-4b29-b723-21bf661e2624&pd_rd_w=TbBjB&pd_rd_wg=RkAi5&ref_=pd_gw_unk",

    "method": "GET",

    "mode": "cors"
});

let html = await response.text();
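The script above reads the body no matter what the server answered. If the request gets blocked or the URL is wrong, you would end up parsing an error page. Here is a minimal sketch of a wrapper that checks the status first. The name fetchHtml and the fetchImpl parameter are made up for this example; fetchImpl is whichever fetch you have (the node-fetch import, or the global fetch in Node 18 and later).

```javascript
// Hypothetical helper (not part of node-fetch): fetch a page and fail loudly on HTTP errors.
async function fetchHtml(url, options, fetchImpl) {
    const response = await fetchImpl(url, options);

    // response.ok is true only for 2xx status codes
    if (!response.ok) {
        throw new Error(`Request failed: ${response.status} ${response.statusText}`);
    }

    // Resolves to the response body as a string
    return response.text();
}

// Usage, with the URL and options copied from the browser:
// const html = await fetchHtml(url, options, fetch);
```

Passing fetch in as an argument keeps the helper independent of how fetch was imported, and also makes it easy to try out with a fake fetch.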

Some Notes About Installation Of “node-fetch”

If you look at the documentation of node-fetch, it asks you to install node-fetch into your project with the following line:

npm install node-fetch

But also notice the warning underneath it:

Image: The node-fetch installation warning about how to import the package.

Basically, if you use the latest version, you will have to use the ES Modules (ESM) import syntax, like so:

import fetch from 'node-fetch';

You will not be able to import it with the CommonJS “require()” syntax.

So, choose the version you add to your project carefully. I have addressed this here since you might hit a snag. There is also a workaround for using import and require in the same NodeJS file, if that is what you need.
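One option that works from a CommonJS file is the dynamic import() function, which returns a promise and is available in both module systems. Below is a minimal sketch. The helper name loadDefaultExport is made up for this example, and this may not be the exact workaround referred to above.

```javascript
// Hypothetical helper: load an ESM-only package from a CommonJS file.
// import() returns a promise that resolves to the module's exports.
async function loadDefaultExport(specifier) {
    const loaded = await import(specifier);
    // ES modules expose their default export under the "default" key
    return loaded.default;
}

// Usage in a CommonJS script, assuming the latest node-fetch is installed:
// loadDefaultExport('node-fetch').then((fetch) => fetch('https://example.com'));
```

The catch is that everything becomes asynchronous, so the rest of your script has to wait for the promise before it can call fetch.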

Step 3: Parsing & Extracting Data From The HTML Response

In order to parse the HTML, we are going to use the NPM package node-html-parser.

The package claims that it is the fastest Node HTML parser. Here is a description from the NPM page:

Fast HTML Parser is a very fast HTML parser. Which will generate a simplified DOM tree, with element query support.

– Fast HTML Parser NPM Page

Okay, so install the above package and import it in the usual way. Now, we are ready to use it.

First, we are going to look at what the HTML looks like from the above step. Below is a small, formatted snippet from the HTML.

<!-- There is more HTML above. Have cut it out for clarity & focus. -->

<div class="a-icon-row a-spacing-small a-padding-none">
   <i data-hook="average-stars-rating-anywhere" class="a-icon a-icon-star a-star-4-5"><span class="a-icon-alt">4.4 out of 5</span></i>
   
   <span data-hook="acr-average-stars-rating-text" class="a-size-medium a-color-base a-text-beside-button a-text-bold">4.4 out of 5</span>
</div>

<div class="a-row a-spacing-medium"><span data-hook="total-review-count" class="a-size-base a-color-secondary totalRatingCount">14,706 global ratings</span></div>

<!-- There is more HTML below. Have cut it out for clarity & focus. -->

3 Steps To Using The node-html-parser Package:

Step 1: Load The HTML & Parse It (You Only Have To Do This Once)

const root = parse(html);

Step 2: Use CSS Selectors To Get A Particular Element

If you look at the HTML, you can usually find some unique CSS class which can be used to zero in on a particular element. For example, the HTML element for the number of ratings is:

<span data-hook="total-review-count"
      class="a-size-base a-color-secondary totalRatingCount">
          14,706 global ratings
</span>

So, we can use the CSS selector “span.totalRatingCount” to fetch the element. If you need to learn about all your options when using CSS selectors, you can check out this article.

const numberOfRatingsElement = root.querySelector("span.totalRatingCount");

Instead of “querySelector”, you can also use “querySelectorAll” if you want all the matching elements, not just the first one. There is an example of this further below.

Step 3: Get the Text Content Of The Number Of Ratings Element

const numberOfRatingText = numberOfRatingsElement.textContent;
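The text you get back is a string like “14,706 global ratings”. If you want the count as a number, you have to strip the commas and the trailing words yourself. Here is a minimal sketch. The helper name parseRatingCount is made up for this example, and it assumes the count comes first and uses commas as thousands separators, as in the snippet above.

```javascript
// Hypothetical helper: turn text like "14,706 global ratings" into the number 14706.
function parseRatingCount(text) {
    // Grab the leading run of digits and commas, e.g. "14,706"
    const match = text.trim().match(/^\d[\d,]*/);
    if (!match) {
        return null; // the text did not start with a number
    }
    // Drop the commas before converting to a number
    return Number(match[0].replace(/,/g, ""));
}

console.log(parseRatingCount("14,706 global ratings")); // 14706
```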

The Complete Code: Getting The HTML From The URL & Parsing The HTML To Get The Total Rating Count

// Import the libraries
import fetch from 'node-fetch';
import { parse } from 'node-html-parser';

// Use the fetch code we copy pasted from the browser
let response = await fetch("https://www.amazon.in/gp/customer-reviews/widgets/average-customer-review/popover/ref=dpx_acr_pop_?contextId=dpx&asin=B07WDKLDRX", {
    "credentials": "include",
    "headers": {
        "User-Agent": "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:105.0) Gecko/20100101 Firefox/105.0",
        "Accept": "text/html,*/*",
        "Accept-Language": "en-US,en;q=0.5",
        "X-Requested-With": "XMLHttpRequest",
        "Sec-Fetch-Dest": "empty",
        "Sec-Fetch-Mode": "cors",
        "Sec-Fetch-Site": "same-origin"
    },
    "referrer": "https://www.amazon.in/iQOO-128GB-Storage-Snapdragon%C2%AE-FlashCharge/dp/B07WDKLDRX?pf_rd_r=60R957QGQR6ES7TNWJV7&pf_rd_p=ac6104ed-2682-452c-9db1-7d93b5cfd333&pd_rd_r=7c01d484-489f-4b29-b723-21bf661e2624&pd_rd_w=TbBjB&pd_rd_wg=RkAi5&ref_=pd_gw_unk",
    "method": "GET",
    "mode": "cors"
});
let html = await response.text();

// Parse the HTML
const root = parse(html);

// Get The Rating Element
const numberOfRatingsElement = root.querySelector("span.totalRatingCount");

// Extract The Text Out Of The Rating Element: 14,706 global ratings
const numberOfRatingText = numberOfRatingsElement.textContent;
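One thing to watch out for: if the selector matches nothing (say, the markup changed or you got a different page back), “querySelector” gives you null, and reading “textContent” from it throws an error. Below is a minimal sketch of a guard for that. The helper name getTextOrNull is made up for this example; “root” is the object returned by parse(html) in the code above.

```javascript
// Hypothetical helper: return the trimmed text of the first match, or null if nothing matched.
function getTextOrNull(root, selector) {
    const element = root.querySelector(selector);
    // Only touch textContent when an element was actually found
    return element ? element.textContent.trim() : null;
}

// Usage:
// const ratingText = getTextOrNull(root, "span.totalRatingCount");
// if (ratingText === null) console.log("Rating count not found. Did the page layout change?");
```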

Example Of Using querySelectorAll & Extracting Attribute Values Instead Of Text

In the HTML we got from the URL, there is also a table. Each row of the table has information about the number of ratings at a particular star level. Here is a snippet of the HTML with that information.

<tr data-reftag=""
    data-reviews-state-param="{&quot;filterByStar&quot;:&quot;five_star&quot;, &quot;pageNumber&quot;:&quot;1&quot;}"
    aria-label="5 stars represent 65% of rating" class="a-histogram-row a-align-center">

There are many such “tr” elements in the HTML. So, we can use the following NodeJS code to extract the information:

// Use querySelectorAll to get all the "tr" elements in an array
const allTheRatingPercentageLinks = root.querySelectorAll('tr');

// Loop over each of the elements you got
allTheRatingPercentageLinks.forEach((element) => {

    // Extract the attribute aria-label from the elements
    console.log(element.getAttribute('aria-label'))
    
})

Notice that:

  1. We used “querySelectorAll” to get all the “tr” elements. That gave us an array of all the “tr” elements.
  2. We used “getAttribute” to get the attribute called “aria-label” instead of the text content as we did above.

Since we are just printing to the console in the above code, the output looks something like this:

5 stars represent 65% of rating
4 stars represent 25% of rating
3 stars represent 5% of rating
2 stars represent 2% of rating
1 stars represent 4% of rating
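These labels are still plain strings. If you want the star level and the percentage as numbers, a small regular expression does the job. Here is a minimal sketch. The helper name parseHistogramLabels is made up for this example, and it assumes the labels keep the wording shown in the output above. Since the code selects every “tr”, some rows may have no “aria-label” at all, so those are skipped.

```javascript
// Hypothetical helper: parse labels like "5 stars represent 65% of rating".
// Labels that are missing or do not fit the pattern are skipped.
function parseHistogramLabels(labels) {
    const pattern = /^(\d+) stars? represent (\d+)% of rating/;
    const rows = [];
    for (const label of labels) {
        const match = typeof label === "string" ? label.match(pattern) : null;
        if (match) {
            rows.push({ stars: Number(match[1]), percent: Number(match[2]) });
        }
    }
    return rows;
}

const sample = [
    "5 stars represent 65% of rating",
    undefined, // e.g. a row with no aria-label
    "1 stars represent 4% of rating",
];
console.log(parseHistogramLabels(sample));
```

With the parsed HTML from above, you would build the labels array with something like root.querySelectorAll('tr').map((element) => element.getAttribute('aria-label')).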

Conclusion

I hope the above guide gives you all the tools, techniques and sample code you need to use NodeJS to parse HTML from any URL. The technique is always the same: use the browser to help you with the “fetch” code, then use the “fetch” code to get the HTML you need, and finally use HTML parsing with CSS selectors to extract the information from the HTML.

Hope this was useful.