Crawl Web Pages, Make XLS File and Email in NodeJS - Lab Task 5


Background

These days, a huge volume of data is produced that is difficult for humans to process manually. So people make use of automation tools, i.e. software, to process, filter and understand the stuff of their interest. For example, a shopkeeper dealing in second-hand mobiles may be interested in specific models and makes of mobiles that sellers upload within a specific price range from a specific city. To filter such ads, he has to spend a lot of time daily on an ads website, e.g. olx. What he can do instead is use software to filter all such ads for him.

In the data-processing category on freelancing websites, many clients ask to scrape data from given website(s), filter the data of their interest, create XLS or PDF files, and email them or upload them to their FTP server, periodically ... all automatically, i.e. using a program. Though statically typed languages (Java, C#, etc.) are a better choice and have lots of libraries for such work, NodeJS is on the rise, so in the future there may be a lot of data-processing work to be done in NodeJS, as people (developers and clients) try to continue with their core technology stack as long as it can serve the purpose. I see many libraries on the rise on NPM that are helpful for data-processing work. It's time to get some experience with such libraries.

Task Description

By now, you must have gained some understanding of how to use JavaScript and Node packages. It's time to make something useful with them. Here is the flow that you shall implement in your lab task:
  1. Scrape a few smartphone pages from an ads site or online store (I suggest you use myshop.pk, as their pages' structure is simple and they do not make use of AJAX, unlike olx). You can store the URLs of the pages in a string array or get all product URLs from a product listing page. (You must scrape respectfully: do not send too much traffic to their server, and add some intervals between calls, as there are rules that all developers and search engines follow when they crawl the web.) A sketch of polite downloading is given after this list.
  2. Extract some attributes of each scraped product, e.g. price, make, model, color, RAM, weight, etc. Assume you scraped 100 products; now filter some products of your interest. For the time being, you can hard-code the parameter values used to filter the complete data set. Let's say that, of the 100 downloaded, 30 met your defined parameters of interest. Use any server-side DOM parser to scrape and filter the required data from the downloaded pages' HTML. To download/parse, I suggest you use jsdom or HTML DOM Parser, or whatever you get working successfully. A parsing and filtering sketch follows the list.
  3. After the data of interest is filtered, create an Excel file of that data (fields listed above). You can use exceljs or any other package that seems appropriate. This article seems helpful. A small exceljs sketch is also given below.
  4. Once the XLS or XLSX file is created, send that file via email. You can send it to yourself for testing purposes. I suggest you use nodemailer or NestJS' mailer. This article seems helpful. A nodemailer sketch is given below as well.
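Below is a minimal sketch of step 1. It assumes Node 18+ so the global fetch is available (for older versions you can use a package like axios or node-fetch). The URLs are placeholders, not real product pages; the delay between requests is there so you scrape respectfully.

    // Download product pages one by one, with a pause between requests.
    const productUrls = [
      'https://www.myshop.pk/example-phone-1',  // placeholder URLs, replace with real ones
      'https://www.myshop.pk/example-phone-2',
    ];

    const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

    async function downloadPages(urls) {
      const pages = [];
      for (const url of urls) {
        const response = await fetch(url);           // download one page
        pages.push({ url, html: await response.text() });
        await sleep(3000);                           // be polite: wait 3 seconds between calls
      }
      return pages;
    }

    downloadPages(productUrls).then((pages) => console.log(`Downloaded ${pages.length} pages`));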
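For step 2, a rough sketch using jsdom is given below. The CSS selectors and the hard-coded price limit are assumptions; inspect the actual markup of the pages you scrape and adjust them, and extend the returned object with the other attributes (make, color, RAM, weight, etc.).

    const { JSDOM } = require('jsdom');

    function extractProduct(html) {
      const { document } = new JSDOM(html).window;
      return {
        // Hypothetical selectors: replace them with the real ones from the page source.
        model: document.querySelector('h1.product-title')?.textContent.trim(),
        price: Number(document.querySelector('.price')?.textContent.replace(/[^0-9]/g, '')),
      };
    }

    // Hard-coded filter parameters, as allowed in step 2.
    const MAX_PRICE = 50000;

    function filterProducts(products) {
      return products.filter((p) => p.price && p.price <= MAX_PRICE);
    }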
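For step 3, a small sketch using exceljs could look like this. The column list is only an example; extend it to cover all the fields you extracted.

    const ExcelJS = require('exceljs');

    async function writeXlsx(products, filePath) {
      const workbook = new ExcelJS.Workbook();
      const sheet = workbook.addWorksheet('Phones');
      // Column keys must match the property names produced in the extraction step.
      sheet.columns = [
        { header: 'Model', key: 'model', width: 30 },
        { header: 'Price', key: 'price', width: 12 },
      ];
      sheet.addRows(products);                  // one row per filtered product
      await workbook.xlsx.writeFile(filePath);  // e.g. 'phones.xlsx'
    }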
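For step 4, a sketch using nodemailer over Gmail SMTP is given below. The account and app password are placeholders; any SMTP settings from your own email provider will work the same way.

    const nodemailer = require('nodemailer');

    async function emailReport(filePath) {
      const transporter = nodemailer.createTransport({
        service: 'gmail',
        auth: { user: 'you@gmail.com', pass: 'your-app-password' },  // placeholders
      });

      await transporter.sendMail({
        from: 'you@gmail.com',
        to: 'you@gmail.com',                    // send to yourself for testing
        subject: 'Filtered phones report',
        text: 'XLSX report attached.',
        attachments: [{ path: filePath }],      // attach the generated file from disk
      });
    }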
(Keep this code safe; you will need it one day in a professional project. Personal experience: I have used it many times.)
