Saturday, December 17, 2016

Talend project to parse a webpage (Zacks.com)

I created another interesting Talend project over the weekend. This Talend job parses a zacks.com webpage to extract the Zacks scores and then converts them to rows that can be used in other components. The tHTMLParse component used to parse the website is available for free on Talend's Exchange (marketplace). String manipulation consumed the majority of my time on this project. I intend to extend this project in the future.


Here is the code that goes in the tJavaRow component. It extracts only the ratings from the whole page and returns them as a single string, separated by semicolons.

/* -- Code from https://www.youtube.com/channel/UCT3bqK2QL93j-IFYFYbvjWQ ---- */ 

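// Convert the parsed page to a string and locate the first rating element.
String wholepage = input_row.document.toString();
int pos = wholepage.indexOf("composite_val");
StringBuilder allratingsonly = new StringBuilder();
if (pos >= 0) {
    // Clamp the 250-character window so substring() cannot run past the end of the page.
    int end = Math.min(pos + 250, wholepage.length());
    String ratings = wholepage.substring(pos, end)
        .replaceAll("[\\[\\]\"]", "")
        .replaceAll(" \n", " ")
        .replaceAll(" composite_val_vgm", "");
    // Every piece that follows "composite_val>" starts with a one-letter rating.
    for (String eachratingrow : ratings.split("composite_val>")) {
        if (eachratingrow.length() > 0) {
            // Add the separator only between ratings, so the result has no leading ";".
            if (allratingsonly.length() > 0) allratingsonly.append(";");
            allratingsonly.append(eachratingrow.charAt(0));
        }
    }
}
// An empty string is returned when the page contains no "composite_val" marker.
output_row.document = allratingsonly.toString();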
/* - End of Code from https://www.youtube.com/channel/UCT3bqK2QL93j-IFYFYbvjWQ --*/
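
The snippet above depends on Talend's input_row and output_row, so it cannot be run outside a job. As a rough sketch for testing the string handling on its own, the same steps can be wrapped in a plain Java class. The class name ZacksRatings, the helper names extractRatings and toRows, and the SAMPLE markup are all made up for illustration; the sample is not a copy of the real zacks.com page. This version joins the ratings without a leading semicolon and returns an empty string when the marker is missing.

```java
import java.util.ArrayList;
import java.util.List;

class ZacksRatings {
    // Made-up HTML fragment for illustration only; the real page markup may differ.
    static final String SAMPLE =
        "<p class=\"rank_view\">"
        + "<span class=\"composite_val\">A</span> Value | "
        + "<span class=\"composite_val\">B</span> Growth | "
        + "<span class=\"composite_val\">C</span> Momentum | "
        + "<span class=\"composite_val composite_val_vgm\">B</span> VGM</p>";

    // Hypothetical helper: returns the one-letter ratings joined by ";".
    static String extractRatings(String page) {
        int pos = page.indexOf("composite_val");
        if (pos < 0) {
            return ""; // marker not found
        }
        // Look at no more than 250 characters after the first marker.
        int end = Math.min(pos + 250, page.length());
        String ratings = page.substring(pos, end)
            .replaceAll("[\\[\\]\"]", "")
            .replaceAll(" \n", " ")
            .replaceAll(" composite_val_vgm", "");
        StringBuilder out = new StringBuilder();
        for (String piece : ratings.split("composite_val>")) {
            if (piece.length() > 0) {
                if (out.length() > 0) {
                    out.append(";");
                }
                out.append(piece.charAt(0)); // first character is the rating letter
            }
        }
        return out.toString();
    }

    // Hypothetical helper: turns "A;B;C" into one list entry per rating.
    static List<String> toRows(String ratings) {
        List<String> rows = new ArrayList<>();
        if (ratings.isEmpty()) {
            return rows;
        }
        for (String r : ratings.split(";")) {
            rows.add(r);
        }
        return rows;
    }

    public static void main(String[] args) {
        String ratings = extractRatings(SAMPLE);
        System.out.println(ratings);
        System.out.println(toRows(ratings));
    }
}
```

Inside a Talend job the split into rows would normally be left to a downstream component rather than done in Java; toRows is only there to show what the semicolon-separated string turns into.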