Extract website data in Java using the webscrap4j library

Web scraping (also called web harvesting or web data extraction) is a technique for extracting information from websites. It covers any of the various ways to pull content from a website over HTTP in order to transform that content into another format suitable for use in another context.


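Using a web scraper, you can extract the useful content from a web page and convert it into whatever format you need. With webscrap4j, you start by creating a WebScrap object, setting the URL, and starting a scraping session: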
WebScrap ws = new WebScrap();
// set the URL of the website you want to extract data from
ws.setUrl("http://dasnicdev.github.io/webscrap4j/");
// start the scraping session
ws.startWebScrap();


Now your web scraping session has started, and you are ready to extract data in Java using the webscrap4j library.
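
Under the hood, starting a session presumably means downloading the page's HTML over HTTP. For comparison, here is a rough sketch of that download step using the JDK's own HttpClient (Java 11 or later). This is not webscrap4j's code: the class FetchSketch and its methods are made up for illustration, and the URL is passed on the command line.

```java
import java.io.IOException;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

// Hypothetical helper for illustration only; not part of webscrap4j.
final class FetchSketch {

    /** Builds a plain GET request for the given URL. */
    static HttpRequest buildRequest(String url) {
        return HttpRequest.newBuilder(URI.create(url)).GET().build();
    }

    /** Downloads the page and returns its body as a String. */
    static String fetch(String url) throws IOException, InterruptedException {
        // Follow redirects, since many sites redirect http:// to https://.
        HttpClient client = HttpClient.newBuilder()
                .followRedirects(HttpClient.Redirect.NORMAL)
                .build();
        HttpResponse<String> response =
                client.send(buildRequest(url), HttpResponse.BodyHandlers.ofString());
        return response.body();
    }

    public static void main(String[] args) throws Exception {
        if (args.length == 0) {
            System.out.println("usage: java FetchSketch.java <url>");
            return;
        }
        String html = fetch(args[0]);
        System.out.println(html.length() + " characters downloaded");
    }
}
```

Once the HTML is in a String, every extraction step that follows is just text processing on that String.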

For Title :

System.out.println("-------------------Title-----------------------------");
System.out.println(ws.getSingleHTMLTagData("title"));
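
To get a feel for what a title lookup involves, here is a small plain-Java sketch that pulls the first title element out of an HTML string with a regular expression. It is an assumption about the general idea, not the library's implementation; the class TitleSketch and the sample HTML are invented for this example, and a regex is only a rough stand-in for a real HTML parser.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper for illustration only; not part of webscrap4j.
final class TitleSketch {

    // Matches <title ...>text</title>, ignoring case and spanning line breaks.
    private static final Pattern TITLE = Pattern.compile(
            "<title\\b[^>]*>(.*?)</title>",
            Pattern.CASE_INSENSITIVE | Pattern.DOTALL);

    /** Returns the trimmed text of the first title element, if there is one. */
    static Optional<String> firstTitle(String html) {
        Matcher m = TITLE.matcher(html);
        return m.find() ? Optional.of(m.group(1).trim()) : Optional.empty();
    }

    public static void main(String[] args) {
        // Made-up sample page.
        String html = "<html><head><title> Demo page </title></head><body></body></html>";
        System.out.println(firstTitle(html).orElse("(no title)")); // prints "Demo page"
    }
}
```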

For Tagline :

System.out.println("-------------------Tagline-----------------------------");
System.out.println(ws.getSingleHTMLScriptData("<h2 id='project_tagline'>", "</h2>"));
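
Judging by its arguments, the call above returns whatever sits between a start marker and an end marker. The same idea can be sketched in plain Java with indexOf and substring. The class BetweenSketch, the behaviour on a missing marker (an empty string), and the sample tagline text are all assumptions made for this example, not taken from webscrap4j.

```java
// Hypothetical helper for illustration only; not part of webscrap4j.
final class BetweenSketch {

    /**
     * Returns the trimmed text between the first occurrence of start
     * and the next occurrence of end, or "" if either marker is missing.
     */
    static String between(String html, String start, String end) {
        int from = html.indexOf(start);
        if (from < 0) {
            return "";
        }
        from += start.length();
        int to = html.indexOf(end, from);
        return to < 0 ? "" : html.substring(from, to).trim();
    }

    public static void main(String[] args) {
        // Made-up sample markup.
        String html = "<h1>Demo</h1><h2 id='project_tagline'> A sample tagline </h2>";
        System.out.println(between(html, "<h2 id='project_tagline'>", "</h2>"));
    }
}
```

Note that this kind of matching is literal: the start marker has to match the page source character for character, including the quote style used around the id.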

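For All anchor tags :

System.out.println("-------------------All anchor tag-----------------------------");
// despite its name, getImageTagData is called here with a tag/attribute pair other than img
ArrayList<String> al = ws.getImageTagData("a", "href");
for (String adata : al) {
    System.out.println(adata);
}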
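
Collecting one attribute from every occurrence of a tag can also be sketched in plain Java. The example below uses a regular expression to gather quoted attribute values; it is a simplified illustration under stated assumptions (attributes are quoted, markup is reasonably tidy), not how webscrap4j does it. The class AttributeSketch and the sample HTML are invented for this example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper for illustration only; not part of webscrap4j.
final class AttributeSketch {

    /** Returns the quoted values of attr on every tag element found in html. */
    static List<String> attributeValues(String html, String tag, String attr) {
        // <tag, whitespace, optionally other attributes, then attr = "value" or 'value'
        Pattern p = Pattern.compile(
                "<" + Pattern.quote(tag) + "\\s(?:[^>]*?\\s)?"
                        + Pattern.quote(attr) + "\\s*=\\s*([\"'])(.*?)\\1",
                Pattern.CASE_INSENSITIVE | Pattern.DOTALL);
        List<String> values = new ArrayList<>();
        Matcher m = p.matcher(html);
        while (m.find()) {
            values.add(m.group(2)); // group 1 is the quote character, group 2 the value
        }
        return values;
    }

    public static void main(String[] args) {
        // Made-up sample markup.
        String html = "<a href=\"a.html\">A</a> <a class='x' href='b.html'>B</a>"
                + " <img src=\"p.png\" alt=\"pic\">";
        for (String href : attributeValues(html, "a", "href")) {
            System.out.println(href);
        }
        System.out.println(attributeValues(html, "img", "alt"));
    }
}
```

The same helper works for any tag and attribute pair, for example img with src or alt.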

For Image data :

System.out.println("-------------------Image data-----------------------------");
System.out.println(ws.getImageTagData("img", "src"));
System.out.println(ws.getImageTagData("img", "alt"));

For Ul-Li Data :

System.out.println("-------------------Ul-Li Data-----------------------------");
al = ws.getSingleHTMLScriptData("<ul>", "</ul>", "<li>", "</li>");
for (String str : al) {
    System.out.println(str);
}
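
The four-argument call appears to take an outer pair of markers (the list) and an inner pair (the items). A two-level extraction like that can be sketched in plain Java: first find each ul block, then find the li items inside it. This is an illustration of the idea only, with an invented class ListItemSketch and made-up sample HTML; nested lists are not handled.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper for illustration only; not part of webscrap4j.
final class ListItemSketch {

    private static final int FLAGS = Pattern.CASE_INSENSITIVE | Pattern.DOTALL;
    private static final Pattern UL = Pattern.compile("<ul\\b[^>]*>(.*?)</ul>", FLAGS);
    private static final Pattern LI = Pattern.compile("<li\\b[^>]*>(.*?)</li>", FLAGS);

    /** Returns the trimmed content of every li found inside a ul block. */
    static List<String> listItems(String html) {
        List<String> items = new ArrayList<>();
        Matcher ul = UL.matcher(html);
        while (ul.find()) {
            // Search for items only within the current ul block.
            Matcher li = LI.matcher(ul.group(1));
            while (li.find()) {
                items.add(li.group(1).trim());
            }
        }
        return items;
    }

    public static void main(String[] args) {
        // Made-up sample markup; the ol items are deliberately skipped.
        String html = "<ul><li>One</li><li> Two </li></ul>"
                + "<ol><li>skip</li></ol>"
                + "<ul class='x'><li>Three</li></ul>";
        for (String item : listItems(html)) {
            System.out.println(item);
        }
    }
}
```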

Full Source Code : 


import java.util.ArrayList;

import com.webscrap4j.WebScrap;
// Assumption: WebScrapException lives in the same package as WebScrap.
// The original listing caught it without importing it.
import com.webscrap4j.WebScrapException;

public class CurlTest1 {

    public static void main(String[] args) {
        try {
            ArrayList<String> al = new ArrayList<String>();
            WebScrap ws = new WebScrap();
            // set the URL of the website to extract data from
            ws.setUrl("http://dasnicdev.github.io/webscrap4j/");
            // start the scraping session
            ws.startWebScrap();

            System.out.println("-------------------Title-----------------------------");
            System.out.println(ws.getSingleHTMLTagData("title"));

            System.out.println("-------------------Tagline-----------------------------");
            System.out.println(ws.getSingleHTMLScriptData("<h2 id='project_tagline'>", "</h2>"));

            System.out.println("-------------------All anchor tag-----------------------------");
            al = ws.getImageTagData("a", "href");
            for (String adata : al) {
                System.out.println(adata);
            }

            System.out.println("-------------------Image data-----------------------------");
            System.out.println(ws.getImageTagData("img", "src"));
            System.out.println(ws.getImageTagData("img", "alt"));

            System.out.println("-------------------Ul-Li Data-----------------------------");
            al = ws.getSingleHTMLScriptData("<ul>", "</ul>", "<li>", "</li>");
            for (String str : al) {
                System.out.println(str);
            }
        } catch (WebScrapException e) {
            e.printStackTrace();
        }
    }
}
