automated "screen scrape"

asked 2012-02-01 15:21:20 +0800

dimitribalios

Does anyone know of a strategy to emulate a "screen scrape" by executing some JavaScript to access the XML that is returned by the web server?

11 Replies

answered 2012-02-01 15:29:31 +0800

gganassin
http://www.hybris.com/

Why do you want to struggle with JavaScript when you can simply parse the desktop of a given user on the server side, starting from the first root component?

answered 2012-02-01 18:20:13 +0800

dimitribalios

It is only because I am ignorant of the best way to get the data. I need to log in to a website, pass a parameter, and scrape the results. The page has a .zul file extension.

answered 2012-02-02 03:27:39 +0800

RichardL

Even though the ZUML file is written by the developer in XML, once it has been through the loader the resulting web page is HTML, so it should be possible to scrape it as usual.

answered 2012-02-02 06:28:15 +0800

dimitribalios

One would imagine that everything would be rendered in HTML, but that is not so. Here is the HTML source of the page. I wanted to put up a screenshot of the page to show the content, but I don't see an option on this forum to upload one:


<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<meta http-equiv="Pragma" content="no-cache" />
<meta http-equiv="Expires" content="-1" />
<title></title>

<link rel="stylesheet" type="text/css" href="/apex/zkau/web/5886ac79/zul/css/zk.wcs"/>
<link rel="stylesheet" type="text/css" href="/apex/zkau/web/5886ac79/zebraTheme/img.css.dsp"/>

<script type="text/javascript" src="/apex/zkau/web/5886ac79/js/zk.wpd" charset="UTF-8">
</script>
<script type="text/javascript" src="/apex/zkau/web/5886ac79/js/zul.lang.wpd" charset="UTF-8">
</script>
<!-- ZK 5.0.3 EE 2010062914 -->
</head>
<body>
<div id="z_d__0" class="z-temp"></div>
<script>zkmb();try{zkx(
[0,'z_d__0',{id:'pageHome',dt:'zd_tu91',cu:'/apex',uu:'/apex/zkau',ru:'/capHomeView.zul'},[
['zul.wnd.Window','z_d__1',{id:'homeView',$$onSize:false,$$onMaximize:false,$$onOpen:false,$$onMinimize:false,$$onZIndex:false,$onClose:true,$$onMove:false,width:'100%'},[
['zul.wgt.Include','z_d__2',{$$onSize:false,prolog:'\n\n '},[
['zul.wnd.Window','z_d__3',{$onClose:true,width:'100%'},[
['zk.Native','z_d__4',{prolog:'\n\n <table class="zebra-top-bar" border="0" width="100%" cellspacing="0" cellpadding="0">\n <tr>\n ',epilog:'\n </tr>\n </table>'},[
['zk.Native','z_d__5',{prolog:'<td width="100%" align="right">\n ',epilog:'\n </td>'},[
['zul.wgt.Div','z_d__6',{id:'loginInformation',$$onSize:false,sclass:'zebra-top-bar-info',align:'right'},[
['zul.wgt.Label','z_d__7',{id:'userId',$$onSize:false,prolog:'\n ',value:'User Id : customer07'},[]],
['zul.wgt.Separator','z_d__8',{$$onSize:false,prolog:'\n ',spacing:'5px',orient:'vertical',bar:true},[]],
['zul.wgt.Toolbarbutton','z_d__9',{id:'logout',$onClick:true,$$onSize:false,sclass:'zebra-login-link',prolog:'\n ',label:'Log Out'},[]]]]]],
['zk.Native','z_d__a',{prolog:'\n '},[]],
['zk.Native','z_d__b',{prolog:'<td align="right">\n ',epilog:'\n </td>'},[
['zul.wgt.Div','z_d__c',{sclass:'zebra-top-bar-custlogo'},[
['zul.wgt.Image','z_d__d',{$$onSize:false,prolog:'\n ',src:'/apex/customer_logo.png'},[]]]]]]]],
['zul.menu.Menubar','z_d__e',{$$onSize:false,width:'100%',prolog:'\n '},[
['zul.menu.Menu','z_d__f',{$$onClick:false,$$onSize:false,$$onChange:false,label:'Container'},[
['zul.menu.Menupopup','z_d__g',{visible:false,$$onOpen:false,$$onSize:false},[
['zul.menu.Menuitem','z_d__h',{$onClick:true,$$onSize:false,$$onCheck:false,label:'Containers'},[]]]]]],
['zul.menu.Menu','z_d__i',{label:'Gate'},[
['zul.menu.Menupopup','z_d__j',{visible:false},[
['zul.menu.Menuitem','z_d__k',{$onClick:true,label:'Pre-advised Containers'},[]]]]]],
['zul.menu.Menu','z_d__l',{label:'Vessel'},[
['zul.menu.Menupopup','z_d__m',{visible:false},[
['zul.menu.Menuitem','z_d__n',{$onClick:true,label:'Vessel Visits'},[]]]]]]]]]]]],
['zul.wgt.Div','z_d__o',{id:'content',width:'100%',prolog:'\n\n '},[
['zul.wgt.Include','z_d__p',{prolog:'\n \n\n \n '},[
['zul.wnd.Window','z_d__q',{$onClose:true,width:'100%'},[
['zul.wgt.Div','z_d__r',{sclass:'zebra-welcome',prolog:'\n\n '},[
['zul.wgt.Div','z_d__s',{sclass:'zebra-welcome-inner',prolog:'\n '},[
['zul.wgt.Div','z_d__t',{sclass:'zebra-welcome-logo',prolog:'\n '},[]]]]]]]]]]]]]]]]);
}finally{zkme();}</script>
<noscript>
<div class="noscript"><p>Sorry, JavaScript must be enabled.<br/>Change your browser options, then <a href="">try again</a>.</p></div>
</noscript>

</body>
</html>

answered 2012-02-02 06:51:32 +0800

ashishd

Yes, the HTML DOM is constructed on the client side. The server only sends JSON data and instructions for the ZK client engine to construct the UI on the client side.
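
To illustrate the point with the page source posted above: the visible text (labels, values) travels inside the `zkx(...)` call as widget properties rather than as HTML tags. The following is only a minimal sketch, assuming the payload keeps the `['class','uuid',{props},[children]]` shape seen in that page source and that property objects contain no nested braces. The names `extract_widgets`, `WIDGET_RE` and `PROP_RE` are made up for this example and are not part of ZK.

```python
import re

# Abbreviated copy of three widget entries from the zkx(...) payload above.
PAYLOAD = r"""
['zul.wgt.Label','z_d__7',{id:'userId',$$onSize:false,prolog:'\n ',value:'User Id : customer07'},[]],
['zul.wgt.Toolbarbutton','z_d__9',{id:'logout',$onClick:true,$$onSize:false,sclass:'zebra-login-link',prolog:'\n ',label:'Log Out'},[]],
['zul.menu.Menuitem','z_d__k',{$onClick:true,label:'Pre-advised Containers'},[]]
"""

# One widget entry: ['<widget class>','<uuid>',{<properties>},[
# Assumption: the property object itself contains no "},[" sequence.
WIDGET_RE = re.compile(r"\['([\w.]+)','(\w+)',\{(.*?)\},\[", re.S)

# One string-valued property such as  label:'Log Out'  (tolerates \' escapes).
PROP_RE = re.compile(r"(\w+):'((?:[^'\\]|\\.)*)'")


def extract_widgets(payload):
    """Return a list of {class, uuid, props} dicts found in the payload text."""
    widgets = []
    for cls, uuid, props in WIDGET_RE.findall(payload):
        widgets.append({
            "class": cls,
            "uuid": uuid,
            "props": dict(PROP_RE.findall(props)),  # string properties only
        })
    return widgets


widgets = extract_widgets(PAYLOAD)
for w in widgets:
    text = w["props"].get("label") or w["props"].get("value")
    print(w["uuid"], w["class"], text)
```

A regular expression is a fragile way to read a JavaScript literal. If the real payload is more complex than this sample, a JavaScript-capable client such as a headless browser is likely the more robust route.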

answered 2012-02-02 07:53:45 +0800

dimitribalios

So is it not possible to call or intercept the JSON data programmatically and insert it into a database, for example, since it is not possible to "scrape" the HTML?
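
On the database half of this question: once values have been pulled out of the response, storing them is routine. This is a minimal sketch only, assuming SQLite and a hypothetical `widget_props` table. The two sample rows are copied from the page source posted earlier in the thread, and `store_widget_props` is a name invented for the example.

```python
import sqlite3


def store_widget_props(conn, rows):
    """Insert (uuid, widget_class, name, value) tuples into a hypothetical table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS widget_props "
        "(uuid TEXT, widget_class TEXT, name TEXT, value TEXT)"
    )
    conn.executemany("INSERT INTO widget_props VALUES (?, ?, ?, ?)", rows)
    conn.commit()


# Values taken from the zkx(...) payload shown in the page source.
rows = [
    ("z_d__7", "zul.wgt.Label", "value", "User Id : customer07"),
    ("z_d__9", "zul.wgt.Toolbarbutton", "label", "Log Out"),
]

conn = sqlite3.connect(":memory:")  # in-memory database for the example
store_widget_props(conn, rows)

for row in conn.execute("SELECT uuid, name, value FROM widget_props"):
    print(row)
```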

answered 2012-02-02 08:03:05 +0800

ashishd

<?xml version="1.0" encoding="UTF-8"?>
<?script src="~./js/zk.debug.wpd"?>
 
<div xmlns:w="client">
          <button label="dump dom" w:onClick="zDebug.dumpDomTree(this.parent)"/>
          <button label="dump wgt" w:onClick="zDebug.dumpWidgetTree(this.parent)"/>
</div>

answered 2012-02-02 08:17:46 +0800

dimitribalios

Hi ashish

I cannot change the code of the webpage - so can you explain a bit more how this will achieve my goal?

answered 2012-02-02 10:49:05 +0800

RichardL

Are you developing the pages that you want to be scraped? If so, have you tried configuring "crawlable" in zk.xml?

<system-config>
   <crawlable>true</crawlable>
</system-config>

Other than that, you could build your own native components, which are sent straight to the client as HTML with no accompanying JavaScript.

answered 2012-02-02 12:33:56 +0800

dimitribalios

I have no access to the code of the pages to be scraped.

I know nothing about zk. I saw that the filename extension of the page to be scraped is .zul, and Google brought me to this forum.

I came to this forum looking for some guidance on how to access the data (content to be scraped) as the data is not rendered in plain html.

Since putting up this post I have read up on ZK, and it would appear that ZK has some kind of client control that lives in the browser, and that the data is moved around as JSON.

So my mission is to find out how my Windows service can log in to the page, pass a parameter, and get the JSON data.
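
As a starting point for that mission, here is a rough sketch of the first step only: keep session cookies across requests, fetch a .zul page, and read the `dt` and `uu` fields from its source. In the page source posted above those fields look like the desktop id and the update URI that the client engine talks to, but that reading is an assumption. How the real site handles login and parameters is not known from this thread, so neither is shown. The URL in the comments is a placeholder, and `make_session`, `fetch` and `desktop_info` are names invented for this example.

```python
import re
import urllib.request
from http.cookiejar import CookieJar


def make_session():
    """Return an opener that stores cookies and resends them on later requests."""
    return urllib.request.build_opener(
        urllib.request.HTTPCookieProcessor(CookieJar())
    )


def fetch(opener, url):
    """GET a page through the cookie-aware opener and return its text."""
    with opener.open(url, timeout=30) as resp:
        return resp.read().decode("utf-8", errors="replace")


# Matches the first object in the zkx(...) call, e.g.
# {id:'pageHome',dt:'zd_tu91',cu:'/apex',uu:'/apex/zkau',...
DESKTOP_RE = re.compile(
    r"\{id:'(?P<page>[^']*)',dt:'(?P<dt>[^']*)',"
    r"cu:'(?P<cu>[^']*)',uu:'(?P<uu>[^']*)'"
)


def desktop_info(page_source):
    """Return the page id, dt, cu and uu fields as a dict, or None if absent."""
    match = DESKTOP_RE.search(page_source)
    return match.groupdict() if match else None


# Usage (placeholder host, not a real address):
#   session = make_session()
#   html = fetch(session, "https://example.invalid/apex/capHomeView.zul")
#   print(desktop_info(html))
```

From there, watching the browser's network tab while logging in and submitting the parameter by hand should show which requests the client engine sends to the update URI. Those requests are what the service would need to reproduce.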

Asked: 2012-02-01 15:21:20 +0800

Seen: 389 times

Last updated: May 16 '12
