Attempting new grabber for ontv.dk

thomas.p

Portal Member
February 10, 2009
9
0
Home Country
Denmark Denmark
Out of pure ignorance I decided to try out the task of writing a new WebEPG grabber for ontv.dk since the old one is defunct and ontv.dk seems to be one of the most comprehensive TV-guides in DK.

So far I got the basics going mostly thanks to the guides on the site and snooping in working grabber files. However I need some help to make this work as well as intended.

This is what my XML file looks like so far (the file is attached!):

Code:
<?xml version="1.0" encoding="utf-8"?>

<Grabber>

	<Info language="da" availableDays="14" timezone="W. Europe Standard Time" version="0.0.3" />

	<Channels>
		<Channel id="dr2@dr.dk" siteId="2" />	
	</Channels>

	<Listing type="Html">

		<Site url="http://ontv.dk/tv/[ID]/[YYYY]-[MM]-[DD]" post="" external="false" encoding="" />

		<Html>

			<Template name="default" start="&lt;div class=&quot;content&quot; id=&quot;content&quot;&gt;" end="&lt;tr class=&quot;bottom&quot;&gt;">
				<SectionTemplate tags="TPA">
					<TemplateText>
						&lt;td&gt;&lt;p&gt;&lt;#START&gt;&lt;/p&gt;&lt;/td&gt;
						&lt;td&gt;&lt;p&gt;&lt;a&gt;&lt;#TITLE&gt;&lt;/a&gt;&lt;/p&gt;&lt;/td&gt;
					</TemplateText>
				</SectionTemplate>
			</Template>
			
			<Template name="end" start="&lt;div class=&quot;content&quot; id=&quot;content&quot;&gt;" end="&lt;tr class=&quot;function&quot;&gt;">
				<SectionTemplate tags="THP">
					<TemplateText>
						&lt;td&gt;&lt;h1&gt;&lt;/h1&gt;&lt;p&gt; - &lt;#END&gt; på &lt;br/&gt;&lt;/p&gt;
					</TemplateText>
				</SectionTemplate>
			</Template>
			
			<Template name="subtitle+description+genre" start="&lt;div class=&quot;content&quot; id=&quot;content&quot;&gt;" end="&lt;tr class=&quot;function&quot;&gt;">
				<SectionTemplate tags="HP">
					<TemplateText>
						&lt;h2&gt;&lt;/h2&gt;&lt;h3&gt;&lt;/h3&gt;&lt;p&gt;&lt;br/&gt;&lt;br/&gt;(&lt;#SUBTITLE&gt;)&lt;br/&gt;&lt;br/&gt;&lt;#DESCRIPTION&gt;&lt;/p&gt;&lt;p&gt;&lt;br/&gt;&lt;br/&gt;Type:&lt;/strong&gt; &lt;#GENRE&gt;&lt;/p&gt;
					</TemplateText>
				</SectionTemplate>
			</Template>

			<Sublinks>
				<Sublink search="programinfo" template="end">
					<Link url="http://ontv.dk/[1]" post="" external="false" encoding="" />
				</Sublink>
				<Sublink search="programinfo" template="subtitle+description+genre">
					<Link url="http://ontv.dk/[1]" post="" external="false" encoding="" />
				</Sublink>
			</Sublinks>

		</Html>

	</Listing>

</Grabber>

Getting WebEPG to grab TITLE and START time from the main page was easy as pie, but moving on from there turned out to be more confusing.
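
For reference, the `Site url` pattern in the grabber above can be read as a template whose bracketed placeholders are filled in per channel and per day. Below is a minimal Python sketch of that substitution, assuming `[ID]` takes the channel's `siteId` and that `[MM]` and `[DD]` are zero-padded. The function name `build_listing_url` is made up for illustration and is not part of WebEPG.

```python
from datetime import date


def build_listing_url(template: str, site_id: str, day: date) -> str:
    """Fill the bracketed placeholders of a listing URL template.

    Assumption: [ID] is the channel's siteId and the date parts are
    zero-padded, as the [YYYY]-[MM]-[DD] spelling suggests.
    """
    return (
        template.replace("[ID]", site_id)
        .replace("[YYYY]", f"{day.year:04d}")
        .replace("[MM]", f"{day.month:02d}")
        .replace("[DD]", f"{day.day:02d}")
    )


# Pattern copied from the grabber; siteId "2" is the dr2@dr.dk channel entry.
url = build_listing_url(
    "http://ontv.dk/tv/[ID]/[YYYY]-[MM]-[DD]", "2", date(2009, 2, 10)
)
```

Opening the resulting address in a browser is a quick way to confirm that the page the template was written against is the page actually being requested.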

Each program listed on the main page links to a sub-page with descriptions and other stuff - the content depends on the type of program (this is a problem in itself that I will get back to later). As I tried to add more stuff, my grabber somehow lost the ability to grab anything from the sub-page. This is the problem I am working on right now, but if anybody sees the glaring error in the XML file, please feel free to give me a tip :D

Due to the diversity of content on the sub-pages, I am unsure whether it makes more sense to use several templates to look for different info under different circumstances, or whether much the same can be done with the Searches/Search elements that can be added after Sublinks. Either way, I have not been able to figure out whether it is even possible to have several (as in more than two) templates grabbing info from the web pages. Anybody with any experience in this field, please tip me off or send me a file with an obscure example... anything, please!

While trying to figure out how grabber files work I've been using WebEPG Designer, which helps a lot, but some concepts elude me. The start and end conditions of the templates are confusing me since WebEPG Designer accepts HTML tags whereas the final xml file reports an error in my browser if I copy them directly. Transforming them into the same gibberish used in TemplateText makes the xml file acceptable, but leaves me with no way of testing whether the start and end conditions are actually working...
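
The "gibberish" is ordinary XML entity escaping: inside an attribute value, `<`, `>` and `"` have to be written as `&lt;`, `&gt;` and `&quot;`, and an XML parser turns them back into the raw characters when it reads the file. The Python sketch below converts a raw HTML start condition into its escaped form, then parses a grabber fragment to confirm that it decodes to the same string. The helper names are hypothetical. Whether WebEPG then finds that string in the page is a separate question that this does not test.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape


def to_attribute_text(raw_html: str) -> str:
    """Escape raw HTML so it can sit inside a double-quoted XML attribute."""
    return escape(raw_html, {'"': "&quot;"})


def read_template_bounds(grabber_xml: str, name: str) -> tuple:
    """Return the decoded (start, end) strings of the named <Template>."""
    root = ET.fromstring(grabber_xml)
    for template in root.iter("Template"):
        if template.get("name") == name:
            return template.get("start"), template.get("end")
    raise KeyError(name)


# Raw conditions as they would be typed into a designer tool.
raw_start = '<div class="content" id="content">'
raw_end = '<tr class="bottom">'

# A cut-down grabber fragment built from the escaped values.
fragment = (
    '<Html><Template name="default" '
    f'start="{to_attribute_text(raw_start)}" '
    f'end="{to_attribute_text(raw_end)}" /></Html>'
)

decoded_start, decoded_end = read_template_bounds(fragment, "default")
```

If the decoded values equal the raw HTML, the escaping itself is not the problem, and any remaining mismatch has to come from the page content or from how the conditions are applied.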

I will probably uncover many more problems as I move along, but these are the most annoying issues for now. Hope somebody can lend a hand testing or point me to solutions for some of my problems :)
 

Attachments

  • ontv.xml
    30.6 KB

thomas.p

After analyzing the different layouts of the program description page, it is clear that I need the grabber to try different templates on the same page until one fits or all fail.
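
The "try templates until one fits" idea can be sketched outside WebEPG as an ordered list of patterns, with the most specific layout first. This is only an illustration in Python. The two layouts, the sample pages and the name `parse_program_page` are invented for the sketch and say nothing about how WebEPG itself selects templates.

```python
import re

# Most specific layout first: the "regular" pattern would also match a
# series page, so the order of this list decides which layout is reported.
LAYOUTS = [
    (
        "series",
        re.compile(
            r"<h2>(?P<subtitle>[^<]*)</h2>\s*"
            r"<p>(?P<description>[^<]*)</p>\s*"
            r"<p>Type: (?P<genre>[^<]*)</p>"
        ),
    ),
    (
        "regular",
        re.compile(
            r"<p>(?P<description>[^<]*)</p>\s*"
            r"<p>Type: (?P<genre>[^<]*)</p>"
        ),
    ),
]


def parse_program_page(html: str):
    """Return (layout_name, fields) for the first layout that matches, else None."""
    for name, pattern in LAYOUTS:
        match = pattern.search(html)
        if match:
            return name, match.groupdict()
    return None


# Hypothetical, heavily simplified sub-pages.
series_page = "<h2>Episode 3</h2><p>A family drama.</p><p>Type: Drama</p>"
regular_page = "<h3>Today</h3><p>Evening news.</p><p>Type: News</p>"

series_result = parse_program_page(series_page)
regular_result = parse_program_page(regular_page)
unknown_result = parse_program_page("<div>nothing useful</div>")
```

Keeping each layout as its own complete pattern makes a failure easy to localize: a page that returns `None` matched none of the known layouts, rather than half-matching one large combined expression.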

From the documentation I have found, it is not clear how <Template>s are meant to work in conjunction with <Sublinks>. I have been able to make one <Sublink> call a second <Template> (apart from the default one) to check out the secondary page with the program description. However, it is not clear whether I should make a <Sublink> for each template I want to try out on the page, with a <Template> to go with it, or whether I should simply add another <SectionTemplate> to the <Template> already in use.

My experiments have all been inconclusive in the sense that I get failed readings in all cases where the secondary <Template> should have been attempted. Something is wrong :oops:

If anybody has a clue how WebEPG makes use of the different parts of the grabber (and hence how I should write the grabber to be usable in a variable environment), please drop a few words explaining. It would be nice to move from obscurity to only slightly incomprehensible...
 

thomas.p

Slightly incomprehensible has been achieved! Progress on the grabber is slow but steady - it helped when I found more documentation that I had not discovered the first time around.

At this point I have identified the page structure that I need in order to read each page. However, this shows that "regular programs" and series/movies have very different page layouts. I have successfully made a working template for each type (attached below) but have failed at integrating them over and over. Any help testing those two grabbers and identifying problems would be greatly appreciated!

The HTML structure I need to search looks like this:

Code:
<div><table><tr><td><table><tr><td><h1><#TITLE></h1><p>kl. <#START> - <#END> på</p>

<!-- BEGIN SERIES/MOVIE -->
<z(>
<table><tr><td><p></p></td><td></td><td><p></p></td><td></td><td><p></p></td><td></td><td><p></p></td></tr></table><div></div><div>

<!-- BEGIN MAIN INFO -->
<z(>
<z(><h2></h2></z)?>
<z(><h3></h3></z)?>
<z(><p><#DESCRIPTION></p></z)?>
<!-- END MAIN INFO -->

<!-- BEGIN EXTRA INFO -->
<z(>
<z(><h3></h3></z)?>
<z(><p></p></z)?></z)?>
</z)?>
<!-- END EXTRA INFO -->
</z)?>

<p><#GENRE:Type: ,></p>
</div><div></div><div></div><div></div>
</z)?>
<!-- END SERIES/MOVIE -->

<!-- BEGIN REGULAR PROGRAMS -->
<z(>
<z(><h3></h3></z)?>
<z(><p><#DESCRIPTION></p></z)?>
<p><#GENRE:Type: ,></p>
</z)?>
<!-- END REGULAR PROGRAMS -->

</td>

There is a bunch of <div>, <table>/<tr>/<td>, <p> and <h> tags that can be navigated. Under some circumstances there are also a few <img> tags, but those seem too unreliable to be of use. I am guessing that this overall layout is rendered unusable in its entirety (hence the two different grabbers) by the many <z> tags I try to apply. It is an unfortunate consequence of my lack of knowledge about the regular expressions they put to use. So I could really use some guidance in the use of <z> tags and the arcane art of regular expressions in the context of grabber files.
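
With this many nested `<z(>` ... `</z)?>` pairs, one thing that can be checked mechanically is whether every opener has a closer. Below is a small Python sketch of such a check, assuming (from how they are used in the structure above) that `<z(>` opens an optional group and `</z)?>` closes it. The function name `check_z_balance` is hypothetical.

```python
import re

# Matches either the opener "<z(>" or the closer "</z)?>".
Z_TOKEN = re.compile(r"<z\(>|</z\)\?>")


def check_z_balance(template: str) -> list:
    """Return a list of human-readable nesting problems (empty if balanced)."""
    depth = 0
    problems = []
    for line_no, line in enumerate(template.splitlines(), start=1):
        for token in Z_TOKEN.finditer(line):
            if token.group() == "<z(>":
                depth += 1
            else:
                depth -= 1
                if depth < 0:
                    problems.append(f"line {line_no}: closer without opener")
                    depth = 0  # reset so later lines are still checked
    if depth > 0:
        problems.append(f"{depth} opener(s) never closed")
    return problems


balanced = check_z_balance("<z(><h3></h3></z)?>")
extra_closer = check_z_balance("<z(>\n<p></p>\n</z)?>\n</z)?>")
missing_closer = check_z_balance("<z(><z(></z)?>")
```

Counting by hand, the structure posted above seems to contain eleven openers and twelve closers, so it may be worth running a check like this on it before looking for subtler regular expression issues.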

Also, I am still wondering whether a <Template> can make use of more than one <SectionTemplate> and how that is achieved. If that could work, I would probably not have to delve too deeply into regular expressions to achieve a single working grabber file. But frankly speaking, I am clueless as to whether such features exist, so if anybody reading this knows anything about it, please speak up.

I am thrilled that people are even reading this thread, but I could really use some feedback on this project just about now. :D
 

Attachments

  • ontv_regular.xml
    30.6 KB
  • ontv_series&movies.xml
    30.7 KB

thomas.p

I have been looking into the use of regular expressions as per the <z> tag some more since my last post. They are beginning to make some sense, but I am still having trouble getting my experiments to take off. In the grabber I can test the different page layouts individually, and that works fine. But when I apply <z> tags, weird stuff happens and few things work as expected. I suspect some of my problems are due to the "greedy" nature of the regular expression algorithm; however, considering my prior experience with the subject matter, pretty much anything could be going on right under my nose without me knowing.

At the moment I am directing my attention in two different directions. With one template I seek to describe each possible page layout in its entirety and make the grabber choose between them. The other is an attempt to generalize the problematic layouts. The first is not very elegant, but easy to decipher for future editing, whereas the other is pretty much the opposite of that. However both are "supposed" to work :(

Basically, using the DTPH tags, either approach seems to apply the wrong page layout consistently. This confuses me a great deal, since I make the grabber attempt the layouts with the greatest complexity first and then gradually lessen it. From what I have read about regular expressions, I was under the impression that doing so would counter their greediness and all would be close to Nirvana. Apparently not so...
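
Greediness is a property of the individual quantifier rather than of the order in which layouts are tried, so putting the most complex layout first does not by itself stop a capture from running too far. The Python sketch below shows the difference on an invented two-paragraph page. It assumes WebEPG hands its templates to a conventional backtracking regex engine, which is not confirmed here, and uses Python's `re` module purely for illustration.

```python
import re

# Hypothetical, simplified sub-page with two consecutive paragraphs.
PAGE = "<p>Series description</p><p>Type: Drama</p>"

# Greedy: ".*" first runs to the end of the input and then backs off
# to the LAST "</p>", so the capture swallows the tags in between.
greedy = re.search(r"<p>(.*)</p>", PAGE).group(1)

# Lazy: ".*?" grows one character at a time and stops at the FIRST "</p>".
lazy = re.search(r"<p>(.*?)</p>", PAGE).group(1)

# Negated class: "[^<]*" cannot cross a tag boundary at all, so the
# result does not depend on greediness.
no_tags = re.search(r"<p>([^<]*)</p>", PAGE).group(1)
```

Here `greedy` holds `Series description</p><p>Type: Drama`, while `lazy` and `no_tags` both hold `Series description`. A field that comes back with HTML tags in it is one symptom that would be consistent with a greedy capture like the first one.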

I could use some suggestions for what to try next :D
 

Attachments

  • ontv_decipherable.xml
    30.6 KB
  • ontv_elegant.xml
    30.6 KB

thomas.p

Thanks for the 100-and-some views you guys :)
It's nice to know that there is at least some sort of interest in what I am up to, even though none of you seem to have much to say about it.
No progress has been made since my last post. I am stuck at the manner in which WebEPG handles regular expressions. I have tried modifying the grabber file with little success.
All works fine if the proper layout is the one it searches through first, but the grabber fails or returns HTML tags in all other cases. Can anybody confirm this?
Either way, I have too little experience with this stuff to determine whether the grabber is flawed or WebEPG has a bug in the handling of regular expressions. I am unsure how to proceed since no suggestions for improvement or responses to my calls for otherwise competent help have surfaced...
 

James

Retired Team Member
  • Premium Supporter
  • May 6, 2005
    1,385
    67
    Switzerland
    Sorry Thomas for not replying sooner. I have just been crazy busy in the last while.

    At the moment the system only really supports two templates: a main one and a link-page one. The file format can support multiple templates, and I had planned for this, but I never got around to adding support for it, and until now no one really asked for it ;-)

    I will look at the code and see whether support for multiple templates is a lot of work, and I will also look at the regex stuff.

    Cheers,

    /James
     

    thomas.p

    This is great :D

    Looking forward to getting some feedback once you've had an opportunity to look into my work.
     
