[EPG] [WebEPG] why not just regex

benjerry · June 20, 2010

Hi,

I was wondering why WebEPG doesn't just use full regular expressions for template matching of TV programs, and wherever else possible.

The current system is perhaps more user-friendly(?), but it is limited when things get complicated.

It's possible to do it the full regex way by creating a template.

Example:

Partial HTML code from a Dutch TV guide (TVGids.nl - Zoeken):

<div class="program">
<a href="/programma/9519408/Nederland_in_Beweging%21/">
<span class="time">08:45 - 09:00</span>
<span class="title">Nederland in Beweging!</span>
<span class="channel">Nederland 1</span>
</a>

The template could look like this:

TemplateProgram =
<div class="program">[^<]*
<a href="(?<SUBLINK>[^"]*)">[^<]*
<span class="time">(?<START>[^\s]*) - (?<END>[^<]*)</span>[^<]*
<span class="title">(?<TITLE>[^<]*)</span>[^<]*
<span class="channel">(?<CHANNEL>[^<]*)</span>[^<]*
</a>

Some sample source code:

// TemplateProgram holds the template above as a single string.
Regex ProgramSearch = new Regex(TemplateProgram);

// Each match corresponds to one <div class="program"> block on the page.
MatchCollection ProgramMatches = ProgramSearch.Matches(HtmlPageText);

foreach (Match ProgramMatch in ProgramMatches)
{
    // The named groups in the template map directly onto the program fields.
    sublink = ProgramMatch.Groups["SUBLINK"].Value;
    starttime = ProgramMatch.Groups["START"].Value;
    endtime = ProgramMatch.Groups["END"].Value;
    title = ProgramMatch.Groups["TITLE"].Value;
    extrafields.Add("CHANNEL", ProgramMatch.Groups["CHANNEL"].Value);
}

gr,
Gijs

arion_p · June 21, 2010

Regular expressions can get quite complex and difficult to debug. As a result they are more error-prone, and building a working grabber requires more experience, even for simple sites.

It is also very hard to filter out HTML noise with them. I would much rather see an XSL/XPath implementation, which is more structured. Of course any such implementation should be in addition to what already exists.

benjerry · June 22, 2010

arion_p said:
Regular expressions can get quite complex and difficult to debug. As a result they are more error-prone, and building a working grabber requires more experience, even for simple sites.

It is also very hard to filter out HTML noise with them. I would much rather see an XSL/XPath implementation, which is more structured. Of course any such implementation should be in addition to what already exists.

Perhaps it could be added like this?

<Template name="default" matchtype="quite_complex_and_difficult_to_debug">
or
<Template name="default" matchtype="Xctra">

My programming knowledge is about 10 years old (but with 20 years of experience), so XSL/XPath are new techniques for me. I'll do some reading.
Thanks for the pointer.

benjerry · June 22, 2010

I've taken a very quick look into it. I guess you mean the XSL Transformations (XSLT) + XPath languages.

XSLT can transform an XML document into another XML document. It uses XPath to locate nodes inside the input.

So this

<div class="program">
<a href="/programma/9519408/Nederland_in_Beweging%21/">
<span class="time">08:45 - 09:00</span>
<span class="title">Nederland in Beweging!</span>
<span class="channel">Nederland 1</span>
</a>

could be transformed to something like this

<site>
<programlist channelid="Nederland 1">
<program>
<title>Nederland in Beweging!</title>
<starttime>08:45</starttime>
<endtime>09:00</endtime>
<sublinks>
<link url="/programma/9519408/Nederland_in_Beweging%21/" stylesheet="sublink_ned1.xsl" />
</sublinks>
</program>
<program>
..
</program>
</programlist>
<programlist channelid="Nederland 2">
<program>
...
</program>
</programlist>
</site>

.. using an XSLT stylesheet.

- Support for multiple channels on one page would be nice, but I don't see it happening inside the current WebEPG.

- Sublink(s) would require separate stylesheet(s). The context is inside a single program: <program> .. </program>

- Wouldn't it also be nice if the site URL building/grabbing control were defined in a file instead of being fixed in the program source?
With the date, channel ID, and grab statistics as input, could a stylesheet generate the next grab URL?
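
A minimal sketch of the transformation described above, using the .NET XslCompiledTransform class. The stylesheet, the class name XsltSketch and the method name Transform are illustrative assumptions, not part of WebEPG. It also assumes the input is already well-formed XML (the sample div is closed here), and it emits the channel as a plain element rather than grouping programs per channel, which in XSLT 1.0 would need extra grouping logic.

```csharp
using System.IO;
using System.Xml;
using System.Xml.Xsl;

public static class XsltSketch
{
    // Hypothetical stylesheet: maps each <div class="program"> to one <program> element.
    const string Stylesheet = @"<xsl:stylesheet version=""1.0"" xmlns:xsl=""http://www.w3.org/1999/XSL/Transform"">
  <xsl:output method=""xml"" omit-xml-declaration=""yes"" indent=""no""/>
  <xsl:template match=""/"">
    <site>
      <xsl:for-each select=""//div[@class='program']"">
        <program>
          <title><xsl:value-of select=""a/span[@class='title']""/></title>
          <starttime><xsl:value-of select=""substring-before(a/span[@class='time'], ' - ')""/></starttime>
          <endtime><xsl:value-of select=""substring-after(a/span[@class='time'], ' - ')""/></endtime>
          <channel><xsl:value-of select=""a/span[@class='channel']""/></channel>
          <sublinks><link url=""{a/@href}""/></sublinks>
        </program>
      </xsl:for-each>
    </site>
  </xsl:template>
</xsl:stylesheet>";

    // Applies the stylesheet to a well-formed XHTML fragment and returns the resulting XML text.
    public static string Transform(string xhtml)
    {
        var xslt = new XslCompiledTransform();
        using (XmlReader stylesheetReader = XmlReader.Create(new StringReader(Stylesheet)))
        {
            xslt.Load(stylesheetReader);
        }

        var output = new StringWriter();
        using (XmlReader input = XmlReader.Create(new StringReader(xhtml)))
        using (XmlWriter writer = XmlWriter.Create(output, xslt.OutputSettings))
        {
            xslt.Transform(input, writer);
        }
        return output.ToString();
    }
}
```

Splitting the time span with substring-before and substring-after relies on the " - " separator seen in the sample; a site with a different time layout would need a different expression.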

benjerry · July 5, 2010

As an HTML page may not be valid XHTML, this will be a big problem as a source document.

However, it could be pre-processed by a project called HTML Tidy.

HTML Tidy Project Page

I also found something called SgmlReader.

SGMLReader - Converting almost any HTML to valid XML - MindTouch Community Portal

benjerry · July 6, 2010

The .NET XSLT parser is stumbling over this piece of HTML code:

HTML:

<script type="text/javascript">
    var WlWebsiteId = "tvgids.nl";

    
    if (typeof(wlrcmd) == 'undefined') var wlrcmd = '';
    document.write('<scri' + 'pt type="text/javascript" src="http://rc.bt.ilsemedia.nl/Tag/ilsemedia/JS/' + WlWebsiteId + '/Gt.js"><\/scri' + 'pt>');
    
</script>

This part seems to be the problem: '<scri' + 'pt type="text/javascript"

Perhaps a filter option for script tags is needed.
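
One way such a filter could work is to strip script blocks from the page text before it reaches the XML parser. The sketch below is only an illustration: the class name ScriptFilter and the method name StripScripts are made up, and a regex like this is a rough pre-filter, not a full HTML parser.

```csharp
using System.Text.RegularExpressions;

public static class ScriptFilter
{
    // Matches an opening <script ...> tag, everything after it (lazily), and the first closing </script>.
    // Singleline lets '.' cross line breaks; IgnoreCase also covers <SCRIPT>.
    static readonly Regex ScriptBlock = new Regex(
        @"<script\b[^>]*>.*?</script\s*>",
        RegexOptions.IgnoreCase | RegexOptions.Singleline);

    // Returns the page text with all script blocks removed.
    public static string StripScripts(string html)
    {
        return ScriptBlock.Replace(html, string.Empty);
    }
}
```

With the snippet above, the split '<scri' + 'pt' strings sit inside the removed block, so the parser never sees them.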

arion_p · July 11, 2010

Give this a try: Html Agility Pack
I've used it in the past with much better results than trying to tidy up the HTML and then using the normal .NET XSLT parser.
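
A rough sketch of how the sample markup from this thread could be read with Html Agility Pack, which parses real-world HTML directly and supports XPath queries. It requires the HtmlAgilityPack library; the class name AgilityPackSketch, the method name ExtractPrograms and the dictionary keys are assumptions chosen for illustration.

```csharp
using System;
using System.Collections.Generic;
using HtmlAgilityPack;

public static class AgilityPackSketch
{
    // Returns the decoded, trimmed text of a node, or an empty string if the node is missing.
    static string Text(HtmlNode node)
    {
        return node == null ? string.Empty : HtmlEntity.DeEntitize(node.InnerText).Trim();
    }

    // Extracts one dictionary per <div class="program"> block found in the page.
    public static List<Dictionary<string, string>> ExtractPrograms(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html); // tolerant of HTML that is not well-formed XML

        var programs = new List<Dictionary<string, string>>();
        HtmlNodeCollection nodes = doc.DocumentNode.SelectNodes("//div[@class='program']");
        if (nodes == null) return programs; // SelectNodes returns null when nothing matches

        foreach (HtmlNode div in nodes)
        {
            HtmlNode link = div.SelectSingleNode(".//a");
            // The time span looks like "08:45 - 09:00" in the sample.
            string[] times = Text(div.SelectSingleNode(".//span[@class='time']"))
                .Split(new[] { " - " }, StringSplitOptions.None);

            var program = new Dictionary<string, string>();
            program["SUBLINK"] = link == null ? string.Empty : link.GetAttributeValue("href", string.Empty);
            program["START"] = times.Length > 0 ? times[0] : string.Empty;
            program["END"] = times.Length > 1 ? times[1] : string.Empty;
            program["TITLE"] = Text(div.SelectSingleNode(".//span[@class='title']"));
            program["CHANNEL"] = Text(div.SelectSingleNode(".//span[@class='channel']"));
            programs.Add(program);
        }
        return programs;
    }
}
```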
