[EPG] [WebEPG] why not just regex

benjerry

MP Donator
  • Premium Supporter
  • September 26, 2007
    167
    10
    Home Country
    Netherlands
    Hi,

    I was wondering why WebEPG doesn't just use full regular expressions for matching TV programs in templates, and wherever else possible.

    The current system is perhaps more user-friendly(?), but it is limited when things get complicated.

    It's possible to go the full regex way by writing the template itself as a regular expression.

    Example:

    Part of the HTML from a Dutch TV guide (TVGids.nl - Zoeken):

    <div class="program">
    <a href="/programma/9519408/Nederland_in_Beweging%21/">
    <span class="time">08:45 - 09:00</span>
    <span class="title">Nederland in Beweging!</span>
    <span class="channel">Nederland 1</span>
    </a>

    The template could look like this:

    TemplateProgram =
    <div class="program">[^<]*
    <a href="(?<SUBLINK>[^"]*)">[^<]*
    <span class="time">(?<START>[^\s]*) - (?<END>[^<]*)</span>[^<]*
    <span class="title">(?<TITLE>[^<]*)</span>[^<]*
    <span class="channel">(?<CHANNEL>[^<]*)</span>[^<]*
    </a>

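    Some sample source code:

    // TemplateProgram holds the regex template above; HtmlPageText holds the downloaded page.
    // sublink, starttime, endtime, title and extrafields are assumed to be declared elsewhere.
    Regex ProgramSearch = new Regex(TemplateProgram);

    MatchCollection ProgramMatches = ProgramSearch.Matches(HtmlPageText);

    foreach (Match ProgramMatch in ProgramMatches)
    {
        // Each named group in the template maps directly onto one program field.
        sublink = ProgramMatch.Groups["SUBLINK"].Value;
        starttime = ProgramMatch.Groups["START"].Value;
        endtime = ProgramMatch.Groups["END"].Value;
        title = ProgramMatch.Groups["TITLE"].Value;
        extrafields.Add("CHANNEL", ProgramMatch.Groups["CHANNEL"].Value);
    }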

    gr,
    Gijs
     

    arion_p

    Retired Team Member
  • Premium Supporter
  • February 7, 2007
    3,373
    1,626
    Athens
    Home Country
    Greece
    Regular expressions can get quite complex and difficult to debug. As a result, they are more error-prone, and it takes more experience to build a working grabber even for simple sites.

    Also, it is very hard to filter out HTML noise. I would much rather see an XSL/XPath implementation, which is more structured. Of course, any such implementation should be in addition to what already exists.
     

    benjerry

    arion_p said:
    "Regular expressions can get quite complex and difficult to debug. As a result it is more error prone and requires more experience to build a working grabber even for simple sites.

    Also it is very hard to filter out HTML noise. I would much rather see an XSL/XPath implementation which is more structured. Of course any such implementation should be in addition to what already exists."

    Perhaps it could be added like this? :D

    <Template name="default" matchtype="quite_complex_and_difficult_to_debug">
    or
    <Template name="default" matchtype="Xctra">

    My programming knowledge is about 10 years old (though I have 20 years of experience), so XSL/XPath are new techniques for me again. I'll do some reading.
    Thanks for the pointer. :)
     

    benjerry

    I've taken a quick look into it. I guess you mean the XSLT (XSL Transformations) + XPath languages.

    XSLT can transform an XML document into another XML document. It uses XPath to locate nodes inside the input.

    So this

    <div class="program">
    <a href="/programma/9519408/Nederland_in_Beweging%21/">
    <span class="time">08:45 - 09:00</span>
    <span class="title">Nederland in Beweging!</span>
    <span class="channel">Nederland 1</span>
    </a>

    could be transformed to something like this

    <site>
    <programlist channelid="Nederland 1">
    <program>
    <title>Nederland in Beweging!</title>
    <starttime>08:45</starttime>
    <endtime>09:00</endtime>
    <sublinks>
    <link url="/programma/9519408/Nederland_in_Beweging%21/" stylesheet="sublink_ned1.xsl" />
    </sublinks>
    </program>
    <program>
    ..
    </program>
    </programlist>
    <programlist channelid="Nederland 2">
    <program>
    ...
    </program>
    </programlist>
    </site>

    ... using an XSLT stylesheet.
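A minimal sketch of such a transformation in .NET, using `XslCompiledTransform`. The stylesheet, the class name `XsltSketch` and the simplified output layout are assumptions made for illustration, not part of WebEPG. The input is wrapped in a well-formed document with a closing `</div>`, because the XSLT processor only accepts well-formed XML. Grouping programs per channel into `<programlist>` elements is left out; the channel is emitted as an attribute on each `<program>` instead.

```csharp
using System;
using System.IO;
using System.Xml;
using System.Xml.Xsl;

static class XsltSketch
{
    // Hypothetical stylesheet: one <program> element per <div class="program">.
    const string Stylesheet = @"<xsl:stylesheet version=""1.0""
    xmlns:xsl=""http://www.w3.org/1999/XSL/Transform"">
  <xsl:output method=""xml"" indent=""yes"" omit-xml-declaration=""yes""/>
  <xsl:template match=""/"">
    <site>
      <xsl:for-each select=""//div[@class='program']"">
        <program channelid=""{a/span[@class='channel']}"">
          <title><xsl:value-of select=""a/span[@class='title']""/></title>
          <starttime><xsl:value-of select=""substring-before(a/span[@class='time'], ' - ')""/></starttime>
          <endtime><xsl:value-of select=""substring-after(a/span[@class='time'], ' - ')""/></endtime>
          <sublinks><link url=""{a/@href}""/></sublinks>
        </program>
      </xsl:for-each>
    </site>
  </xsl:template>
</xsl:stylesheet>";

    // The fragment from the post, made well-formed (assumption: no XHTML namespace).
    const string Page = @"<html><body>
<div class=""program"">
<a href=""/programma/9519408/Nederland_in_Beweging%21/"">
<span class=""time"">08:45 - 09:00</span>
<span class=""title"">Nederland in Beweging!</span>
<span class=""channel"">Nederland 1</span>
</a>
</div>
</body></html>";

    // Applies the stylesheet to the XML text and returns the result as a string.
    public static string Transform(string xml, string stylesheet)
    {
        var xslt = new XslCompiledTransform();
        using (var styleReader = XmlReader.Create(new StringReader(stylesheet)))
        {
            xslt.Load(styleReader);
        }

        var output = new StringWriter();
        using (var input = XmlReader.Create(new StringReader(xml)))
        using (var writer = XmlWriter.Create(output, xslt.OutputSettings))
        {
            xslt.Transform(input, writer);
        }
        return output.ToString();
    }

    static void Main()
    {
        Console.WriteLine(Transform(Page, Stylesheet));
    }
}
```

A real page that declares the XHTML namespace would need a namespace prefix in the XPath expressions, and a page that is not well-formed XML would have to be cleaned up before it reaches the transform.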

    - Support for multiple channels on one page would be nice, but I don't see it happening inside the current WebEPG.

    - Sublinks would require separate stylesheets. The context is a single program: <program> .. </program>

    - Wouldn't it also be nice if the site URL building and grab control were defined in a file instead of being fixed in the program source?
    With the date, channel id and grab statistics as input, a stylesheet could generate the next grab URL.
     

    benjerry

    The .NET XSLT parser is stumbling over this piece of HTML:

    HTML:
    <script type="text/javascript">
        var WlWebsiteId = "tvgids.nl";
    
        
        if (typeof(wlrcmd) == 'undefined') var wlrcmd = '';
        document.write('<scri' + 'pt type="text/javascript" src="http://rc.bt.ilsemedia.nl/Tag/ilsemedia/JS/' + WlWebsiteId + '/Gt.js"><\/scri' + 'pt>');
        
    </script>

    This part seems to be the problem: '<scri' + 'pt type="text/javascript"

    Perhaps we need a filter option for script tags.
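A rough sketch of what such a filter could look like: strip every `<script>` block from the page text before it is handed to the XML/XSLT parser. The class and method names (`ScriptFilter`, `StripScripts`) are hypothetical and not part of WebEPG. The regex is a simplification that assumes the literal text `</script>` never appears inside the script body itself, which holds for the snippet above because the page escapes and splits it as `<\/scri' + 'pt>`.

```csharp
using System;
using System.Text.RegularExpressions;

static class ScriptFilter
{
    // Matches an opening <script ...> tag, the shortest possible body,
    // and the first closing </script> tag. Singleline lets '.' span line breaks.
    private static readonly Regex ScriptBlock = new Regex(
        @"<script\b[^>]*>.*?</script\s*>",
        RegexOptions.IgnoreCase | RegexOptions.Singleline);

    // Returns the page text with all script blocks removed.
    public static string StripScripts(string html)
    {
        return ScriptBlock.Replace(html, string.Empty);
    }

    static void Main()
    {
        string html =
            "<div>before</div>\n" +
            "<script type=\"text/javascript\">\n" +
            "document.write('<scri' + 'pt src=\"x.js\"><\\/scri' + 'pt>');\n" +
            "</script>\n" +
            "<div>after</div>";

        Console.WriteLine(StripScripts(html));
    }
}
```

Removing scripts only takes care of this one stumbling block; the rest of the page still has to be well-formed XML before an XSLT processor will accept it.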
     
