[EPG] - [WebEPG] why not just regex

Discussion in 'Improvement Suggestions' started by benjerry, June 20, 2010.

  1. benjerry
    • Premium Supporter

    benjerry MP Donator

    Joined:
    September 26, 2007
    Messages:
    167
    Likes Received:
    10
    Ratings:
    +10 / 0
    Home Country:
    Netherlands Netherlands
    Hi,

    I was wondering why WebEPG doesn't just use full regular expressions for template matching of TV programs, and wherever else possible.

    The current system is perhaps more user-friendly(?), but it is limited when things get complicated.

    It's possible to go the full-regex way by creating a template.

    Example:

    Partial HTML from a Dutch TV guide (TVGids.nl - Zoeken):

    <div class="program">
    <a href="/programma/9519408/Nederland_in_Beweging%21/">
    <span class="time">08:45 - 09:00</span>
    <span class="title">Nederland in Beweging!</span>
    <span class="channel">Nederland 1</span>
    </a>

    template could be like this:

    TemplateProgram =
    <div class="program">[^<]*
    <a href="(?<SUBLINK>[^"]*)">[^<]*
    <span class="time">(?<START>[^\s]*) - (?<END>[^<]*)</span>[^<]*
    <span class="title">(?<TITLE>[^<]*)</span>[^<]*
    <span class="channel">(?<CHANNEL>[^<]*)</span>[^<]*
    </a>

    Some sample source code:

    // Compile the template into a regex.
    Regex ProgramSearch = new Regex(TemplateProgram);

    // Find every program block in the downloaded page.
    MatchCollection ProgramMatches = ProgramSearch.Matches(HtmlPageText);

    foreach (Match ProgramMatch in ProgramMatches)
    {
        // Read each named group defined in the template.
        sublink = ProgramMatch.Groups["SUBLINK"].Value;
        starttime = ProgramMatch.Groups["START"].Value;
        endtime = ProgramMatch.Groups["END"].Value;
        title = ProgramMatch.Groups["TITLE"].Value;
        extrafields.Add("CHANNEL", ProgramMatch.Groups["CHANNEL"].Value);
    }
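
For anyone who wants to try the idea outside WebEPG, here is a minimal self-contained sketch of the same approach. The class name `RegexGrabberSketch` and the method `Grab` are made up for illustration and are not part of WebEPG. It assumes the template lines are joined into one pattern without the line breaks, so that the `[^<]*` parts alone absorb whatever whitespace sits between the tags.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical helper, not WebEPG code: applies the template from the post.
public static class RegexGrabberSketch
{
    // The template lines joined into a single pattern (no line breaks).
    private static readonly Regex ProgramSearch = new Regex(
        @"<div class=""program"">[^<]*" +
        @"<a href=""(?<SUBLINK>[^""]*)"">[^<]*" +
        @"<span class=""time"">(?<START>[^\s]*) - (?<END>[^<]*)</span>[^<]*" +
        @"<span class=""title"">(?<TITLE>[^<]*)</span>[^<]*" +
        @"<span class=""channel"">(?<CHANNEL>[^<]*)</span>[^<]*" +
        @"</a>");

    private static readonly string[] GroupNames =
        { "SUBLINK", "START", "END", "TITLE", "CHANNEL" };

    // Returns one dictionary of named-group values per matched program.
    public static List<Dictionary<string, string>> Grab(string htmlPageText)
    {
        var programs = new List<Dictionary<string, string>>();
        foreach (Match programMatch in ProgramSearch.Matches(htmlPageText))
        {
            var fields = new Dictionary<string, string>();
            foreach (string name in GroupNames)
            {
                fields[name] = programMatch.Groups[name].Value;
            }
            programs.Add(fields);
        }
        return programs;
    }
}
```

Calling `RegexGrabberSketch.Grab(HtmlPageText)` on the snippet above should give one entry whose `TITLE` is `Nederland in Beweging!`. Any change in attribute order or an extra tag inside the block makes the pattern miss that program silently, so a real grabber would want to log pages with zero matches.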

    gr,
    Gijs
     
  3. arion_p
    • Team MediaPortal

    arion_p Retired Team Member

    Joined:
    February 7, 2007
    Messages:
    3,352
    Likes Received:
    1,447
    Occupation:
    Developer
    Location:
    Athens
    Ratings:
    +1,522 / 0
    Home Country:
    Greece Greece
    Regular expressions can get quite complex and difficult to debug. As a result, a regex-based grabber is more error-prone, and building a working one requires more experience, even for simple sites.

    It is also very hard to filter out HTML noise with regexes. I would much rather see an XSL/XPath implementation, which is more structured. Of course, any such implementation should be in addition to what already exists.
     
  4. benjerry
    Perhaps it could be added like this? :D

    <Template name="default" matchtype="quite_complex_and_difficult_to_debug">
    or
    <Template name="default" matchtype="Xctra">

    My programming knowledge is about 10 years out of date (though I have 20 years of experience), so XSL/XPath are new techniques for me again. I'll do some reading.
    Thanks for the pointer. :)
     
  5. benjerry
    I've had a quick look into it. I guess you mean XSL Transformations (XSLT) plus XPath.

    XSLT can transform one XML document into another XML document. It uses XPath to locate nodes inside the input.

    So this

    <div class="program">
    <a href="/programma/9519408/Nederland_in_Beweging%21/">
    <span class="time">08:45 - 09:00</span>
    <span class="title">Nederland in Beweging!</span>
    <span class="channel">Nederland 1</span>
    </a>

    could be transformed to something like this

    <site>
    <programlist channelid="Nederland 1">
    <program>
    <title>Nederland in Beweging!</title>
    <starttime>08:45</starttime>
    <endtime>09:00</endtime>
    <sublinks>
    <link url="/programma/9519408/Nederland_in_Beweging%21/" stylesheet="sublink_ned1.xsl" />
    </sublinks>
    </program>
    <program>
    ..
    </program>
    </programlist>
    <programlist channelid="Nederland 2">
    <program>
    ...
    </program>
    </programlist>
    </site>

    ... using an XSLT stylesheet.

    - Support for multiple channels on one page would be nice, but I don't see it happening inside the current WebEPG.

    - Sublinks would require separate stylesheets. Their context is a single program: <program> .. </program>

    - Wouldn't it also be nice if the site URL building and grab control were defined in a file instead of being fixed in the program source?
    Given the date, channel id and grab statistics as input, a stylesheet could generate the next grab URL.
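
A rough sketch of what such a transformation could look like in .NET, using `XslCompiledTransform` with an XSLT 1.0 stylesheet. The class name `XsltGrabberSketch` and the method `Transform` are made up for illustration. It assumes the page has already been turned into well-formed XML (real HTML usually is not), and it leaves out the `stylesheet` attribute on `<link>`. Programs are grouped per channel with an `xsl:key` (Muenchian grouping), which is one way to get several channels out of a single page.

```csharp
using System.IO;
using System.Xml;
using System.Xml.Xsl;

// Hypothetical helper, not WebEPG code.
public static class XsltGrabberSketch
{
    // XSLT 1.0 stylesheet: one <programlist> per distinct channel name.
    private const string Stylesheet = @"
<xsl:stylesheet version=""1.0"" xmlns:xsl=""http://www.w3.org/1999/XSL/Transform"">
  <xsl:output method=""xml"" omit-xml-declaration=""yes"" indent=""yes""/>
  <xsl:key name=""byChannel"" match=""div[@class='program']""
           use=""a/span[@class='channel']""/>
  <xsl:template match=""/"">
    <site>
      <!-- Visit only the first program of each channel -->
      <xsl:for-each select=""//div[@class='program'][generate-id() =
          generate-id(key('byChannel', a/span[@class='channel'])[1])]"">
        <programlist channelid=""{a/span[@class='channel']}"">
          <xsl:for-each select=""key('byChannel', a/span[@class='channel'])"">
            <program>
              <title><xsl:value-of select=""a/span[@class='title']""/></title>
              <starttime><xsl:value-of
                select=""substring-before(a/span[@class='time'], ' - ')""/></starttime>
              <endtime><xsl:value-of
                select=""substring-after(a/span[@class='time'], ' - ')""/></endtime>
              <sublinks><link url=""{a/@href}""/></sublinks>
            </program>
          </xsl:for-each>
        </programlist>
      </xsl:for-each>
    </site>
  </xsl:template>
</xsl:stylesheet>";

    // Transforms a well-formed XML page into the <site> document.
    public static string Transform(string wellFormedXml)
    {
        var xslt = new XslCompiledTransform();
        using (XmlReader styleReader = XmlReader.Create(new StringReader(Stylesheet)))
        {
            xslt.Load(styleReader);
        }

        var output = new StringWriter();
        using (XmlReader input = XmlReader.Create(new StringReader(wellFormedXml)))
        using (XmlWriter writer = XmlWriter.Create(output, xslt.OutputSettings))
        {
            xslt.Transform(input, writer);
        }
        return output.ToString();
    }
}
```

Splitting the time span with `substring-before` and `substring-after` relies on the " - " separator shown in the sample; a site that formats times differently would need a different expression.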
     
  7. benjerry
    The .NET XSLT parser is stumbling over this piece of HTML:

    HTML:
    <script type="text/javascript">
        var WlWebsiteId = "tvgids.nl";

        if (typeof(wlrcmd) == 'undefined') var wlrcmd = '';
        document.write('<scri' + 'pt type="text/javascript" src="http://rc.bt.ilsemedia.nl/Tag/ilsemedia/JS/' + WlWebsiteId + '/Gt.js"><\/scri' + 'pt>');

    This part seems to be the problem: '<scri' + 'pt type="text/javascript"

    Perhaps WebEPG needs a filter option for script tags.
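
One possible shape for such a filter, as a sketch only: strip every `<script>...</script>` block from the page text before it reaches the XML parser. The class name `ScriptFilterSketch` and the method `StripScripts` are made up for illustration and are not an existing WebEPG option. The lazy `.*?` stops at the first real `</script>`, and the escaped `<\/scri' + 'pt>` inside the `document.write` call does not count as one, so the whole block is removed.

```csharp
using System.Text.RegularExpressions;

// Hypothetical pre-filter, not an existing WebEPG setting.
public static class ScriptFilterSketch
{
    // Singleline lets '.' cross line breaks; IgnoreCase also covers <SCRIPT>.
    private static readonly Regex ScriptBlock = new Regex(
        @"<script\b[^>]*>.*?</script\s*>",
        RegexOptions.IgnoreCase | RegexOptions.Singleline);

    // Returns the page text with all script blocks removed.
    public static string StripScripts(string html)
    {
        return ScriptBlock.Replace(html, string.Empty);
    }
}
```

This only removes script noise. The rest of the page still has to be well-formed before a plain XML parser will accept it.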
     
  8. arion_p
    Give this a try: Html Agility Pack
    I've used it in the past with much better results than trying to tidy up the HTML and then using the normal .NET XSLT parser.
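
A minimal sketch of how the TVGids.nl snippet from earlier in the thread might be read with Html Agility Pack. It assumes the third-party `HtmlAgilityPack` package is referenced. The class name `AgilityPackGrabberSketch` and the method `Grab` are made up for illustration. The library parses the page into a DOM even when the markup is not well-formed, and the XPath queries then run against that DOM.

```csharp
using System.Collections.Generic;
using HtmlAgilityPack; // third-party package, not part of the .NET framework

// Hypothetical helper, not WebEPG code.
public static class AgilityPackGrabberSketch
{
    // Returns { sublink, time, title, channel } for every program block.
    public static List<string[]> Grab(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        var result = new List<string[]>();

        // SelectNodes returns null (not an empty list) when nothing matches.
        HtmlNodeCollection programs =
            doc.DocumentNode.SelectNodes("//div[@class='program']");
        if (programs == null)
        {
            return result;
        }

        foreach (HtmlNode program in programs)
        {
            HtmlNode link = program.SelectSingleNode(".//a");
            HtmlNode time = program.SelectSingleNode(".//span[@class='time']");
            HtmlNode title = program.SelectSingleNode(".//span[@class='title']");
            HtmlNode channel = program.SelectSingleNode(".//span[@class='channel']");
            if (link == null || time == null || title == null || channel == null)
            {
                continue; // skip blocks that do not have the expected layout
            }

            result.Add(new[]
            {
                link.GetAttributeValue("href", string.Empty),
                time.InnerText.Trim(),
                HtmlEntity.DeEntitize(title.InnerText).Trim(),
                HtmlEntity.DeEntitize(channel.InnerText).Trim()
            });
        }
        return result;
    }
}
```

The time value comes back as the raw "08:45 - 09:00" text, so splitting it into start and end is left to the caller.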
     