Parsing Strings (regexp?)

jameson_uk

Retired Team Member
  • Premium Supporter
  • January 27, 2005
    7,258
    2,528
    Birmingham
    Home Country
    United Kingdom United Kingdom
    I am guessing there is no way to parse the text strings the grabber picks up in order to extract other data?

    E.g. the UK Radio Times grabber is set up as
    Code:
      <Listing type="Data">
        <Site url="http://xmltv.radiotimes.com/xmltv/[ID].dat" post="" external="false" encoding="" />
        <Data rowDelimitor="
    " dataDelimitor="~">#TITLE~~#SUBTITLE~~~#ACTORS~~~#REPEAT~#SUBTITLES~~~~~~~#GENRE~#DESCRIPTION~~#DATE~#START~#END~
    </Data>
      </Listing>
    Two random rows are
    Code:
    FlashForward~~5/22 - Give Me Some Truth~~~Joseph Fiennes,John Cho,Sonya Walger,Jack Davenport,Genevieve Cortese,Cynthia Addai-Robinson~false~false~false~true~false~false~false~false~~~Drama~Sci-fi drama about a mysterious event that causes the population of the entire world to black out simultaneously. Mark is questioned about his flashforward during a Senate Intelligence Committee hearing. Elsewhere, Janis is forced to examine the future of her current romantic relationship.~false~02/11/2009~20:00~21:00~60
    
    
    Sex and the City~~Out of the Frying Pan~~~Sarah Jessica Parker,Cynthia Nixon,Kim Cattrall,Kristin Davis~false~false~false~true~false~false~false~false~~~Sitcom~Sitcom about a thirtysomething writer who draws on her experience of the New York singles scene for her society column. Charlotte runs to escape the pain of not having a child. Miranda emphasises Brady's needs, and Samantha gets a new hairstyle.~false~15/11/2009~23:00~23:45~45

    The first has the episode / series embedded in the #SUBTITLE. This would be simple enough to strip off with a regular expression, but I am guessing this is not possible (yet?) with the WebEPG grabber?
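To make the idea concrete outside WebEPG, here is a minimal Python sketch of stripping the leading "5/22 - " prefix from the subtitle field. The helper name `split_subtitle` and the prefix pattern are assumptions made for illustration, the row is abbreviated to its first few fields, and the thread does not say which of the two numbers is the episode, so they are returned unlabelled.

```python
import re

# Hypothetical helper, not part of WebEPG: split "5/22 - Give Me Some Truth"
# into its two leading numbers and the remaining subtitle text.
PREFIX = re.compile(r"^(\d{1,3})/(\d{1,3})\s*-\s*(.*)$")

def split_subtitle(subtitle):
    """Return (first_number, second_number, remaining_subtitle)."""
    m = PREFIX.match(subtitle)
    if m is None:
        # No "n/m - " prefix: leave the subtitle untouched.
        return None, None, subtitle
    return int(m.group(1)), int(m.group(2)), m.group(3)

# First few fields of the FlashForward row (abbreviated), split on "~".
row = "FlashForward~~5/22 - Give Me Some Truth~~~Joseph Fiennes,John Cho"
fields = row.split("~")
print(split_subtitle(fields[2]))
print(split_subtitle("Out of the Frying Pan"))
```

The second call shows that a subtitle without the prefix passes through unchanged.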
     

    arion_p

    Retired Team Member
  • Premium Supporter
  • February 7, 2007
    3,373
    1,626
    Athens
    Home Country
    Greece Greece
    Unfortunately, WebEPG does not support searches for the Data and XML parsers. Only the HTML parser supports searches.
    Sorry.
     

    jameson_uk

    OK, thanks. I have been reading the MediaPortal_WebEPG_Grabber page of the MediaPortal Manual Documentation, as the second UK grabber is HTML, but I am not 100% sure I get it.

    Code:
    <Search match="\([0-9]{1,3}[,][0-9]{0,3}\)" field="#EPISODE" remove="true" />
    
    <Search match="\([0-9]{1,3}\)" field="#EPISODE" remove="true" />
    
    <Search match="\([0-9]{1,3}[/][0-9]{0,3}\)" field="#EPISODE" remove="true" />
    Surely, across all three, these will just add any number in the text to #EPISODE?
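As a quick check of what the three patterns actually match, the sketch below runs them with Python's `re` module (an assumption: WebEPG uses .NET regular expressions, but these simple constructs behave the same way in both). The sample text is made up for illustration. Because each pattern starts with `\(` and ends with `\)`, only numbers wrapped in literal parentheses are picked up.

```python
import re

# The three match patterns from the grabber, copied verbatim.
PATTERNS = [
    r"\([0-9]{1,3}[,][0-9]{0,3}\)",
    r"\([0-9]{1,3}\)",
    r"\([0-9]{1,3}[/][0-9]{0,3}\)",
]

def bracketed_numbers(text):
    """Collect every match of each pattern, in pattern order."""
    found = []
    for pattern in PATTERNS:
        found.extend(re.findall(pattern, text))
    return found

# Hypothetical sample text, not taken from a real listing.
sample = "Drama (5/22) first shown in 2009, part (12), rating (1,3)"
print(bracketed_numbers(sample))
```

The bare number 2009 is ignored, since it is not enclosed in parentheses.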

    Is there no way to use back references, so that if you have
    Series 1 Episode 5
    in the HTML I can extract just the relevant numbers?

    i.e. match="Series\s([0-9]{1,3})" and just use the sub-expression in brackets?
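For reference, this is what using the bracketed sub-expression (a capture group) looks like in a general-purpose regex engine. It is a Python sketch only; whether WebEPG's <Search> exposes capture groups is not something this example can confirm, and `first_group` is a hypothetical helper.

```python
import re

def first_group(pattern, text):
    """Return the text captured by the first group, or None if no match."""
    m = re.search(pattern, text)
    return m.group(1) if m else None

text = "Series 1 Episode 5"
series = first_group(r"Series\s([0-9]{1,3})", text)
episode = first_group(r"Episode\s([0-9]{1,3})", text)
print(series, episode)
```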
     

    arion_p

    <Search> is used to extract field values (e.g. #EPISODE) from raw HTML. If you want to modify already extracted field values, use <Modify>:
    Code:
    <Modify channel="" field="" search="" action="">value</Modify>

    IIRC, value may contain regex references, but I have to verify that.
     

    jameson_uk

    So if I have, say,
    Series 5 Episode 6
    in the description, I could do something like
    <Search match="Episode [0-9]{1,3}" field="#EPISODE" remove="true" />
    in the searches and then use
    <Modify channel="*" field="#EPISODE" search="Episode\s" action="REPLACE"></Modify>
    in the actions.

    This would then leave the #EPISODE field as just the number?
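The two-step idea (capture "Episode 6", then strip the "Episode " part) can be simulated in Python as follows. This is only a sketch of the regex logic under the assumption that remove="true" cuts the matched text out of the source; `search_and_remove` is a hypothetical stand-in, not WebEPG code.

```python
import re

def search_and_remove(description, pattern):
    """Mimic a search with remove="true": return the matched text and the
    description with that match cut out."""
    m = re.search(pattern, description)
    if m is None:
        return None, description
    return m.group(0), description[:m.start()] + description[m.end():]

description = "Series 5 Episode 6"

# Step 1: the search stores "Episode 6" and removes it from the description.
episode, rest = search_and_remove(description, r"Episode [0-9]{1,3}")

# Step 2: the modify step strips "Episode " from the stored value.
episode = re.sub(r"Episode\s", "", episode)
print(repr(episode), repr(rest))
```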

    Checking the logs (by the way, these should really be in their own file or in epg.log rather than tv.log), I am guessing that the database fields are loaded as each row is parsed? If that is the case, would the modify action take place after everything is loaded and then go back and update the database? (I am not at home to check, but is the series field in the database numeric or VARCHAR?)
     

    arion_p

    This should work, but you don't need to replace "Episode " with an empty string, you can just remove it, as in:
    Code:
    <Modify channel="*" field="#EPISODE" search="Episode\s" action="Remove"></Modify>
    You could easily get the same result using the following:
    Code:
    <Search match="(?<=Episode\s)[0-9]{1,3}" field="#EPISODE" remove="false" />
    <Search match="Episode\s[0-9]{1,3}" remove="true" />
    The first search will find the episode number and store only the number in #EPISODE (note the use of the "positive lookbehind grouping construct"). The second search just removes the entire "Episode nnn" string, so it does not appear in other fields. If you used only the first search with remove="true", it would remove only the episode number, not the string "Episode ".
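The behaviour of the lookbehind can be checked in isolation. The sketch below uses Python's `re` module rather than .NET (an assumption; both support `(?<=...)` for a fixed-width prefix such as `Episode\s`), with the sample string taken from the earlier post.

```python
import re

text = "Series 5 Episode 6"

# Lookbehind: "Episode " must precede the digits but is not part of the match.
number = re.search(r"(?<=Episode\s)[0-9]{1,3}", text).group(0)

# Removing the full "Episode nnn" string, as the second search does.
cleaned = re.sub(r"Episode\s[0-9]{1,3}", "", text)

# Removing only the lookbehind match leaves the word "Episode " behind.
lookbehind_only = re.sub(r"(?<=Episode\s)[0-9]{1,3}", "", text)

print(repr(number), repr(cleaned), repr(lookbehind_only))
```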

    In general, the order of operations for each match of the template is:

    1. all searches are applied
    2. field values are extracted based on the template
    3. sublinks are scanned and field values are extracted
    4. "modify" actions are applied on the extracted fields
    5. after all data for a channel has been processed it is stored to the database or saved to tvguide.xml
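The ordering above can be pictured with a toy model. Everything in this Python sketch is hypothetical: the function, the tuple layouts and the regex-based "template" are illustrative stand-ins and do not reflect how WebEPG is implemented. Sublink scanning (step 3) and storage (step 5) are left out.

```python
import re

def process_match(raw, searches, template, modifies):
    """Toy model of the order: searches, template extraction, modify actions."""
    fields = {}
    # 1. Searches run first, against the raw text of the match.
    for pattern, field, remove in searches:
        m = re.search(pattern, raw)
        if m is None:
            continue
        if field:
            fields[field] = m.group(0)
        if remove:
            raw = raw[:m.start()] + raw[m.end():]
    # 2. Remaining fields come from the template (a regex with named groups here).
    m = re.search(template, raw)
    if m:
        for name, value in m.groupdict().items():
            fields["#" + name] = value
    # 3. Sublink scanning is omitted in this sketch.
    # 4. Modify actions run last, on fields that have already been extracted.
    for field, pattern in modifies:
        if field in fields:
            fields[field] = re.sub(pattern, "", fields[field])
    return fields

result = process_match(
    "FlashForward Episode 6 Drama",                      # made-up raw text
    searches=[(r"Episode [0-9]{1,3}", "#EPISODE", True)],
    template=r"(?P<TITLE>\S+)\s+(?P<GENRE>\S+)",
    modifies=[("#EPISODE", r"Episode\s")],
)
print(result)
```

The point of the model is only that the modify step sees "Episode 6" after the search has stored it, and trims it to the bare number before anything would be saved.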
     

    jameson_uk

    I have been stuck with poor regexp functionality in the languages I use at work, and it has been a long time since I used regexps in .NET. I will have a play with this next time I get a chance.

    Thanks :)
     

    jameson_uk

    Code:
    <Search match="(?<=Episode\s)[0-9]{1,3}" field="#EPISODE" remove="false" />
    I got errors with this: the parser does not like the < character inside an attribute.

    I will have to play with this some more, as I have a few other issues I need to sort out first.
     

    arion_p

    Oops, the characters <, & and > should be escaped. Use
    &lt; instead of <
    &gt; instead of >
    &amp; instead of &

    So the code snippet should have been:
    Code:
    <Search match="(?&lt;=Episode\s)[0-9]{1,3}" field="#EPISODE" remove="false" />
    Sorry I missed that.
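The escaping issue is a general XML rule and can be reproduced with any XML parser. The Python sketch below uses the standard library (`xml.etree.ElementTree` and `xml.sax.saxutils.escape`), not WebEPG's own loader, to show that the unescaped snippet is rejected while the escaped one parses and yields the original pattern back.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape

pattern = r"(?<=Episode\s)[0-9]{1,3}"

def parses(xml_text):
    """Return True if xml_text is well-formed XML."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False

template = '<Search match="%s" field="#EPISODE" remove="false" />'
raw = template % pattern              # "<" left as-is inside the attribute
escaped = template % escape(pattern)  # "<" written as "&lt;"

print(parses(raw), parses(escaped))
# The parser turns "&lt;" back into "<", so the regex engine sees the original pattern.
print(ET.fromstring(escaped).get("match") == pattern)
```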
     

    jameson_uk

    It is me... It has been so long since I have done this sort of stuff. It is obvious when you think about it....
     
