SiteParser: Force a specific charset ??

ScRePt

Portal Pro
August 2, 2010
170
96
Athens
Home Country
Greece Greece
I am trying to parse the MEGA TV site through OnlineVideos the easy way: with SiteParser.

But I have a problem with the charset. The site is in windows-1253 (Greek), but the "Create regex" window is probably in UTF-8, so I am unable to read the content of the page (the Greek content, anyway).

Is there any way to force the "web data" textbox to use windows-1253? The site itself declares the charset in its "Content-Type" tag.
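To illustrate the symptom, here is a minimal Python sketch (not SiteParser's actual code, which the thread does not show). It assumes a hypothetical page fragment and a made-up helper, `sniff_charset`, that reads the charset from the page's own Content-Type declaration before decoding.

```python
import re

# Hypothetical page fragment, stored as raw windows-1253 bytes the way a
# server would send it. The markup is an assumption for illustration only.
html_bytes = (
    '<meta http-equiv="Content-Type" content="text/html; charset=windows-1253">'
    "<title>Ελληνικά</title>"
).encode("cp1253")


def sniff_charset(raw: bytes, default: str = "utf-8") -> str:
    """Return the charset named in the page's declaration, or a default."""
    m = re.search(rb'charset=["\']?([A-Za-z0-9_-]+)', raw)
    return m.group(1).decode("ascii") if m else default


charset = sniff_charset(html_bytes)

# Decoding with the declared charset keeps the Greek text readable.
text = html_bytes.decode(charset)

# Decoding the same bytes as UTF-8 yields replacement characters instead.
wrong = html_bytes.decode("utf-8", errors="replace")

print(charset)
print(text)
print(wrong)
```

The same bytes are readable or garbled depending only on which charset is used to decode them, which is why honouring the declared charset matters.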

Thank you for this awesome plugin.
 

doskabouter

Development Group
  • Team MediaPortal
  • September 27, 2009
    4,566
    2,938
    Nuenen
    Home Country
    Netherlands Netherlands
    I thought the charset stuff was all fixed in C# :(

    The text gets garbled because I convert it to RTF so I can apply the coloring. I will look into it shortly.

    And as always: It's appreciated if you publish the results of your work, so that others may enjoy it too!
     

    ScRePt

    Of course!!

    That is what I have been trying to do for the last 2 hours...
    You would make my life easier (no offence) if you supported multiple matches per group.

    Right now you support only:
    "How many times the regex is found in the HTML" => "For each of the matches, take the first value of each group"

    If it worked like this instead, it would be easier:
    "How many times the regex is found in the HTML" => "For each of the matches, for each of the groups" => "take all the values"

    For example, the following returns only the first title:
    (?<title>mytitle)+
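The effect can be reproduced outside SiteParser. The sketch below uses Python's `re` module on a made-up input string, so details may differ from the .NET engine behind SiteParser. In particular, Python keeps the last repetition of a quantified group, and which repetition a given tool reports can vary. The point is the same either way: a quantified group gives one match with a single value in the group slot.

```python
import re

# Hypothetical input: three titles in a row.
html = "title1 title2 title3"

# Quantified group: ONE match, and the named group holds a single value.
single = re.search(r"(?:(?P<title>title\d)\s*)+", html)
kept = single.group("title")

# Unquantified pattern applied repeatedly: one match per title.
all_titles = [m.group("title") for m in re.finditer(r"(?P<title>title\d)", html)]

print(kept)
print(all_titles)
```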
     

    offbyone

    Development Group
  • Team MediaPortal
  • April 26, 2008
    3,989
    3,712
    Stuttgart
    Home Country
    Germany Germany
    Each match should contain multiple named groups, because each match corresponds to a video, which needs more than just a title (url, desc, thumb). So you have to do the matching the other way around: find a regex that matches n times, with the relevant parts of each match captured in groups with predefined names.
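As a rough sketch of the "one match per video" approach, in Python rather than SiteParser itself: the markup is invented, and the group names simply mirror the ones mentioned in the post. The exact names SiteParser expects are an assumption here.

```python
import re

# Hypothetical listing markup, one <li> per video.
html = """
<li><a href="/videos/1"><img src="/thumbs/1.jpg">First clip</a><p>About one</p></li>
<li><a href="/videos/2"><img src="/thumbs/2.jpg">Second clip</a><p>About two</p></li>
"""

# One regex that matches once per video; each part of the match is captured
# in its own named group.
VIDEO = re.compile(
    r'<li><a href="(?P<url>[^"]+)">'
    r'<img src="(?P<thumb>[^"]+)">'
    r"(?P<title>[^<]+)</a>"
    r"<p>(?P<desc>[^<]*)</p></li>"
)

# n matches -> n videos, each with all four fields.
videos = [m.groupdict() for m in VIDEO.finditer(html)]

for video in videos:
    print(video)
```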
     

    ScRePt

    I understand your concern, but what if someone wants to match *some* videos and not all of them?
    There is no way to tell the parser to start at one point in the HTML and stop at another.
    As a result, you end up matching all the videos.

    The other way is simple: start from there, match with +, stop there! 1 match, multiple videos!
    Maybe support an additional regex for splitting the HTML?
     

    doskabouter

    Fixed in the current SVN, so unless you're able to compile it yourself, you will have to wait for the next release.

    As for your other problem: I have stumbled upon it many times, and there are techniques to do this (involving something like non-capturing, back-referencing, look-ahead/look-behind, blabla something), but I was never able to understand or use them :(.
    But most of the time I was able to work around it by being more specific in my regex (mostly with parts of the URL, e.g. (?<url>http://website\.com/videos[^"]*) )
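A small Python sketch of that workaround, using invented links on the post's placeholder domain: anchoring the `url` group on a known URL prefix keeps unrelated links out of the results.

```python
import re

# Hypothetical page with video links mixed in among other links.
html = (
    '<a href="http://website.com/videos/abc">Wanted</a>'
    '<a href="http://website.com/news/xyz">Not a video</a>'
    '<a href="http://website.com/videos/def">Also wanted</a>'
)

# The url group only accepts links that start with the /videos prefix.
pattern = re.compile(
    r'<a href="(?P<url>http://website\.com/videos[^"]*)">(?P<title>[^<]*)</a>'
)

urls = [m.group("url") for m in pattern.finditer(html)]
print(urls)
```

This only helps when the wanted URLs share a recognisable prefix.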
     

    offbyone

    You can use lookahead and lookbehind regex constructs to skip parts of the HTML. Some sites already do.
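A minimal sketch of the lookahead idea in Python, on hypothetical markup (the real site's HTML is not shown in the thread). Each plain `<li>` is accepted only if a `class=2` marker still appears later in the page, so items after that marker are skipped.

```python
import re

# Hypothetical markup: section markers carry a class, video items do not.
html = """
<li class=1>Section one</li>
<li>wanted A</li>
<li>wanted B</li>
<li class=2>Section two</li>
<li>unwanted C</li>
<li class=3>Section three</li>
"""

# The lookahead (?=...) checks that "<li class=2>" occurs somewhere after the
# item without consuming any text, so the next match can start right away.
pattern = re.compile(r"<li>(?P<title>[^<]+)</li>(?=.*<li class=2>)", re.DOTALL)

titles = [m.group("title") for m in pattern.finditer(html)]
print(titles)
```

Two caveats: this does not exclude items that come before the first marker, and the lookahead rescans the rest of the page for every candidate, which can get slow on large pages.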
     

    ScRePt

    Hehe, those backreferences are what I am reading up on right now... my regex froze the SiteParser :p

    I am telling you, the easiest way of getting around this is
    - either supporting extra start/end regexes applied to the resulting HTML
    - or supporting 1 match with multiple matches per group.

    Unfortunately, the site I'm trying to parse has the following:
    <li class=1 ....
    <li ... videoIwant ...
    <li ... videoIwant ...
    <li class=2 ...
    <li videoIdontwant ...
    ...
    <li class=3 ...

    Unless I tell it to match all the <li> from class 1 to class 2, I'm stuck :(
    The <li> are the same for all classes. The URLs are too random to match on :(
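For comparison, this is roughly what the requested "start/end regex" would amount to, sketched in Python on hypothetical markup shaped like the listing above. It is not an existing SiteParser feature. A first regex cuts out the region between the two markers, and the ordinary per-video regex then runs on that region only.

```python
import re

# Hypothetical markup following the structure described in the post.
html = """
<li class=1>Section one</li>
<li>wanted A</li>
<li>wanted B</li>
<li class=2>Section two</li>
<li>unwanted C</li>
<li class=3>Section three</li>
"""

# Stage 1: isolate everything between the class=1 marker and the class=2 marker.
region = re.search(
    r"<li class=1>.*?</li>(?P<body>.*?)<li class=2>", html, re.DOTALL
)
body = region.group("body") if region else ""

# Stage 2: the usual one-match-per-video regex, applied to the region only.
titles = [m.group("title") for m in re.finditer(r"<li>(?P<title>[^<]+)</li>", body)]
print(titles)
```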


    Edit (my bad for forgetting): Thank you so much for the lightning-fast fix!!!
     

    offbyone

    We cannot change the default GenericSite behavior now; there are too many sites based on the current version.
    If that behavior is not enough in your case, you can always write another util based on the generic one that uses the regex differently ;)
     

    ScRePt

    Couldn't you add an option to the generic parser, like "forceUTF8Encoding", so that the matcher behaves the way I described?
    Implementing a duplicate of the generic parser just to change 10 lines (one more loop each for parse, categories, subcategories) could lead to future incompatibilities...
     
