WebEPG and proper Title-casing in Greek

arion_p

Retired Team Member
  • Premium Supporter
  • February 7, 2007
    3,367
    1,642
    Athens
    Home Country
    Greece Greece
    Hi,
    I've been using MP for some 6 months now and I love it.
    I live in Greece and lately I tried creating some grabber files for Greek channels to supplement the one included (GR/www_in_gr.xml, which I have also enhanced to include descriptions and genres as well as more channels).
    These new grabber files use the websites of the respective broadcasters, but unfortunately the sites use all caps for the program titles. WebEPG tries to title-case those titles and uses the standard .NET ToLower() method. Unfortunately, .NET does not correctly handle some special cases in Greek and, IIRC, some other languages. Specifically, lowercasing the Greek letter sigma is context sensitive: it becomes the final form "ς" (U+03C2) at the end of a word and the non-final form "σ" (U+03C3) anywhere else (i.e. in the middle of a word). ToLower() always produces the non-final form. Although the meaning is not altered (as happens in some other languages), it is still plain wrong. Imagine if HELLO was title-cased as HellO: you can still understand the meaning, but it doesn't look right, does it?
    I could patch this in WebEPG (just replace the non-final sigma with the final sigma if it is at the end of a word), but since there are special cases in other languages too, perhaps there should be a more structured way to handle this (e.g. an extensible class in Utils to handle special cases of case folding).
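
The "extensible class" idea above could look roughly like the sketch below. It is only an illustration: `ILowerCaseFixup`, `GreekFinalSigmaFixup` and `SpecialCasing` are hypothetical names, not existing MediaPortal or WebEPG types, and the final-sigma rule is a simplified version of the Unicode condition (a sigma counts as final when a letter precedes it and no letter follows it).

```csharp
using System.Collections.Generic;
using System.Globalization;
using System.Text;

// Hypothetical extension point: one fix-up per language-specific casing rule.
public interface ILowerCaseFixup
{
    // Receives text that has already been lowercased and returns the corrected text.
    string Apply(string lowered);
}

// Replaces a word-final sigma (U+03C3) with the final-sigma form (U+03C2).
public sealed class GreekFinalSigmaFixup : ILowerCaseFixup
{
    public string Apply(string lowered)
    {
        var sb = new StringBuilder(lowered);
        for (int i = 0; i < sb.Length; i++)
        {
            if (sb[i] != '\u03c3') continue;
            bool letterBefore = i > 0 && char.IsLetter(sb[i - 1]);
            bool letterAfter = i + 1 < sb.Length && char.IsLetter(sb[i + 1]);
            // Final only when the sigma ends a word that has a letter before it.
            if (letterBefore && !letterAfter) sb[i] = '\u03c2';
        }
        return sb.ToString();
    }
}

// Hypothetical utility class that lowercases and then applies every registered fix-up.
public static class SpecialCasing
{
    private static readonly List<ILowerCaseFixup> Fixups =
        new List<ILowerCaseFixup> { new GreekFinalSigmaFixup() };

    public static void Register(ILowerCaseFixup fixup) => Fixups.Add(fixup);

    public static string ToLower(string text, CultureInfo culture)
    {
        string result = culture.TextInfo.ToLower(text);
        foreach (ILowerCaseFixup fixup in Fixups) result = fixup.Apply(result);
        return result;
    }
}
```

Other languages with their own casing quirks would add another `ILowerCaseFixup` implementation and register it, leaving the callers of `SpecialCasing.ToLower` unchanged.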

    As a side note: I noticed that (almost) all comparisons of program titles, genres and channel names are binary, which makes them fast but case and accent sensitive (e.g. if I schedule a program to record "every time" and the site then changes the case of the titles, the program is no longer considered the same and is not recorded).
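
A case- and accent-insensitive comparison, as opposed to the binary one described above, can be sketched with the standard .NET `CompareInfo.Compare` overload that takes `CompareOptions`. `TitleComparer` and `SameTitle` are hypothetical names for illustration only; this is not how MediaPortal compares titles, and the exact treatment of accented Greek letters depends on the runtime's globalization data.

```csharp
using System.Globalization;

public static class TitleComparer
{
    // True when the two titles differ at most in letter case and in
    // non-spacing marks (accents), according to the given culture.
    public static bool SameTitle(string a, string b, CultureInfo culture)
    {
        const CompareOptions options =
            CompareOptions.IgnoreCase | CompareOptions.IgnoreNonSpace;
        return culture.CompareInfo.Compare(a, b, options) == 0;
    }
}
```

Such a comparison is slower than a binary one, so it is a trade-off between speed and tolerance to sites changing the case of their titles. Whether it also treats "σ" and "ς" as equal would need to be checked separately.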

    Regards,
    Panayotis

    PS: I will post the grabber files once finished for those interested.
     

    James

    Retired Team Member
  • Premium Supporter
  • May 6, 2005
    1,385
    67
    Switzerland
    Hi Panayotis,

    Thanks for your detailed comments. Would you be able to test the following code with your example:


    CultureInfo cultureInfo = Thread.CurrentThread.CurrentCulture;
    TextInfo textInfo = cultureInfo.TextInfo;
    string titlecase = textInfo.ToTitleCase(uppercase);


    I believe this should use the cultural information to perform the title-case conversion. If it works, I will add it to WebEPG. If you are not able to test it, can you provide me with the example text?

    Thanks,

    /James
     

    arion_p

    Hi James,

    I tried your code this morning but it didn't work. I also remember reading an article about the Greek final sigma special-casing rules in Unicode. According to the article, the special-casing rules were initially included in the Unicode draft but were later dropped to avoid complexity. However, the current version of Unicode includes those rules. Anyway, it seems the .NET team decided not to implement them (regardless of the Unicode standard).

    Actually, I think the problem is not with ToTitleCase() but with ToLower(). ToTitleCase() takes a lowercase string and uppercases the first letter of each word. If you pass it an all-uppercase string, it returns it unchanged. In WebEPG, when a title consisting only of uppercase letters is found, it is first lowercased via ToLower() and the result is then fed to ToTitleCase(). It is ToLower() that fails to lowercase Greek sigmas properly.
    E.g. (hope you can see Greek characters)
    "ΙΣΩΣ" should become "ισως" but ToLower() returns "ισωσ"
    ("Σ" becomes "σ" in the middle of a word but "ς" at the end)
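
The flow described above (detect an all-uppercase title, lowercase it, then title-case it) can be sketched as follows. `TitleCasing.NormalizeTitle` is a hypothetical helper that only mirrors the description in this post; it is not WebEPG's actual code.

```csharp
using System.Globalization;

public static class TitleCasing
{
    // Hypothetical helper mirroring the flow described in the post.
    public static string NormalizeTitle(string title, CultureInfo culture)
    {
        TextInfo textInfo = culture.TextInfo;

        // A title counts as all-uppercase when uppercasing does not change it.
        bool allUpper = title == textInfo.ToUpper(title);
        if (!allUpper) return title; // mixed-case titles are left as they are

        // ToTitleCase leaves all-uppercase words untouched, so lowercase first.
        return textInfo.ToTitleCase(textInfo.ToLower(title));
    }
}
```

For a Greek title, the `ToLower` call in the last line is the step where, according to the example above, every "Σ" comes out as "σ" regardless of its position in the word.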

    Along those lines, I also tried the following code (which didn't work either):
    Code:
    CultureInfo cultureInfo = Thread.CurrentThread.CurrentCulture;
    TextInfo textInfo = cultureInfo.TextInfo;
    
    string titlecase = textInfo.ToTitleCase(textInfo.ToLower(uppercase));

    I also tried a specific culture (both "el" and LCID 1032) and InvariantCulture (which shouldn't work anyway).

    The only way I could make it work is using RegEx:

    Code:
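    // Requires: using System.Globalization; using System.Text.RegularExpressions; using System.Threading;
    CultureInfo cultureInfo = Thread.CurrentThread.CurrentCulture;
    TextInfo textInfo = cultureInfo.TextInfo;
    // Match a lowercase sigma (U+03C3) followed by the end of the string or a non-word character.
    Regex re = new Regex("\\u03c3(?=($|\\W))");
    
    // Lowercase first, replace each word-final sigma with U+03C2, then title-case.
    string titlecase = textInfo.ToTitleCase(re.Replace(textInfo.ToLower(uppercase), "\u03c2"));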
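
One edge case of the pattern above is a sigma standing on its own (for example as a one-letter abbreviation), which it would also turn into the final form. A possible variant, offered only as a sketch and not as the code that went into WebEPG, adds a lookbehind so that the sigma must be preceded by a word character. `FinalSigma.Fix` is a hypothetical name.

```csharp
using System.Text.RegularExpressions;

public static class FinalSigma
{
    // A sigma (U+03C3) preceded by a word character and not followed by one.
    private static readonly Regex WordFinalSigma =
        new Regex(@"(?<=\w)\u03c3(?!\w)");

    // Expects text that has already been lowercased.
    public static string Fix(string lowered)
    {
        return WordFinalSigma.Replace(lowered, "\u03c2");
    }
}
```

It would be used in place of `re.Replace(...)` in the snippet above: `textInfo.ToTitleCase(FinalSigma.Fix(textInfo.ToLower(uppercase)))`.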

    Thanks,
    Panayotis
     

    James

    Thanks for the info and test.

    I was wondering if it is better to leave these titles in upper case?

    I made this system because in many languages all upper case looks bad, but maybe that is not the case in Greek?
     

    arion_p

    Hi James,

    Actually, it does look bad in Greek too. However, lowercasing Greek is really hard, and not just because of the final sigma. In Greek, all-uppercase text is never accented, but mixed-case and lowercase text (almost) always is. You need a dictionary to know where to put the accent, so we just settle for simple handling of the final sigma.

    Anyway, I believe adding an option in the grabber xml to leave the titles as they are, is a good thought. The option could be per grabber or per template.
     

    James

    Hi Panayotis,

    I've modified the actions to support regex, so adding:

    <Modify channel="*" field="TITLE" search="\\u03c3(?=($|\\W))" action="Replace">\u03c2</Modify>

    Should work. See the wiki for more details about where the modify actions are added in the grabber file.

    Cheers,

    /James
     

    arion_p

    Hi James,

    I am sorry I couldn't reply earlier. I just found some time to get the new version and try it out. The change you made does the job nicely. I hope to have the new grabbers ready pretty soon.

    Thanks,

    Panayotis
     
