CSFD scraper script 0.2.3 [CZ]

Trottel

Portal Member
February 18, 2009
48
26
Liberec
Home Country
Czech Republic
Re: CSFD scraper script 0.2.0 [CZ]

Added new version to the first post.

Version: 0.2.1
  • I rewrote the whole script because data retrieval was too slow
 

Trottel

Re: CSFD scraper script 0.2.0 [CZ]

Added new version to the first post.

Version: 0.2.2
  • Fixed a search error that caused CPU overload (thanks to JiRo for the cooperation)
  • If the runtime isn't found on IMDb, the script now tries to find it on CSFD
 

Trottel

Re: CSFD scraper script 0.2.0 [CZ]

Added new version to the first post.

Version: 0.2.3
  • Fixed retrieval of Directors, Actors and Summary from CSFD
 

JiRo

MP Donator
  • Premium Supporter
  • May 1, 2009
    184
    44
    Prague
    Home Country
    Czech Republic
    Re: CSFD scraper script 0.2.0 [CZ]

    Added new version to the first post.

    Version: 0.2.3
    • Fixed retrieval of Directors, Actors and Summary from CSFD

    Hi,

    I noticed a week ago that something was wrong in the CSFD scraper and was about to write to you. You were faster, and as always you did a great job. Thanks.

    JiRo.
     

    acmetelka

    New Member
    November 22, 2011
    2
    0
    Home Country
    Czech Republic
    Hi,

    I have a question about this scraper, and maybe about other scrapers too.

    Is it possible to make the movie database in Moving Pictures (using this scraper or some combination of scrapers) look as follows?

    I'm from the Czech Republic, so I mainly have Czech or English movies in my DB. This is what I would prefer in the database:

    - if the movie is Czech, the title in the catalog is in Czech only, and the description is in Czech too (information from CSFD)
    - if the movie is English, the title in the catalog would be "English original title (Czech alternative title)" or (as an enhancement) "Czech title (English original title)", with the description in Czech if it is available on CSFD, or in English if the movie is found only on IMDb (I found something similar in the IMDb+ scraper, but as I understand it, that scraper gets all its information only from IMDb, so in English)

    I was thinking about how to achieve this: first look for the movie on CSFD; if it is found and there is also an IMDb link, get the English name as well. If the movie is Czech, set the Czech title; if it is English, set the title to a combination of the titles from CSFD and IMDb.
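
The rule described above can be sketched in a few lines of Python. This is only an illustration of the idea, not part of the scraper: the function name `catalog_title` and the `is_czech_movie` flag are hypothetical, and how to decide that a movie "is Czech" is left open.

```python
from typing import Optional


def catalog_title(czech_title: Optional[str],
                  original_title: Optional[str],
                  is_czech_movie: bool) -> str:
    """Build the catalog title from the CSFD (Czech) and IMDb (original) titles.

    Hypothetical helper: the caller must decide what counts as a Czech movie.
    """
    # Czech movie: show the Czech title only.
    if is_czech_movie and czech_title:
        return czech_title
    # Foreign movie with two different titles: "Original (Czech)".
    if original_title and czech_title and original_title != czech_title:
        return f"{original_title} ({czech_title})"
    # Otherwise fall back to whichever title exists.
    return original_title or czech_title or ""
```

For example, `catalog_title("Kocour v botách", "Puss in Boots", False)` gives `Puss in Boots (Kocour v botách)`, while a movie found only on IMDb keeps its original title.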

    Would something like this be possible, using this scraper or some combination of scrapers?

    Thanks a lot,

    Metelka
     

    JiRo

    Hi Metelka,

    I understand your idea, and I would welcome something like that. I thought about it and started adjusting the scraper a few months ago, but after a while I stopped because I had no idea how it should work. Your vision is clearer (and also easier). Anyway, to answer your question: what you want can only be achieved by adjusting the scraper :(. I'll think about your idea.

    Thinking... :p

    And now my first question: what is an English movie?

    Seriously, look at it and try something.

    JiRo (Jirka).
     

    Trottel

    I made some changes (and created a new scraper, CSFD+). Try it and see if that's the way you wanted it.
    Forget it, I misunderstood the request :sorry:
     

    RoChess

    Extension Developer
  • Premium Supporter
  • March 10, 2006
    4,434
    1,897
    I feel your pain. The imdb.com website decided, out of sheer stupidity, that if you are in Germany and viewing imdb.com you should get the German translated title (if one is available). You can disable this behaviour if you sign up at imdb.com and adjust your profile settings, but the scraper-script is unable to do that.

    That's when I started writing a system in IMDb+ that would attempt to 'recognize' what makes a title English. The current method works 'ok', but it is not without mistakes and now takes up more than 1/4th of the entire IMDb+ scraper-script. I'm actually expanding it to support more languages (I lost my mind already, so why not :D), but the only proper support I can deal with right now is English, German, French, Spanish, Portuguese, Italian, Icelandic, Swedish and Dutch.

    The only advice I can give you for the CSFD scraper-script is that inside the search-node you have access to the filtered title and even the filename itself. So you can use this not only as the search string to pass on to the CSFD website to locate the movie via alternative titles, but also to compare the results.

    Example File = "Puss in Boots (2011).mkv"

    Search at CSFD with title "Puss in Boots" = search results

    Not sure if that's how you do it in your scraper-script, or if you use some API or other method, but that makes no difference for what I'm trying to explain. So your CSFD scraper-script will eventually turn movie[0] into the details of the following movie: Kocour v botách, and put in the AKA title of "Puss in Boots" as well.

    MovingPictures is then able to match the original filename title with the AKA title and will auto-approve the search results found (and instruct the CSFD scraper-script via the details-node to get all the info, such as summary, crew, etc). I assume, however, that in your CSFD scraper-script you used movie[0].title = "Kocour v botách", so that is what MovingPictures will use as the title, even though it auto-approved via movie[0].alternative_title.

    However, you have full control over what movie[0].title becomes inside the search-node, and you could have used movie[0].title = "Puss in Boots" as well, which would have given the English title results.

    So what you can do is, inside the search-node, compare the title from the filename, which is ${search.title}, to the AKA title results, and when you find a match, overrule movie[0].title with the ${search.title} value.
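
As a rough illustration of that comparison, here is a minimal Python sketch. It is not scraper-script syntax: `pick_title`, `normalize` and their parameters are hypothetical names, and the accent-insensitive, case-insensitive comparison is an assumption about what counts as a "match".

```python
import unicodedata
from typing import Iterable


def normalize(title: str) -> str:
    """Lower-case, strip accents and collapse whitespace for comparison."""
    decomposed = unicodedata.normalize("NFKD", title)
    no_accents = "".join(ch for ch in decomposed
                         if not unicodedata.combining(ch))
    return " ".join(no_accents.lower().split())


def pick_title(search_title: str,
               primary_title: str,
               aka_titles: Iterable[str]) -> str:
    """Return the title to store as movie[0].title.

    If the title taken from the filename matches one of the AKA titles,
    the filename title wins; otherwise the site's primary title is kept.
    """
    wanted = normalize(search_title)
    for aka in aka_titles:
        if normalize(aka) == wanted:
            return search_title
    return primary_title
```

With the primary title "Kocour v botách" and the AKA list `["Puss in Boots"]`, a search title of "Puss in Boots" is kept as is, while "Kocour v botach" falls through to the primary title.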

    Then you get the following results:

    "Puss in Boots (2011).mkv" becomes "Puss in Boots"
    and
    "Kocour v botach (2011).mkv" becomes "Kocour v botách"

    You can then also decide to lose your mind and offer these types of 'options' as configurable options to the user, which is what I did and which is what led to the IMDb+ plugin. That way the above system is used only if a user has, for example, the 'Force English titles if filename matches' setting enabled. Otherwise, get your CSFD users to 'star' issue #319 so that it will be easier for you to support configurable options too, instead of having to write your own plugin (you are more than welcome to use the source code from the IMDb+ plugin project though).

    Remember that you also have to 'fix' the English title again inside the details-node, and re-use the ${movie.title} value there to verify which title to use. At that moment ${movie.title} is the same as the one you used as movie[0].title in the search-node. You then have to repeat the verification and check whether the "${movie.title}" value matches the title shown next to the 'US flag' on the details page. It should be easy to retrieve that with a regular expression, because the flag icon is a fixed anchor you can use.

    In fact you can use: <img src="[^"]+" alt="USA" />[^<]+<h3>(?<EnglishTitle>[^<]+)</h3>
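
To try that expression outside the scraper, the following Python sketch applies it to a made-up fragment. The `SAMPLE_HTML` markup is a hypothetical example, not copied from the CSFD site, and Python needs the named group written as `(?P<EnglishTitle>...)` instead of `(?<EnglishTitle>...)`.

```python
import re
from typing import Optional

# Same pattern as above, with Python's named-group syntax.
ENGLISH_TITLE_RE = re.compile(
    r'<img src="[^"]+" alt="USA" />[^<]+<h3>(?P<EnglishTitle>[^<]+)</h3>'
)

# Hypothetical fragment of a details page, for illustration only.
SAMPLE_HTML = (
    '<li><img src="/flags/usa.gif" alt="USA" />\n'
    '  <h3>Puss in Boots</h3></li>'
)


def english_title(details_html: str) -> Optional[str]:
    """Return the title shown next to the US flag, or None if absent."""
    match = ENGLISH_TITLE_RE.search(details_html)
    return match.group("EnglishTitle").strip() if match else None
```

Note that the pattern requires at least one character (for example whitespace) between the flag image and the `<h3>` tag, so it would need adjusting if the real markup has none.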

    The compounded problem, however, is that when a user 'refreshes' an existing movie, you could end up forcing an English title on them by mistake. To prevent that, verify that the actor/writer/director fields are empty before you apply the tricks to the title. At least that is how I solved the problem in IMDb+; if you figure out a better method I would be all ears :cool:
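
A minimal sketch of that guard, assuming the movie is available as a plain dictionary; the field names `actors`, `writers` and `directors` are hypothetical and only stand in for whatever the scraper exposes:

```python
def may_override_title(movie: dict) -> bool:
    """True only for a fresh import, i.e. when no crew fields are filled yet.

    On a refresh of an existing movie these fields already hold data,
    so the title is left alone.
    """
    crew_fields = ("actors", "writers", "directors")
    return not any(movie.get(field) for field in crew_fields)
```

A newly matched movie with empty crew fields passes the check, while an already imported one with actors filled in does not.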
     

    JiRo

    Hi Trottel,

    0.2.3+ seems to work well. But I think you should put the original name (the part in parentheses) into the "Alternate title" field. What is there now is just sick.

    Otherwise, again, good and fast work.

    JiRo
     

    acmetelka

    Hi guys,

    First question: what is the 0.2.3+ version?

    And second: my solution so far.

    Thanks for the hints, JiRo. I dug a little deeper into the XML, spent some hours understanding it, and made some changes for myself.

    I left the search node unchanged. In the get_details node, I changed only the section that parses the title from the CSFD movie detail page. The page contains an og:title property for Facebook, and it seems to me that it always holds "Czech title / original title" (or only the original title, if no Czech title is available). So I parsed this out and then parsed both titles from it. I added one configuration variable to the script, where I choose what I want in the title (cz, ori, firstori, firstcz).

    I don't see the AKA titles in the MP plugin, so I decided to do it this way.

    Here is the changed script part:

    Code:
    ...
    
          <!-- Retrieve details -->
    
          <set name="movie.details_url" value="${site}${movie.site_id}" />
          <retrieve name="details_page" url="${movie.details_url}" encoding="utf-8" retries="10" timeout_increment="3000" allow_unsafe_header="true" />
    
          
          <!-- Title preference: cz = Czech title only, ori = original title only, firstori = "original ( Czech )", firstcz = "Czech ( original )" -->
          <set name="pref_title" value="firstori" />
    
          <!-- Regular expressions for parsing og:title property from movie detail html page -->
     
           <set name="rx_og_title">
            <![CDATA[
            <meta property="og:title" content="(.*?)" />
            ]]>
          </set>
    
          <set name="rx_parse_og_title">
            <![CDATA[
            content="(.*?) / (.*?)\(
            ]]>
          </set>
    
           <!-- OG meta property title -->
          <parse name="og_title_all" input="${details_page}" regex="${rx_og_title}" />
          <parse name="title_main" input="${og_title_all}" regex="${rx_parse_og_title}" />
          <parse name="title_ori" input="${title_main[0][1]}" regex="(.+?)(?:, (The|A|An|Ein|El|Das|Die|Der|Les|Un|Une))?[ \t]*$" />
       
           <!-- According to the pref_title variable, set the movie title -->
           
          <if test="${pref_title}=ori">
            <if test="${title_ori[0][0]}=">
              <set name="movie.title" value="${title_main[0][0]:htmldecode}" />
            </if>
            <if test="${title_ori[0][0]}!=">
              <set name="movie.title" value="${title_ori[0][0]:htmldecode}" />
            </if>
          </if>
         <if test="${pref_title}=cz">
            <set name="movie.title" value="${title_main[0][0]:htmldecode}" />
         </if>
          <if test="${pref_title}=firstori">
            <if test="${title_ori[0][0]}=">
              <set name="movie.title" value="${title_main[0][0]:htmldecode}" />
            </if>
            <if test="${title_ori[0][0]}!=">
              <if test="${title_ori[0][0]}=${title_main[0][0]}">
                <set name="movie.title" value="${title_main[0][0]:htmldecode}" />
              </if>
              <if test="${title_ori[0][0]}!=${title_main[0][0]}">
                <set name="movie.title" value="${title_ori[0][0]:htmldecode} ( ${title_main[0][0]:htmldecode} )" />
              </if>
            </if>
          </if>
          <if test="${pref_title}=firstcz">
            <if test="${title_ori[0][0]}=">
              <set name="movie.title" value="${title_main[0][0]:htmldecode}" />
            </if>
            <if test="${title_ori[0][0]}!=">
              <if test="${title_ori[0][0]}=${title_main[0][0]}">
                <set name="movie.title" value="${title_main[0][0]:htmldecode}" />
              </if>
              <if test="${title_ori[0][0]}!=${title_main[0][0]}">
                <set name="movie.title" value="${title_main[0][0]:htmldecode} ( ${title_ori[0][0]:htmldecode} )" />
              </if>
            </if>
          </if>
    
    
          <!-- Title  ( original from Trottel, not used) -->
          <!-- 
         <parse name="titleaa" input="${details_page}" regex="&lt;h1&gt;(.+?)(?:, (The|A|An|Ein|El|Das|Der|Die|Les|Un|Une))?(?:\s&lt;span.+?&lt;/span&gt;)?.*?&lt;/h1&gt;" />
          <set name="movie.title" value="${titleaa[0][1]:htmldecode} ${titleaa[0][0]:htmldecode}" />
          <replace name="movie.title" input="${movie.title}" pattern="( \(TV film\))" with="" />
          -->
      
          <!-- Alternate Titles -->
    
    ...
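
For readers who do not know the scraper XML, the same title logic can be sketched in Python. This is an illustration under assumptions, not a translation of the script: the function names are hypothetical, the og:title layout "Czech title / original title (year)" is the one described in the post, the single-title fallback is an addition, and the step that moves a trailing article ("Title, The") is left out.

```python
import html
import re
from typing import Optional, Tuple

OG_TITLE_RE = re.compile(r'<meta property="og:title" content="(.*?)" />')
# "Czech title / original title (" as described in the post.
SPLIT_RE = re.compile(r"^(.*?) / (.*?)\s*\(")


def parse_og_title(page: str) -> Optional[Tuple[str, str]]:
    """Return (czech_title, original_title) from the og:title meta tag.

    original_title is "" when the tag holds a single title only.
    """
    found = OG_TITLE_RE.search(page)
    if not found:
        return None
    content = html.unescape(found.group(1))
    split = SPLIT_RE.match(content)
    if split:
        return split.group(1).strip(), split.group(2).strip()
    # Assumed fallback: one title, optionally followed by "(year)".
    return re.sub(r"\s*\(.*$", "", content).strip(), ""


def select_title(czech: str, original: str, pref: str = "firstori") -> str:
    """Combine both titles according to pref: cz, ori, firstori or firstcz."""
    if not original or original == czech:
        return czech
    if pref == "cz":
        return czech
    if pref == "ori":
        return original
    if pref == "firstori":
        return f"{original} ( {czech} )"
    if pref == "firstcz":
        return f"{czech} ( {original} )"
    raise ValueError(f"unknown pref_title value: {pref}")
```

With a tag whose content is `Kocour v botách / Puss in Boots (2011)`, the `firstori` setting yields `Puss in Boots ( Kocour v botách )`, matching the format the script builds.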

    The result in MP is attached.

    Metelka ( Jindrich )
     

    Attachments

    • CSFD_SCRAPER.jpg (470.1 KB)