CSFD scraper script 0.2.3 [CZ]

Trottel

Portal Member
February 18, 2009
48
26
Liberec
Home Country
Czech Republic
This is a continuation of this thread.

Czech scraper script for CSFD.cz
A script for the great MovingPictures plugin.
  • The movie name can be in the original language, English or Czech
  • Title, Aka Titles, Year, Directors, Actors, Genres, Score and Summary are retrieved from CSFD
  • Writers, Certification, Language, Tagline and Runtime are retrieved from IMDb
  • If there are two different Czech titles (separated by "/"), the second title is used as the first alternate title
  • Certification is the UK rating with a Czech description (e.g. 15 = Přístupné od 15 let)
  • Articles (The, A, An, Ein, Das, Der, Die, El, Les, Un and Une) are moved to the beginning of the original movie names
  • Not all languages retrieved from IMDb are translated into Czech
  • Writers are retrieved from IMDb, so their names lack accents
  • Tagline is retrieved from IMDb, so it is in English
  • Runtime is retrieved from IMDb; if it is not found there, the script attempts to find it on CSFD
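
To illustrate two of the title rules above, here is a minimal Python sketch. It is not taken from the scraper itself. It assumes that a title arrives with its article at the end in the form "Name, Article" (that input shape is an assumption, the list above only says the article is moved to the beginning), and the function names are hypothetical.

```python
import re

# Articles named in the feature list above.
ARTICLES = ("The", "A", "An", "Ein", "Das", "Der", "Die", "El", "Les", "Un", "Une")

# Assumed input shape: "Name, Article" (the trailing-article form is an assumption).
_TRAILING_ARTICLE = re.compile(
    r"^(?P<name>.+), (?P<article>" + "|".join(ARTICLES) + r")$"
)


def move_article_to_front(title):
    """Turn 'Name, The' into 'The Name'; leave other titles unchanged."""
    cleaned = title.strip()
    match = _TRAILING_ARTICLE.match(cleaned)
    if match is None:
        return cleaned
    return f"{match.group('article')} {match.group('name')}"


def split_czech_titles(raw):
    """Split 'First / Second' into (main title, list of alternate titles)."""
    parts = [part.strip() for part in raw.split("/") if part.strip()]
    if not parts:
        return "", []
    return parts[0], parts[1:]
```

With these definitions, `move_article_to_front("Matrix, The")` gives `"The Matrix"`, and `split_czech_titles("First title / Second title")` gives the main title plus a one-element list of alternates.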

Version: 0.2.3
  • Fixed retrieval of Directors, Actors and Summary from CSFD

Installation:
  1. Download the .xml file (attachment at bottom of this post).
  2. Open "MediaPortal Configuration", go to "Plugins", select "Moving Pictures" and click "Config".
  3. Select the "Importer Settings" tab.
  4. In the "Data Sources" section select the "Manually manage movie data sources" radio button.
  5. Click the "Movie Details Data Sources" button.
  6. In the popup click the arrow just to the right of the "+" button and pick "Add a New Data Source".
  7. Browse to the .xml scraper file you have downloaded and click OK.
  8. It should automatically update the existing "CSFD.cz" scraper to the new version.
 

Attachments

  • CSFD 0.2.3.xml
    13.1 KB

Trottel

Re: CSFD scraper script 0.1.8 [CZ]

Version: 0.1.8
  • Fixed retrieval of Title, Aka Titles and Summary from CSFD
  • The article Un is now moved to the beginning of the original movie name
 

Trottel

Re: CSFD scraper script 0.1.9 [CZ]

Added new version to the first post.

Version: 0.1.9
  • Fixed retrieval of Genres from CSFD
  • Fixed retrieval of Certification and Language from IMDb
 

JiRo

MP Donator
  • Premium Supporter
  • May 1, 2009
    184
    44
    Prague
    Home Country
    Czech Republic
    Re: CSFD scraper script 0.1.9 [CZ] - 100% success hit (558 movies)


    • First of all, Trottel, many thanks for your excellent work. But...

      When I first used your scraper script, I got about 40% successful hits. That was better than the previous version of the scraper, but still poor. A friend of mine gets 100% hits, but he uses English names for his movie files and the IMDb scraper. My goal was 100% hits with Czech names and the CSFD scraper as well :D. I started reading your script and found the first small problem:

      <set name="rx_search_results_block">
      <![CDATA[
      >v originálních názvech</td>.+</body>
      ]]>
      </set>

      The expression ">v originálních názvech" ("in original titles") causes the results listed under ">v českých názvech" ("in Czech titles") to be skipped. I therefore replaced ">v originálních názvech" with ">v českých názvech". The result was much better than before, but some Czech movies that had matched before now got no hit. I then read your script more carefully and ran a test on the CSFD web page, where I found that some Czech movies are not listed in the ">v českých názvech" section but in ">v originálních názvech" :eek:, and the Czech section is absent altogether.
      So I changed the regular expression part to:

      <set name="rx_search_results_block">
      <![CDATA[
      >v českých názvech</td>.+</body>
      ]]>
      </set>

      <set name="rx_search_results_block2">
      <![CDATA[
      >v originálních názvech</td>.+</body>
      ]]>
      </set>


      and this part of the code to:

      ...
      <parse name="search_results_block" input="${search_page}" regex="${rx_search_results_block}"/>
      <if test="${search_results_block}=">
      <parse name="search_results_block" input="${search_page}" regex="${rx_search_results_block2}"/>
      </if>

      <if test="${search_results_block}!=">
      <loop name="search_results_verified" on="search_results_block">
      ...
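
The fallback described above (try the Czech-titles section first, then the original-titles section) can also be sketched outside the scraper engine. The following Python version is only an illustration under stated assumptions: the two marker strings are quoted from the post, while the function name and the use of `re.DOTALL` (so that `.` also matches line breaks) are assumptions about how the page text is matched.

```python
import re

# Section markers quoted from the post above.
# re.DOTALL is an assumption: it lets "." span line breaks in the page source.
RX_CZECH = re.compile(r">v českých názvech</td>.+</body>", re.DOTALL)
RX_ORIGINAL = re.compile(r">v originálních názvech</td>.+</body>", re.DOTALL)


def find_results_block(search_page):
    """Return the results block, preferring the Czech-titles section.

    Falls back to the original-titles section when the Czech section is
    absent, and returns an empty string when neither marker is present.
    """
    for pattern in (RX_CZECH, RX_ORIGINAL):
        match = pattern.search(search_page)
        if match:
            return match.group(0)
    return ""
```

Because the Czech pattern runs to `</body>`, a page that has both sections yields a block containing both of them, which matches the behaviour described in the post.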

      The last change I made was to the number of movies searched, raising it from 20 to 100. A few movies have a very long search result list...

      ...
      <set name="movie[${counter}].details_url" value="${site}film/${curr_details[0]}"/>
      <subtract name="movie[${counter}].popularity" value1="100" value2="${counter}" />
      </loop>
      ...

      Now I'm satisfied. The 100% hit target is achieved! :p And your condition:

      • Movie name should be in original or English language

      can be extended to:

      • Movie name should be in Czech, original or English language

      Maybe we should find out whether there are movies listed with an English name only :eek:

      Currently I have a private 0.1.10 version of the CSFD scraper :oops:, but the official release is up to you. You are the author!

      JiRo.
     

    Trottel

    Re: CSFD scraper script 0.1.9 [CZ] - 100% success hit (558 movies)

    Thank you, JiRo.
    I'm testing your modification now and it looks good.
    I found that I still need to fix pulling data from IMDb (Writers, Language, ...). When I solve that, I will post a new version.

    The last change I made was to the number of movies searched, raising it from 20 to 100. A few movies have a very long search result list...

    ...
    <set name="movie[${counter}].details_url" value="${site}film/${curr_details[0]}"/>
    <subtract name="movie[${counter}].popularity" value1="100" value2="${counter}" />
    </loop>
    ...

    This part of the code has nothing to do with the number of movies searched. Maybe you meant:
    Code:
    <loop name="curr_details" on="movie_details" limit="20">
    But a value of 100 seems unnecessarily high to me, so I changed it to 50.
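
The distinction made here, between the loop limit and the popularity value, can be shown with a small Python sketch. It is an illustration only: the function and parameter names are hypothetical, the counter is assumed to start at 0, and the site URL in the usage note is a placeholder rather than the script's real `${site}` value.

```python
def build_candidates(detail_ids, site, limit=50, base_popularity=100):
    """Mirror the two snippets quoted above.

    `limit` caps how many search results are read at all (the loop's
    limit attribute), while `base_popularity` only ranks the results
    that were read: earlier results get a higher score.
    """
    candidates = []
    # Assumption: the counter starts at 0 for the first result.
    for counter, detail_id in enumerate(detail_ids[:limit]):
        candidates.append(
            {
                "details_url": f"{site}film/{detail_id}",
                "popularity": base_popularity - counter,
            }
        )
    return candidates
```

Calling `build_candidates(ids, "http://example.invalid/", limit=20)` on a list of 80 ids returns 20 candidates whatever `base_popularity` is, so changing 100 in the subtract step does not widen the search.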
     

    JiRo

    Re: CSFD scraper script 0.1.9 [CZ] - 100% success hit (558 movies)

    This part of the code has nothing to do with the number of movies searched. Maybe you meant:
    Code:
    <loop name="curr_details" on="movie_details" limit="20">
    But a value of 100 seems unnecessarily high to me, so I changed it to 50.

    You are right, of course. The mistake was in my description on the forum; my scraper uses the correct version of the change (much like yours). And the value? Yes, 50 should be sufficient. It's good that you will fix Writers and Language. It will be more than perfect! :D

    JiRo.
     

    Trottel

    Re: CSFD scraper script 0.1.10 [CZ]

    Added new version to the first post.

    Version: 0.1.10
    • Changed movie searching, so movie names can now be in Czech (thanks to JiRo)
    • Increased the limit of searched movies from 20 to 50
    • Fixed retrieval of Writers, Certification, Language, Tagline and Runtime from IMDb
     

    Trottel

    Re: CSFD scraper script 0.1.10 [CZ]

    Added new version to the first post.

    Version: 0.2.0
    • Repaired the whole script because of changes on the CSFD site
     

    JiRo

    Re: CSFD scraper script 0.1.10 [CZ]

    Added new version to the first post.

    Version: 0.2.0
    • Repaired the whole script because of changes on the CSFD site

    Hi Trottel,

    Thanks for your work on the 0.2.0 parser version.

    I'm testing it now. File name parsing looks good (it matches the new structure of the CSFD page), but there is still a long way to a result :(. Configurator.exe takes 100% CPU time, and "Retrieving possible matches" takes more than 1 minute per movie on average. I don't get it... :confused:

    JiRo.
     

    Trottel

    Re: CSFD scraper script 0.2.0 [CZ]

    Hi JiRo,
    first try deleting the CSFD scraper from Movie Details Data Sources on the Importer Settings tab and then adding it again.
    For me, "Retrieving possible matches" takes only 1 or 2 seconds, but "Retrieving details for:..." depends heavily on the size of the page. For the movie Wyatt Earp (approx. 36 kB) it takes about 20 seconds, for Avatar 2003 (approx. 14 kB) 5 seconds, but for Avatar 2009 (approx. 65 kB) it takes over 2 minutes. CPU usage is about 50% (Intel Core2 Duo E8200).
    I think there is something wrong with the get_details action in my script. I will try to find out where the problem is.
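
A quick calculation on the figures above shows why the page size matters so much. This Python sketch only restates the reported numbers; "over 2 minutes" is entered as its lower bound of 120 seconds, which is an assumption, so the last rate is a minimum.

```python
# (page size in kB, retrieval time in seconds) as reported in the post above.
# "Over 2 minutes" is entered as 120 s, a lower bound (assumption).
measurements = {
    "Avatar 2003": (14, 5),
    "Wyatt Earp": (36, 20),
    "Avatar 2009": (65, 120),
}


def seconds_per_kb(size_kb, seconds):
    """Return the retrieval time spent per kB of page size."""
    return seconds / size_kb


rates = {
    title: round(seconds_per_kb(size_kb, seconds), 2)
    for title, (size_kb, seconds) in measurements.items()
}

for title, rate in rates.items():
    print(f"{title}: {rate} s/kB")
```

The time per kB rises with page size instead of staying constant, so the total time grows faster than linearly. That is consistent with an expensive step in get_details, though these three data points alone do not identify the cause.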
     
