Help making Scraper

Schenk2302 · March 3, 2009

Hi,

I'm trying to make a scraper for cinefacts.de. This is my first attempt, so for now it only contains the search part.

My regex works on the search page, but I can't get the movie search itself to work. It only finds the movie if I type "zufaellig verheiratet", not "Zufällig verheiratet", and the year is always (9999). Maybe one of you could take a look and tell me what's wrong and how to set this up.

Here's the code:

Code:

<action name="search">
    
    <set name="offset" value="0" />
    
    <!-- Regular Expressions -->

    <set name="rx_search_results">
      <![CDATA[
      <a href="/kino/(?<movieID>.+)/(?<movieAKA>.+)/filmdetails.html">\s+<b title="(?<movieTitle>.+?)"\s.+\s+\D+(?<movieYear>\d{4})
      ]]>
    </set>

    <!-- Retrieve results using Title -->
    <retrieve name="search_page" url="http://www.cinefacts.de/suche/suche.php?name=${search.title:safe}" />

    <!-- if we got a details page, this is used. if not, regex does not match so we dont process the loop-->
    <parse name="details_page_block" input="${search_page}" regex="${rx_search_results}"/>
        <if test="${details_page_block[0][0]}!=">
            <loop name="item_return" on="details_page_block">
              <add name="counter" value1="${count}" value2="${offset}" />
                  <set name="movie[${counter}].title" value="${item_return[2]:htmldecode}"/>
                  <set name="movie[${counter}].alternate_titles" value="${item_return[1]:htmldecode}" />
                  <!-- tests the existence of a year before trying to put one in the movie info -->
                  <if test="${item_return[3]}!=">
                      <set name="movie[${counter}].year" value="${item_return[3]:htmldecode}"/>
                  </if>
              <set name="movie[${counter}].site_id" value="${item_return[0]}"/>
              <set name="movie[${counter}].details_url" value="http://www.cinefacts.de/kino/${item_return[0]}/${item_return[1]}/filmdetails.html"/>
                  <subtract name="movie[${counter}].popularity" value1="100" value2="${counter}"/>
            </loop>
        </if>

  </action>
  
</ScriptableScraper>

Many thanks

Schenk

LRFalk01 · March 4, 2009

Try this out:

Code:

<action name="search">
    
    <set name="offset" value="0" />
    
    <!-- Regular Expressions -->

    <set name="rx_search_results">
      <![CDATA[
      <a\shref="/kino/(?<movieID>[\d]+)[^<]+[^>]+>(?<movieTitle>[^<]+)[^\n]+\n[^O]+OT..(?<movieOT>[^<]+)[^\d]+(?<movieYear>\d{4})
      ]]>
    </set>

    <!-- Retrieve results using Title -->
    <retrieve name="search_page" url="http://www.cinefacts.de/suche/suche.php?name=${search.title:safe}" />

    <!-- if we got a details page, this is used. if not, regex does not match so we dont process the loop-->
    <parse name="details_page_block" input="${search_page}" regex="${rx_search_results}"/>
        <if test="${details_page_block[0][0]}!=">
            <loop name="item_return" on="details_page_block">
                  <add name="counter" value1="${count}" value2="${offset}" />
                  <set name="movie[${counter}].title" value="${item_return[1]:htmldecode}"/>
                  <set name="movie[${counter}].alternate_titles" value="${item_return[2]:htmldecode}" />
                  <!-- tests the existence of a year before trying to put one in the movie info -->
                  <if test="${item_return[3]}!=">
                      <set name="movie[${counter}].year" value="${item_return[3]:htmldecode}"/>
                  </if>
              <set name="movie[${counter}].site_id" value="${item_return[0]}"/>
              <set name="movie[${counter}].details_url" value="http://www.cinefacts.de/kino/${item_return[0]}/${item_return[1]}/filmdetails.html"/>
                  <subtract name="movie[${counter}].popularity" value1="100" value2="${counter}"/>
            </loop>
        </if>

  </action>
  
</ScriptableScraper>

Try not to use .+ or .* in your patterns; they are greedy and can match far more than you intend. That is at least something that I try to avoid.

-LRFalk01
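
A small sketch of why greedy wildcards cause trouble, written in Python rather than the scraper's XML so it can be run on its own. Assumptions: the two-link HTML string is made up, and Python spells named groups (?P<name>...) where the scraper's regexes use (?<name>...). With both links on one line, the .+ version returns a single match whose ID group runs all the way into the second link, while the bounded version finds both.

```python
import re

# Made-up sample: two result links on ONE line (assumption, not real site markup).
html = (
    '<a href="/kino/111/film_a/filmdetails.html">Film A</a> '
    '<a href="/kino/222/film_b/filmdetails.html">Film B</a>'
)

# Greedy: ".+" first runs to the end of the line and only then backtracks,
# so the ID group swallows everything up to the LAST link's slug.
greedy = re.compile(
    r'<a href="/kino/(?P<movieID>.+)/(?P<movieAKA>.+)/filmdetails\.html">'
)

# Bounded: digits only for the ID, "anything but a slash" for the slug.
bounded = re.compile(
    r'<a href="/kino/(?P<movieID>\d+)/(?P<movieAKA>[^/]+)/filmdetails\.html">'
)

greedy_ids = [m.group("movieID") for m in greedy.finditer(html)]
bounded_ids = [m.group("movieID") for m in bounded.finditer(html)]

print(len(greedy_ids), "greedy match(es):", greedy_ids)
print(len(bounded_ids), "bounded match(es):", bounded_ids)
```

The same idea is behind the [\d]+ and [^<]+ pieces in the regex above: each group can only consume the characters it is meant to hold.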

Schenk2302 · March 4, 2009

Hi LRFalk01,

Thank you very much for your help, you're my new hero. This is working now, but I've got one small problem left that I don't know how to fix.

Code:

    <set name="rx_search_results">
      <![CDATA[
      <a\shref="/kino/(?<movieID>[\d]+)[^<]+[^>]+>(?<movieTitle>[^<]+)[^\n]+\n[^O]+OT..(?<movieOT>[^<]+)[^\d]+(?<movieYear>\d{4})
      ]]>
    </set>

    <!-- Retrieve results using Title -->
    <retrieve name="search_page" url="http://www.cinefacts.de/suche/suche.php?name=${search.title:safe}" />

    <!-- if we got a details page, this is used. if not, regex does not match so we dont process the loop-->
    <parse name="details_page_block" input="${search_page}" regex="${rx_search_results}"/>
        <if test="${details_page_block[0][0]}!=">
            <loop name="item_return" on="details_page_block">
                  <add name="counter" value1="${count}" value2="${offset}" />
                  <set name="movie[${counter}].title" value="${item_return[1]:htmldecode}"/>
                  <set name="movie[${counter}].alternate_titles" value="${item_return[2]:htmldecode}" />
                  <!-- tests the existence of a year before trying to put one in the movie info -->
                  <if test="${item_return[3]}!=">
                      <set name="movie[${counter}].year" value="${item_return[3]:htmldecode}"/>
                  </if>
              <set name="movie[${counter}].site_id" value="${item_return[0]}"/>
              <set name="movie[${counter}].details_url" value="http://www.cinefacts.de/kino/${item_return[0]}/${item_return[1]}/filmdetails.html"/>

The ${item_return[1]} part of the details_url is filled by the regex with, for example, Zeiten des Aufruhrs, but the link needs zeiten_des_aufruhrs!

I don't know how to set this up, and I need it now for the retrieve URL in the get details <action>.

Maybe you could help me out there too.

Thanks so much again

Schenk
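
If the slug in the URL does turn out to matter, the conversion the post asks for could look roughly like this in Python. This is only an illustration of the transformation, not something the scraper engine runs. The helper name to_url_slug and the umlaut mapping (ä to ae and so on, guessed from the "zufaellig verheiratet" example earlier in the thread) are assumptions about how cinefacts.de builds its URLs.

```python
import re

# Assumed transliteration table; the real site may use a different scheme.
UMLAUTS = {"ä": "ae", "ö": "oe", "ü": "ue", "ß": "ss"}


def to_url_slug(title: str) -> str:
    """Lowercase a title, transliterate umlauts, join words with underscores."""
    lowered = title.lower()
    for char, ascii_form in UMLAUTS.items():
        lowered = lowered.replace(char, ascii_form)
    # Collapse every run of non-alphanumeric characters into one underscore.
    return re.sub(r"[^a-z0-9]+", "_", lowered).strip("_")


print(to_url_slug("Zeiten des Aufruhrs"))
print(to_url_slug("Zufällig verheiratet"))
```

An alternative that avoids the conversion entirely is to capture the slug straight from the link's href in the search regex, as the first version of the script did with its movieAKA group.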

JoeSmith · March 4, 2009

Since you guys are so good at writing scripts, is there a way to get the tt numbers of the Top 250 movies on IMDb?
It would be great if these could be grabbed and compared to the movies in your database, so MovingPictures could add a new filter like "show movies that are on the Top 250 list".

Thanks
Joe

LRFalk01 · March 4, 2009

Schenk2302,

It does not matter. The site redirects you to the correct address.

JoeSmith - Anything is possible. From the site you linked, the following regex will get you the tt number, movie title, and year from the page's source.

Code:

href="/title/tt(?<IMDBID>[\d]+)/">(?<movieTitle>[^<]+)</a>\s\((?<movieYear>\d{4})

-LRFalk01
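
As a rough illustration of Joe's idea, the regex above can be tried in Python against a couple of lines shaped the way the pattern expects. The sample markup and the owned_ids set are made up for the example, and Python needs (?P<name>...) for named groups where the original uses (?<name>...). How MovingPictures would expose such a filter is not covered here.

```python
import re

# Made-up lines in the shape the pattern expects (assumption, not a real page dump).
sample = (
    '<a href="/title/tt0111161/">The Shawshank Redemption</a> (1994)\n'
    '<a href="/title/tt0068646/">The Godfather</a> (1972)\n'
)

rx_top250 = re.compile(
    r'href="/title/tt(?P<IMDBID>\d+)/">(?P<movieTitle>[^<]+)</a>'
    r'\s\((?P<movieYear>\d{4})'
)

# One (id, title, year) tuple per list entry.
top_250 = [
    (m.group("IMDBID"), m.group("movieTitle"), int(m.group("movieYear")))
    for m in rx_top250.finditer(sample)
]

# Hypothetical set of IMDb IDs already in the local movie database.
owned_ids = {"0068646", "0133093"}
owned_top = [title for imdb_id, title, _year in top_250 if imdb_id in owned_ids]

print(top_250)
print(owned_top)
```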

Schenk2302 · March 5, 2009

Hi LRFalk01,

It did matter, but I solved it anyway. Thank you so much for your help; I really appreciate it and got my script finished yesterday. It's working quite well for my first attempt, so I'm really proud of it.

Thanks again

Schenk

Schenk2302 · March 5, 2009

Hi LRFalk01,

Sorry, but could you help me with this one:

Code:

<li class="c1">“Everything dies, baby, that’s a fact. But maybe everything that dies. Some day comes back.” (Bruce Springsteen, „Atlantic City“)<br />
<br />
Randy „The Ram“ Robinson (Mickey Rourke) ist ein Gladiator des Pop-Zeitalters. Als Wrestler (Catcher) feierten ihn fr&uuml;her die Fans in ganz Amerika. Doch der Preis dieses Ruhmes war hoch: Der Star von einst ist ein Wrack, er h&auml;lt sich mit Billigk&auml;mpfen f&uuml;r seine letzten, unverbesserlichen Anh&auml;nger &uuml;ber Wasser.</li>

As you can see, some movie summaries contain a line break, and the script currently only captures the part up to the first <br />.

If I change the regex to match the whole text, the movies without a break in them receive no summary.

How can I make the script capture the summary both with and without a break?

I hope you understand and can tell me how to fix that.

Thanks in advance

LRFalk01 · March 5, 2009

Can you please link me the site you are talking about?

-LRFalk01

Schenk2302 · March 5, 2009

LRFalk01 said:
Can you please link me the site you are talking about?

-LRFalk01

Here you go:

The Wrestler - Ruhm. Liebe. Schmerz. | Kino | Cinefacts.de

LRFalk01 · March 5, 2009

This may or may not work (it uses a .+, which I hate to use):
Kurzinhalt</h2></li>[^>]+>(?<GroupName>.+)</li>

You would then have to use a variable modifier in the scraper engine to remove the unwanted HTML tags.
Scraper Engine - Moving Pictures

Code:

<parse name="summary" input="${details_page}" regex="${rx_description}"/>
<if test="${summary[0][0]}!=">
    <!-- strip the HTML tags first, then decode the entities -->
    <set name="summary_clean" value="${summary[0][0]:striptags}" />
    <set name="movie.summary" value="${summary_clean:htmldecode}" />
</if>
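
The capture-then-clean approach can be checked outside the scraper with a short Python sketch. It is an approximation under stated assumptions, not the engine's implementation: the page snippet is invented, the tag-stripping regex and html.unescape stand in for the striptags and htmldecode modifiers, and the pattern is adapted with a lazy .+? plus re.DOTALL so the capture can cross line breaks and still stops at the first closing </li>.

```python
import html
import re

# Invented snippet modelled on the summary markup quoted earlier in the thread.
page = (
    "<li><h2>Kurzinhalt</h2></li>\n"
    '<li class="c1">Erster Absatz.<br />\n'
    "<br />\n"
    "Er h&auml;lt sich &uuml;ber Wasser.</li>"
)

# Lazy ".+?" with DOTALL: spans the <br /> line breaks, stops at the first </li>.
rx_description = re.compile(
    r"Kurzinhalt</h2></li>[^>]+>(?P<summary>.+?)</li>", re.DOTALL
)


def extract_summary(source: str) -> str:
    """Return the cleaned summary text, or an empty string if nothing matched."""
    match = rx_description.search(source)
    if match is None:
        return ""
    without_tags = re.sub(r"<[^>]+>", "", match.group("summary"))
    return html.unescape(without_tags)


summary = extract_summary(page)
print(summary)
```

Because the tags are removed after the capture, the same pattern handles summaries with and without a <br /> in them.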
