Help making Scraper

Schenk2302

Portal Pro
September 12, 2008
50
14
Bonn
Home Country
Germany
Hi,

I'm trying to make a scraper for cinefacts.de. This is my first attempt, so for now it should only contain the search part.

My regex works on the search page, but I can't get the movie search to work. It only finds a movie title if I type "zufaellig verheiratet", not "Zufällig verheiratet", and the year is always (9999). Maybe one of you could take a look and tell me what's wrong and how to set this up.

Here's the code:
Code:
<action name="search">
    
    <set name="offset" value="0" />
    
    <!-- Regular Expressions -->

    <set name="rx_search_results">
      <![CDATA[
      <a href="/kino/(?<movieID>.+)/(?<movieAKA>.+)/filmdetails.html">\s+<b title="(?<movieTitle>.+?)"\s.+\s+\D+(?<movieYear>\d{4})
      ]]>
    </set>

    <!-- Retrieve results using Title -->
    <retrieve name="search_page" url="http://www.cinefacts.de/suche/suche.php?name=${search.title:safe}" />

    <!-- if we got a details page, this is used. if not, regex does not match so we dont process the loop-->
    <parse name="details_page_block" input="${search_page}" regex="${rx_search_results}"/>
        <if test="details_page_block[0][0]!=">
            <loop name="item_return" on="details_page_block">
              <add name="counter" value1="${count}" value2="${offset}" />
                  <set name="movie[${counter}].title" value="${item_return[2]:htmldecode}"/>
                  <set name="movie[${counter}].alternate_titles" value="${item_return[1]:htmldecode}" />
                  <!-- tests the existence of a year before trying to put one in the movie info -->
                  <if test="${item_return[3]}!=">
                      <set name="movie[${counter}].year" value="${item_return[3]:htmldecode}"/>
                  </if>
              <set name="movie[${counter}].site_id" value="${item_return[0]}"/>
              <set name="movie[${counter}].details_url" value="http://www.cinefacts.de/kino/${item_return[0]}/${item_return[1]}/filmdetails.html"/>
                  <subtract name="movie[${counter}].popularity" value1="100" value2="${counter}"/>
            </loop>
        </if>

  </action>
  
</ScriptableScraper>

Many thanks

Schenk
 

LRFalk01

Portal Pro
August 27, 2007
257
92
39
Home Country
United States of America
Try this out:

Code:
<action name="search">
    
    <set name="offset" value="0" />
    
    <!-- Regular Expressions -->

    <set name="rx_search_results">
      <![CDATA[
      <a\shref="/kino/(?<movieID>[\d]+)[^<]+[^>]+>(?<movieTitle>[^<]+)[^\n]+\n[^O]+OT..(?<movieOT>[^<]+)[^\d]+(?<movieYear>\d{4})
      ]]>
    </set>

    <!-- Retrieve results using Title -->
    <retrieve name="search_page" url="http://www.cinefacts.de/suche/suche.php?name=${search.title:safe}" />

    <!-- if we got a details page, this is used. if not, regex does not match so we dont process the loop-->
    <parse name="details_page_block" input="${search_page}" regex="${rx_search_results}"/>
        <if test="${details_page_block[0][0]}!=">
            <loop name="item_return" on="details_page_block">
                  <add name="counter" value1="${count}" value2="${offset}" />
                  <set name="movie[${counter}].title" value="${item_return[1]:htmldecode}"/>
                  <set name="movie[${counter}].alternate_titles" value="${item_return[2]:htmldecode}" />
                  <!-- tests the existence of a year before trying to put one in the movie info -->
                  <if test="${item_return[3]}!=">
                      <set name="movie[${counter}].year" value="${item_return[3]:htmldecode}"/>
                  </if>
              <set name="movie[${counter}].site_id" value="${item_return[0]}"/>
              <set name="movie[${counter}].details_url" value="http://www.cinefacts.de/kino/${item_return[0]}/${item_return[1]}/filmdetails.html"/>
                  <subtract name="movie[${counter}].popularity" value1="100" value2="${counter}"/>
            </loop>
        </if>

  </action>
  
</ScriptableScraper>

Try not to use .+ or .* in your patterns. That is at least something I try to avoid.

-LRFalk01
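
A small illustration of why `.+` tends to cause trouble, written in Python rather than in the scraper's own regex engine. The markup below is made up for the example and is not taken from cinefacts.de; the point is only that a greedy `.+` can swallow several results at once, while a negated character class such as `[^<]+` stops at the next tag.

```python
import re

# Hypothetical search-result markup with two hits on one line (not from cinefacts.de).
html = '<a href="/kino/1/a.html">Erster Film</a> <a href="/kino/2/b.html">Zweiter Film</a>'

# Greedy ".+" runs to the end of the line and backtracks only to the LAST "</a>".
greedy = re.findall(r'<a href="[^"]+">(.+)</a>', html)

# "[^<]+" cannot cross a "<", so each match stops at its own closing tag.
bounded = re.findall(r'<a href="[^"]+">([^<]+)</a>', html)

print(greedy)   # one oversized match spanning both links
print(bounded)  # one clean match per link
```

The same idea is what the `[^<]+` and `[^>]+` pieces in the pattern above rely on.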
 

Schenk2302

Hi LRFalk01,

Thank you very much for your help, you're my new hero. This is working now, but I have one small problem left that I don't know how to fix.

Code:
    <set name="rx_search_results">
      <![CDATA[
      <a\shref="/kino/(?<movieID>[\d]+)[^<]+[^>]+>(?<movieTitle>[^<]+)[^\n]+\n[^O]+OT..(?<movieOT>[^<]+)[^\d]+(?<movieYear>\d{4})
      ]]>
    </set>

    <!-- Retrieve results using Title -->
    <retrieve name="search_page" url="http://www.cinefacts.de/suche/suche.php?name=${search.title:safe}" />

    <!-- if we got a details page, this is used. if not, regex does not match so we dont process the loop-->
    <parse name="details_page_block" input="${search_page}" regex="${rx_search_results}"/>
        <if test="${details_page_block[0][0]}!=">
            <loop name="item_return" on="details_page_block">
                  <add name="counter" value1="${count}" value2="${offset}" />
                  <set name="movie[${counter}].title" value="${item_return[1]:htmldecode}"/>
                  <set name="movie[${counter}].alternate_titles" value="${item_return[2]:htmldecode}" />
                  <!-- tests the existence of a year before trying to put one in the movie info -->
                  <if test="${item_return[3]}!=">
                      <set name="movie[${counter}].year" value="${item_return[3]:htmldecode}"/>
                  </if>
              <set name="movie[${counter}].site_id" value="${item_return[0]}"/>
              <set name="movie[${counter}].details_url" value="http://www.cinefacts.de/kino/${item_return[0]}/${item_return[1]}/filmdetails.html"/>

The ${item_return[1]} part of the URL is filled in by the regex with, for example, "Zeiten des Aufruhrs", but the link needs "zeiten_des_aufruhrs"!

I don't know how to set this up, and I need it now for the retrieve URL in the "get details" <action>.

Maybe you could help me out there too.

Thanks so much again

Schenk
 

JoeSmith

Portal Pro
November 17, 2007
314
44
Home Country
Germany
Since you guys are so good at writing scripts, is there a way to get the tt numbers of the top 250 movies on IMDb?
It would be great if these could be grabbed and compared with the movies in your database, so that MovingPictures could add a new filter like "show movies that are on the top 250 list".

Thanks
Joe
 

LRFalk01

Schenk2302,

It does not matter. The site redirects you to the correct address.

JoeSmith - Anything is possible. From the site you linked, the following RegEx will get you the tt#, movie title, and year from the page's source.

Code:
href="/title/tt(?<IMDBID>[\d]+)/">(?<movieTitle>[^<]+)</a>\s\((?<movieYear>\d{4})

-LRFalk01
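
A rough sketch of how that pattern could be applied and compared against a local collection, in Python for illustration. Python spells named groups `(?P<name>...)` instead of `(?<name>...)`; otherwise the pattern is unchanged. The sample markup and the `my_library` set are made-up example data, and the real page source may be laid out differently.

```python
import re

# Same pattern as above, using Python's (?P<name>...) spelling for named groups.
RX_TOP250 = re.compile(
    r'href="/title/tt(?P<IMDBID>[\d]+)/">(?P<movieTitle>[^<]+)</a>\s\((?P<movieYear>\d{4})'
)

# Hypothetical markup shaped the way the pattern expects; the real page may differ.
sample = (
    '<td><a href="/title/tt0111161/">The Shawshank Redemption</a> (1994)</td>\n'
    '<td><a href="/title/tt0068646/">The Godfather</a> (1972)</td>\n'
)

# One (id, title, year) tuple per match.
top250 = [
    (m["IMDBID"], m["movieTitle"], m["movieYear"])
    for m in RX_TOP250.finditer(sample)
]

# IDs of movies already in a local collection (made-up example data).
my_library = {"0068646", "0110912"}
owned = [row for row in top250 if row[0] in my_library]

print(top250)
print(owned)
```

Whether MovingPictures could expose such a list as a filter is a separate question that this sketch does not address.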
 

Schenk2302

Hi LRFalk01,

It did matter, but I solved it anyway. Thank you so much for your help. I really appreciate it and got my script finished yesterday. It's working quite well for a first attempt, so I'm really proud of it.

Thanks again

Schenk
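
The thread does not show which fix was used for the URL problem. For readers with the same issue, the transformation described earlier ("Zeiten des Aufruhrs" to "zeiten_des_aufruhrs") can be sketched as follows, in Python for illustration. `title_to_slug` is a hypothetical helper, and the umlaut mapping is an assumption based on the "zufaellig verheiratet" search mentioned in the first post; it has not been checked against the site's actual URLs.

```python
import re

# Assumed transliteration for German umlauts; not confirmed for cinefacts.de URLs.
UMLAUTS = {"ä": "ae", "ö": "oe", "ü": "ue", "ß": "ss"}


def title_to_slug(title: str) -> str:
    """Hypothetical helper: lowercase a title and join its words with underscores."""
    slug = title.lower()
    for char, replacement in UMLAUTS.items():
        slug = slug.replace(char, replacement)
    # Collapse every run of characters other than a-z and 0-9 into one underscore.
    slug = re.sub(r"[^a-z0-9]+", "_", slug)
    return slug.strip("_")


print(title_to_slug("Zeiten des Aufruhrs"))
print(title_to_slug("Zufällig verheiratet"))
```

Inside a scraper script the same steps would have to be expressed with whatever string operations the scraper engine offers, which this sketch does not cover.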
 

Schenk2302

Hi LRFalk01,

Sorry, but could you help me with this one:

Code:
<li class="c1">“Everything dies, baby, that’s a fact. But maybe everything that dies. Some day comes back.” (Bruce Springsteen, „Atlantic City“)<br />
<br />
Randy „The Ram“ Robinson (Mickey Rourke) ist ein Gladiator des Pop-Zeitalters. Als Wrestler (Catcher) feierten ihn fr&uuml;her die Fans in ganz Amerika. Doch der Preis dieses Ruhmes war hoch: Der Star von einst ist ein Wrack, er h&auml;lt sich mit Billigk&auml;mpfen f&uuml;r seine letzten, unverbesserlichen Anh&auml;nger &uuml;ber Wasser.</li>

As you can see, some movie summaries contain a line break, and the script currently only captures the part up to the first <br />.

If I change the regex to match the whole text, the movies with no break in them receive no summary.

How can I make the script capture summaries both with and without a break?

I hope you understand and can tell me how to fix that.

Thanks in advance
 

LRFalk01

This may or may not work (it uses a .+, which I hate to use):
Kurzinhalt</h2></li>[^>]+>(?<GroupName>.+)</li>

You would then have to use a variable modifier in the scraper engine to remove the unwanted HTML tags:
Scraper Engine - Moving Pictures

Code:
<parse name="summary" input="${details_page}" regex="${rx_description}"/>
<if test="${summary[0][0]}!=">
    <!-- strip the HTML tags first, then decode the entities -->
    <set name="summary_clean" value="${summary[0][0]:striptags}" />
    <set name="movie.summary" value="${summary_clean:htmldecode}" />
</if>
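
To see the capture, strip-tags, decode sequence in isolation, here is a minimal Python sketch. It is not the scraper engine's implementation. The sample pages are shortened and assumed, based on the markup quoted earlier, and `extract_summary` is a hypothetical name. The pattern uses a lazy `.+?` together with `re.DOTALL` so that the capture can run across the line breaks around `<br />`. In .NET the corresponding option is `RegexOptions.Singleline` (inline `(?s)`); whether the scraper engine enables it is not stated in this thread.

```python
import html
import re

# LRFalk01's pattern with Python named-group syntax, a lazy quantifier, and DOTALL.
RX_SUMMARY = re.compile(
    r'Kurzinhalt</h2></li>[^>]+>(?P<summary>.+?)</li>',
    re.DOTALL,
)


def extract_summary(page: str) -> str:
    """Hypothetical helper: capture the summary, drop tags, decode entities."""
    match = RX_SUMMARY.search(page)
    if match is None:
        return ""
    # Roughly what ":striptags" is used for above: replace each tag with a space.
    without_tags = re.sub(r"<[^>]+>", " ", match["summary"])
    # Roughly what ":htmldecode" is used for above: &uuml; becomes ü, and so on.
    decoded = html.unescape(without_tags)
    # Collapse the whitespace left behind by removed tags and line breaks.
    return " ".join(decoded.split())


# Assumed, shortened sample pages: one summary with <br /> breaks, one without.
with_break = (
    'Kurzinhalt</h2></li>\n<li class="c1">Erster Absatz.<br />\n<br />\n'
    'Als Wrestler feierten ihn fr&uuml;her die Fans.</li>'
)
without_break = 'Kurzinhalt</h2></li>\n<li class="c1">Nur ein Absatz.</li>'

print(extract_summary(with_break))
print(extract_summary(without_break))
```

Both sample pages go through the same pattern, so summaries with and without a break are handled by one rule.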
 
