Filmtipset.se - Swedish grabber

RoChess

Extension Developer
  • Premium Supporter
  • March 10, 2006
    4,338
    1,824
    Quote:
    There's a problem with certain older, commonly named movies where the search returns many results and the movie you're looking for is outside the top 10. Example: Speed. For some reason the scraper script only reads the first 10 movies, even when the search page returns many more. I don't know where this limit comes from, and even once I find out, the movie in this case isn't on the first results page at all, so the script also needs to search the other pages. The search list is sorted by year, and Speed is from 1994.
    The limit of 10 comes from the <loop>'s default values.

    For IMDb+ I raise that to 99 results to improve matching, which is probably the maximum value the scraper engine supports.

    Modify your line #81 to:

    Code:
    <loop name="search_results_verified" on="search_results_block" limit="99">
    And that should fix it.
     

    vuego

    Documentation Group
  • Team MediaPortal
  • August 5, 2006
    1,584
    744
    Göteborg
    Sweden
    • Thread starter
    • Moderator
    • #72
    Hey RoChess

    I'm able to get all 25 results by adding the limit value (it was actually on the second loop, at line 83). However, I'm now struggling with retrieving multiple pages.

    The script is currently retrieving for example
    HTTP:
    https://www.filmtipset.se/hitta?q=speed
    but the movie is found on page
    HTTP:
    https://www.filmtipset.se/hitta?q=speed&p=1
    I know how to use a loop to find regex matches on a single web page, but I'm not sure it can be used to retrieve several different web pages.
    I also experimented with retrieving both pages and adding them together, but that seems to work only for integers.
    I've tried retrieving the second page into the same name, but the first page gets overwritten instead of merged.

    I'm not sure I need to loop through every page since it would likely stall when searching for "M" for example ;) We would probably get very far by just searching the first 3-5 pages or so I guess.

    Any ideas?
     

    RoChess

    In that case you have to make multiple <retrieve> statements, so you should put the <retrieve> inside the loop and break out of the loop once you get a match. A simple <if> statement lets you check whether a value was found. I forgot the code to break out of a loop, but you can also just let it finish regardless.

    Another option is to duplicate your logic multiple times manually and wrap it with an <if> statement to check if you got any results back from the previous regular expression parsing.

    IMDb+ is full of those types of statements, to account for all the different configuration options that can be adjusted.

    The HTML source probably contains a reference along the lines of "find more results on Page #2", which should give you that https://....&p=2 link. You can make that link part of a regexp capture group and only act on it when it is present. That way you navigate through two pages if there are only two pages' worth of results, but go through five if there are five pages.
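
    To illustrate the capture-group idea outside the scraper engine, here is a minimal Python sketch. It is not scraper XML, and it assumes the pager links look like href="/hitta?q=speed&amp;p=2" with p as the last parameter; I have not checked the real Filmtipset markup, so NEXT_PAGE_RE and next_page_url are hypothetical names and the pattern would need adjusting.

    ```python
    import re
    from html import unescape

    # Assumed pager markup: href="...?q=speed&amp;p=2" (p is the last parameter).
    # Group 1 is the whole link, group 2 is the page number.
    NEXT_PAGE_RE = re.compile(r'href="([^"]*[?&](?:amp;)?p=(\d+))"')


    def next_page_url(html, current_page):
        """Return the link to the closest page after current_page, or None."""
        candidates = []
        for href, page in NEXT_PAGE_RE.findall(html):
            if int(page) > current_page:
                # unescape turns &amp; back into & so the link can be requested.
                candidates.append((int(page), unescape(href)))
        if not candidates:
            return None  # no further pages referenced, so stop here
        return min(candidates)[1]
    ```

    A None result plays the role of the failed <if> test: when no next-page link is captured, there is nothing more to retrieve.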

    There is no if-then-else, but you can do an <if> + <!if> to achieve the same result, or take things one step further the way I do with IMDb+ by relying on boolean temporary values to remember a state to proceed with.
     

    vuego

    • #74
    I'm still struggling with my loops.
    I'm not sure how to put the <retrieve> inside the <loop> since the loop runs on a regex found only on the first page. When looping through all movies on the first page I don't need to retrieve a new page each time.

    I think I'll try duplicating the logic a couple of times instead. It's a good idea to limit the number of pages searched anyway since there might be thousands of pages or so I guess :)

    I will also have a look at the IMDb+ source to see if there's something I might copy.
     

    RoChess

    What you could do is this:

    Code:
    <retrieve>
    <store what you need in a variable, or process first page completely>
    <if first-page-contains-a-reference-to-next-page-via-regex-match == true>
        <retrieve second page>
        <add what you need to the same variable, or process second page completely>
    </if>
    <if second-page-contains-a-reference-to-third-page-via-regex-match == true>
        <retrieve third page>
        <add what you need to the same variable, or process third page completely>
    </if>
    <process variable if you did not process each page already>
    Processing 3 pages should be enough, I would think, based on your example of a movie's proper result not being found until page 3.
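
    The same flow, written as a rough Python sketch rather than scraper XML so the logic is easy to follow. The names collect_results, fetch_page, parse_results and is_match are all hypothetical stand-ins, and page 0 is assumed to be the first results page. One deliberate difference from the outline above: instead of testing for a next-page link, this version simply stops when a page comes back empty, when a match turns up, or when the page cap is reached.

    ```python
    def collect_results(fetch_page, parse_results, is_match, max_pages=3):
        """Merge results page by page, stopping at a match or at the page cap."""
        merged = []
        for page in range(max_pages):
            html = fetch_page(page)  # stand-in for one <retrieve> per page
            if html is None:
                break  # page could not be retrieved
            results = parse_results(html)
            if not results:
                break  # empty page: assume there is nothing further
            merged.extend(results)  # append instead of overwriting earlier pages
            if any(is_match(r) for r in results):
                break  # wanted movie found, skip the remaining pages
        return merged


    # Usage with canned data in place of real HTTP requests and regex parsing.
    PAGES = {
        0: ["Speed 2 (1997)", "Speed Racer (2008)"],
        1: ["Speed (1994)"],
        2: ["Speedy (1928)"],
    }
    found = collect_results(PAGES.get, lambda page: page,
                            lambda r: r == "Speed (1994)")
    ```

    The cap keeps a search for something like "M" from walking through every page, and the early exit avoids extra requests once the right title shows up.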

    You can also look again at whether you can tweak your search query, maybe bringing in the year to narrow down the results. Instead of scraping 3 pages, you would adjust your search Kung-Fu so that the proper movie shows up on the first page, and you never have to scrape more than one page :rolleyes:
     

    vuego

    • #76
    Yea, I wish there would be a better search function. I've tried adding year and IMDb number but it doesn't work.

    I just finished a version grabbing 5 pages. I will try it myself for a couple of days before making a pull request to get it included in the next version of Moving Pictures.
     
