Scraper setting retrieve variable to empty

Merlyn · December 16, 2012

I've got some issues with the <retrieve /> node in my scraper. Hopefully someone can help...

So, the code I have is this:

Code:

            <!-- OFDB Details -->
            <retrieve name="ofdb_details_page" url="http://www.ofdb.de/film/${movie.ofdb_id},${ofdb_movie_url[0][1]}"/>
            <set name="rx_TitelDE">
                <![CDATA[
                    (?:<title>OFDb\s-\s)(?<Titel>.*?)\s\((?<Jahr>\d{4})\)(?=</title>)
                ]]>
            </set>

and it results in

Code:

16-Dec-2012 15:44:55 Debug [        ScraperNode]: executing retrieve: <retrieve name="ofdb_details_page" url="http://www.ofdb.de/film/${movie.ofdb_id},${ofdb_movie_url[0][1]}" />
16-Dec-2012 15:44:55 Debug [        ScraperNode]: Retrieving URL: http://www.ofdb.de/film/1067,Beverly-Hills-Cop-II
16-Dec-2012 15:44:55 Debug [          WebGrabber]: GetResponse: URL=http://www.ofdb.de/film/1067,Beverly-Hills-Cop-II, UserAgent=Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/537.11 (KHTML, like Gecko) Chrome/23.0.1271.95 Safari/537.11, CookieHeader=ofdb_theme=0; ofdb_ret=view.php%253Fpage%253Dstart, Accept=text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8
16-Dec-2012 15:44:55 Debug [        ScraperNode]: Assigned variable: urn://scraper/header/www.ofdb.de = ofdb_theme=0; ofdb_ret=view.php%253Fpage%253Dstart
16-Dec-2012 15:44:55 Debug [          WebGrabber]: GetString: Encoding=
16-Dec-2012 15:44:55 Debug [        ScraperNode]: Assigned variable: ofdb_details_page =
16-Dec-2012 15:44:55 Debug [        ScraperNode]: Assigned variable: rx_TitelDE = (?:<title>OFDb\s-\s)(?<Titel>.*?)\s\((?<Jahr>\d{4})\)(?=</title>)

So the problem is obviously the "Assigned variable: ofdb_details_page =" line: the variable is set to empty right after the retrieve is done.
That should not happen, and I cannot figure out where it is coming from.
The URL I am trying to retrieve exists, and I can copy it and open it in IE or Firefox without any problem. Does anyone have any idea what might be the cause?
No other retrieve in the scraper does this.
Can anyone help? @fforde or @RoChess maybe?
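
One way to tell an empty retrieve apart from a regex that simply does not match is to run the same pattern against a saved copy of the page outside the scraper. The following is only a rough Python sketch: the sample HTML string is made up for illustration, `parse_title` is a hypothetical helper, and Python spells named groups as `(?P<name>...)` where the scraper's regex uses `(?<name>...)`.

```python
import re

# Same pattern as rx_TitelDE, with named groups rewritten in Python syntax.
TITLE_RX = re.compile(
    r"(?:<title>OFDb\s-\s)(?P<Titel>.*?)\s\((?P<Jahr>\d{4})\)(?=</title>)"
)

def parse_title(page: str):
    """Return (title, year) from an OFDb page, or None if nothing matches."""
    match = TITLE_RX.search(page)
    if match is None:
        return None
    return match.group("Titel"), match.group("Jahr")

# Illustrative input only; not the real page source.
sample = "<html><head><title>OFDb - Beverly Hills Cop II (1987)</title></head></html>"

print(parse_title(sample))  # pattern matches a well-formed title
print(parse_title(""))      # an empty page body can never match
```

If the pattern matches the saved page but the log still shows an empty `ofdb_details_page`, the regex is not the culprit and the retrieve itself is returning nothing.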

Merlyn · December 16, 2012

I added the option "allow-unsafe-headers" to the retrieve, and that seems to have fixed the problem.

RoChess · December 16, 2012

I'll be buggered, that fixed an issue I had with Icelandic language myself in IMDb+.

It helps having a fresh pair of eyes on an issue sometimes, as for some reason it did not dawn on me at all that there might be new MovPic scraper-script attributes that I should look at using.

Merlyn · December 16, 2012

Haha, glad it helps you.
Unfortunately it was not the solution for me after all, because it stopped working again.
Reducing the number of active threads seems to help, though there are still some movies where I can't get ofdb.de to provide the details page.
I'm working on alternatives now...

RoChess · December 16, 2012

When I manually loaded the link you provided, it loaded extremely slowly.

Perhaps you need to extend the timeout values?

Merlyn · December 16, 2012

Dunno, it's loading very fast for me. I'll give it another try next week. I've been playing around with that all day...

RoChess · December 16, 2012

The unsafe header turned out not to be the fix for me. Confusing, because the actual fix (site owner changing their encoding settings) was done at the same time I decided to try your fix. I'm still happy, because I only care for the end result.

I've been working on all these scraper issues the entire week. Fixing regexp is easy, but encoding has turned out to be the real pain.

There are just so many factors involved: the HTTP header has to be correct, the HTML metadata has to be right, the HTML content has to match, etc. On top of everything, browsers tend to auto-correct problems as well, which makes debugging even harder because the page can display fine in there.

Scraper-script logging does not always help either, because you cannot be 100% sure it uses the same encoding: the log file could be written in a different encoding, with auto-conversion taking place. It's probably right and uses UTF-8, but when the log results do not match what you see in Notepad from the source, confusion sets in.

Notepad++ offers some options to see better what is happening and to adjust things on the fly, both on the display side and on the file side.

That leaves you with verifying that the HTTP header matches the encoding. For this, the F12 network tools in your browser or Fiddler2 should offer insight.

So be wary when you see UTF-8 in the HTTP header and in the HTML metadata, but the actual content is in ISO-8859-1. That's one of the issues I was dealing with. It works fine in MovPic v1.2.x, but in v1.4.x the same thing failed for me. Overriding the encoding="..." part on <retrieve> did not fix it, which is where the site owner eventually came into play. Perhaps overriding the encoding will work for you, if that is even part of your problem.
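
The mismatch described above (UTF-8 declared in the header and metadata, ISO-8859-1 in the actual bytes) can be checked outside the scraper once you have the raw response body and its Content-Type header, for example from Fiddler2. This is a minimal Python sketch, not anything MovPic does internally; the function names and the sample bytes are hypothetical.

```python
import re

def declared_charsets(content_type_header, body):
    """Return (header_charset, meta_charset), lowercased, or None when absent."""
    header_match = re.search(r"charset=([\w-]+)", content_type_header or "", re.I)
    # Covers both <meta charset="..."> and the http-equiv Content-Type form.
    meta_match = re.search(rb"<meta[^>]+charset=[\"']?([\w-]+)", body, re.I)
    header_cs = header_match.group(1).lower() if header_match else None
    meta_cs = meta_match.group(1).decode("ascii").lower() if meta_match else None
    return header_cs, meta_cs

def decodes_as(body, charset):
    """True if the raw bytes decode strictly (no replacement) as charset."""
    try:
        body.decode(charset)
        return True
    except (UnicodeDecodeError, LookupError):
        return False

def check_page(content_type_header, body):
    """Summarize what the page claims versus what the bytes actually are."""
    header_cs, meta_cs = declared_charsets(content_type_header, body)
    return {
        "header": header_cs,
        "meta": meta_cs,
        "valid_utf8": decodes_as(body, "utf-8"),
    }

# Made-up example: the page claims UTF-8 but the bytes are ISO-8859-1.
body = '<meta charset="utf-8"><title>Käse</title>'.encode("iso-8859-1")
print(check_page("text/html; charset=UTF-8", body))
```

When both declarations say UTF-8 but `valid_utf8` comes back false, the content is almost certainly in a different encoding than advertised. The reverse check is weaker, since nearly any byte sequence decodes as ISO-8859-1 without error.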
