Scraper, cookies, cache

piernik

Portal Pro
October 22, 2008
141
26
I maintain the scraper for the Polish site filmweb.pl.
I have one problem: filmweb.pl uses two different ad cookies.
Can I set two cookies in a single page fetch?

If not, I can fetch the page twice, but the second fetch is served from the cache. I don't want the cached copy: the second request uses a different cookie to skip the ads, so I expect different page source.
How can I do this?
 

fforde

Community Plugin Dev
June 7, 2007
2,667
1,702
43
Texas
Home Country
United States of America
Cookies for the current session are stored in a variable named like "urn://scraper/header/www.imdb.com"; the last part varies with the domain. I have never tried this, and the system was built to always keep cookies for the current session, but you could probably prevent that by manually clearing the variable. Something like:

Code:
<set name="urn://scraper/header/www.imdb.com"></set>

Let me know if this works for you. If it does not, it would probably require a code enhancement (which I could do for you).
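
Applied to the filmweb.pl case, the suggestion above might look like the sketch below. It is untested, like the original suggestion. It assumes the variable name ends in the same host as the request URL (www.filmweb.pl) and that an empty set clears the stored cookies; filmweb_url is assumed to be defined earlier in the script.

```xml
<!-- Assumption: the session cookies for this host live in a variable named after the host -->
<set name="urn://scraper/header/www.filmweb.pl"></set>

<!-- Retry the request with the other cookie after clearing the stored ones -->
<retrieve name="search_page" url="${filmweb_url}" allow_unsafe_header="true" cookies="welcomeScreenNew=welcome_screen"/>
```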
 

piernik

Portal Pro
October 22, 2008
141
26
I don't know whether we understood each other.

Here is what I've got:
Code:
<set name="filmweb_url" value="http://www.filmweb.pl/search/film?q=${search.title:safe(utf-8)}" />
<retrieve name="search_page" url="${filmweb_url}" allow_unsafe_header="true" cookies="welcomeScreen=welcome_screen"/>

<!-- if the ad page came back, try again with the other cookie -->
<parse name="check" input="${search_page}" regex="${rx_ad_check}" />
<if test='${check[0][0]}!='>
	<retrieve name="search_page" url="${filmweb_url}" allow_unsafe_header="true" cookies="welcomeScreenNew=welcome_screen"/>
	<!-- this returns the same page source, because the cached result of the previous retrieve is used -->
</if>

Is something like this possible? (Right now it gives me an error.)

Code:
<retrieve name="search_page" url="${filmweb_url}" allow_unsafe_header="true"  cookies="welcomeScreen=welcome_screen&welcomeScreenNew=welcome_screen"/>

If not, how can I force the scraper not to use the cached page?
 

RoChess

Extension Developer
  • Premium Supporter
  • March 10, 2006
    4,434
    1,897
    It's still XML syntax, so you cannot use the & character like that; you would have to write &amp; instead (CDATA escaping only works in element content, not inside an attribute value like this one).
     

    fforde

    Community Plugin Dev
    June 7, 2007
    2,667
    1,702
    43
    Texas
    Home Country
    United States of America
    I see. Add this to your retrieve call: use_caching="false"

    Also, RoChess is right: you will need to escape that ampersand. Change it to "&amp;" and you should be good.
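
Putting the two suggestions from this thread together, a minimal sketch could look like the following. It reuses the names from the script posted earlier (filmweb_url, search_page, rx_ad_check). It has not been tested against filmweb.pl, and whether the site accepts both cookies in one request is an assumption.

```xml
<set name="filmweb_url" value="http://www.filmweb.pl/search/film?q=${search.title:safe(utf-8)}" />

<!-- Option 1: send both cookies in one request; the ampersand is written as &amp; inside the attribute -->
<retrieve name="search_page" url="${filmweb_url}" allow_unsafe_header="true" cookies="welcomeScreen=welcome_screen&amp;welcomeScreenNew=welcome_screen"/>

<!-- Option 2: keep two requests, but make the second one bypass the cache -->
<parse name="check" input="${search_page}" regex="${rx_ad_check}" />
<if test='${check[0][0]}!='>
	<retrieve name="search_page" url="${filmweb_url}" allow_unsafe_header="true" use_caching="false" cookies="welcomeScreenNew=welcome_screen"/>
</if>
```

If Option 1 already returns the page without ads, the check in Option 2 simply never triggers, so the two can stay in the same script.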
     
