Scraper request - www.csfd.cz [CZ]

Trottel

Portal Member
February 18, 2009
48
26
Liberec
Home Country
Czech Republic Czech Republic
For now, try the attached v0.1.4

I tried your modified script and it doesn't work properly. There is a problem with the charset (see attached image). I was testing it on the movie Shutter Island (Prokletý ostrov).
I modified your modification :) I don't know if it is the correct way, but I tried many combinations of the file's charset and the charset in the retrieve tags, and this is the only combination that works for me.
 

Attachments

  • si001.png
    164.8 KB
  • CSFD 0.1.5.xml
    30.6 KB

RoChess

Extension Developer
  • Premium Supporter
  • March 10, 2006
    4,434
    1,897
    Weird, the default encoding is UTF-8, so I took those encoding statements out; I guess they are needed after all. I will check with the developers.

    Thanks for fixing :)

    PS: You left IRC too quickly for me to respond; we just got off track on an SSD discussion.
     

    no.diggity

    MP Donator
  • Premium Supporter
  • March 26, 2009
    21
    0
    Home Country
    Czech Republic Czech Republic
    I tried version 0.1.5, same results - not working, gets info from imdb.com. My computer must have been cursed or something :(.

    Before StreamedMP I used the plugin on its own with RC1, and as far as I can remember it worked fine...

    I'll try to get on the IRC channel.
     

    RoChess

    I tried version 0.1.5, same results - not working, gets info from imdb.com. My computer must have been cursed or something :(.

    Open CSFD.CZ link used by scraper manually in your browser please.

    Do you get any errors then?

    You might have to use Firefox with the User Agent Switcher add-on and set the userAgent value to "Mozilla/5.0 (Windows; U; MSIE 7.0; Windows NT 6.0; en-US)" for a proper test, because IMPAwards, for example, is now blocking that userAgent (a patch is being worked on).
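
For a test outside the browser, the same userAgent can also be sent from a short script. This is only a sketch: Python's standard `urllib` is my choice for illustration and says nothing about how MovingPictures makes its requests, and `build_request` and `fetch_length` are hypothetical helper names.

```python
import urllib.request

# userAgent string quoted in the post above
SCRAPER_USER_AGENT = "Mozilla/5.0 (Windows; U; MSIE 7.0; Windows NT 6.0; en-US)"


def build_request(url, user_agent=SCRAPER_USER_AGENT):
    """Return a urllib Request that carries the given User-Agent header."""
    return urllib.request.Request(url, headers={"User-Agent": user_agent})


def fetch_length(url, timeout=10.0):
    """Fetch the URL with the scraper userAgent; return (status, body length)."""
    with urllib.request.urlopen(build_request(url), timeout=timeout) as resp:
        return resp.status, len(resp.read())
```

Calling `fetch_length("http://www.csfd.cz/")` and comparing the body length with what the browser shows would be one way to see whether the userAgent alone changes the result.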

    Maybe you have some software running that causes the wrong results to be fed back. The only way to find that out would be to install Fiddler2 and inspect all the HTTP traffic that is going on.

    Fiddler Web Debugger - A free web debugging tool
     

    Trottel

    I tried version 0.1.5, same results - not working, gets info from imdb.com.
    Is all the info from IMDb? Even Genres, Certification and Languages? And what is listed as the Primary source?

    Today I have some problems with my script too. It looks like they made some changes to the CSFD.cz pages, so my script retrieves only the original title for the 'Title' field and doesn't retrieve the Czech title. This shouldn't be difficult to fix, but I have a problem with the summary.
    The script doesn't retrieve the summary for most movies (when the Czech title is different from the original title). Could it be caused by redirection?
    Open CSFD.CZ link used by scraper manually in your browser please.
    In this case the search result page links to the movie at www.csfd.cz/film/11970-bronx-tale-a/, so the script creates movie.site_id with the value "11970-bronx-tale-a/". But when you click on this link you are redirected to the page www.csfd.cz/film/11970-pribeh-z-bronxu/ and the summary isn't retrieved.

    EDIT: Sorry, the summary is retrieved for this movie, but it isn't for some other movies, e.g. Resident Evil: Apocalypse, Robin Hood: Prince of Thieves, ...
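
To make the redirection question easier to check, here is a small sketch that splits a movie link into the site_id and its leading number. It is written in Python rather than in the scraper's XML, `split_site_id` is a hypothetical helper, and the idea that the leading number stays the same across the redirect is an assumption based only on the two URLs quoted above.

```python
import re

# Matches the part after /film/ in a CSFD movie link: "<number>-<slug>/"
SITE_ID_RE = re.compile(r"/film/((\d+)-[^/]+/?)")


def split_site_id(url):
    """Return (site_id, numeric_id) from a CSFD movie URL, or None if no match."""
    match = SITE_ID_RE.search(url)
    if match is None:
        return None
    return match.group(1), int(match.group(2))


search_link = "www.csfd.cz/film/11970-bronx-tale-a/"
redirected = "www.csfd.cz/film/11970-pribeh-z-bronxu/"

print(split_site_id(search_link))  # ('11970-bronx-tale-a/', 11970)
print(split_site_id(redirected))   # ('11970-pribeh-z-bronxu/', 11970)
```

If the slug part is what breaks the summary lookup, comparing only the numeric part of the two links would show whether both still point at the same movie.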
     

    no.diggity

    The address you posted for the test works OK, meaning I get a list of search results for "Bronx Tale", and at the top of the list is "Bronx Tale, A (1993)".

    I use the Opera browser and have never had any issues with the CSFD database.

    I am not aware of any new software that could affect the traffic, but if you think it would be helpful, I can certainly provide a log from Fiddler.

    To Trottel: yes, all the info comes from imdb.com. The CSFD scraper is in the top position, and I tried both options: automatic retrieval of movie data (set to the Czech language) and manually managed movie data sources (CSFD at the top).
     

    RoChess

    Your log file shows that "nothing" gets obtained by the actual scraper node, so everything else in the scraper script fails.

    So this is a problem in your MovingPictures plugin, your MediaPortal setup, your OS or something else on your system, because it works for Trottel and other users. It is therefore not a userAgent problem like the one with IMPAwards.

    Fiddler2 would allow you to see why you are not getting the HTML source code from the CSFD.CZ website. Does it perhaps take an extremely long time for that link I gave you to show up in Opera? MovingPictures eventually gives up after 5 seconds, so if it takes longer for you, that would explain things.
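
A rough way to measure the response time without a browser is sketched below. It is an illustration only: the 5 second limit is the figure mentioned above, `timed_fetch` is a hypothetical helper, and Python's `urllib` will not behave exactly like the request MovingPictures sends.

```python
import time
import urllib.request


def timed_fetch(url, timeout=5.0, opener=urllib.request.urlopen):
    """Return (elapsed_seconds, body_length); body_length is None on timeout or error."""
    start = time.monotonic()
    try:
        with opener(url, timeout=timeout) as resp:
            length = len(resp.read())
    except OSError:
        # URLError and socket timeouts are both subclasses of OSError
        length = None
    return time.monotonic() - start, length
```

Running `timed_fetch` on the search link a few times would show whether the page ever comes close to the limit; a length of `None` means the request failed or timed out.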

    Run Fiddler2 in the background and capture everything, save the capture as a log file in Fiddler2 and upload it to the drop.io website. That will show me exactly where your problem lies (router, OS, ISP, etc.).
     

    RoChess

    Here is the address for the Fiddler2 log; I hope I did it right:
    drop.io ubs1pe7

    I also uploaded a Fiddler2 log captured while using Opera and searching on Česko-Slovenská filmová databáze - CSFD.cz for the title "A Bronx Tale".
    drop.io 18mhsdh

    As for your question, CSFD page pops up immediately, no delays at all.

    Thanks again for your time.

    I'm actually stumped on this one.

    To the MovingPictures request, csfd.cz gives the following reply:

    [collapse]Request:

    Code:
    GET /hledani-filmu-hercu-reziseru-ve-filmove-databazi/?search=a%20bronx%20tale HTTP/1.1
    Accept: text/xml,application/xml,application/xhtml+xml,text/html;q=0.9,text/plain;q=0.8,image/png,*/*;q=0.5
    User-Agent: Mozilla/5.0 (Windows; U; MSIE 7.0; Windows NT 6.0; en-US)
    Host: www.csfd.cz
    Connection: Keep-Alive

    Response:

    Code:
    HTTP/1.1 200 OK
    Date: Sat, 24 Apr 2010 05:33:27 GMT
    Server: Apache
    X-Powered-By: PHP/4.4.4-8+etch6
    Content-Length: 0
    Connection: close
    Content-Type: text/html
    [/collapse]

    Which means a blank page. But with Opera you get:

    [collapse]
    Request:

    Code:
    GET /hledani-filmu-hercu-reziseru-ve-filmove-databazi/?search=a+bronx+tale HTTP/1.0
    User-Agent: Opera/9.80 (Windows NT 6.0; U; en-GB) Presto/2.5.22 Version/10.51
    Host: www.csfd.cz
    Accept: text/html, application/xml;q=0.9, application/xhtml+xml, image/png, image/jpeg, image/gif, image/x-xbitmap, */*;q=0.1
    Accept-Language: cs-CZ,cs;q=0.9,en;q=0.8
    Accept-Charset: iso-8859-1, utf-8, utf-16, *;q=0.1
    Accept-Encoding: deflate, gzip, x-gzip, identity, *;q=0
    Referer: http://www.csfd.cz/
    Cookie: nugg_participated=1; __utmz=1.1264939893.231.17.utmcsr=lopuch.cz|utmccn=(referral)|utmcmd=referral|utmcct=/home.php; __utma=1.27574078.1258749025.1264975235.1265042228.235; uid=ff35bbd63c; __gemius_fp=1256158464481_485547315; __utmz=215963397.1270665006.367.36.utmcsr=google|utmccn=(organic)|utmcmd=organic|utmctr=angel%20heart; __utma=215963397.911869909257995900.1242841366.1272085792.1272088486.406; __utmc=215963397; __utmb=215963397.1.10.1272088486
    Cookie2: $Version=1
    Connection: Keep-Alive

    Response:

    Code:
    HTTP/1.1 200 OK
    Date: Sat, 24 Apr 2010 05:54:56 GMT
    Server: Apache
    X-Powered-By: PHP/4.4.4-8+etch6
    Connection: close
    Content-Type: text/html; charset="utf-8"
    
    <!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">
    (followed by the rest of the HTML source code)
    [/collapse]
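
To see at a glance how the two captured requests differ, the header names can be compared with a few lines of script. This is just a sketch in Python; `header_names` is a hypothetical helper, the Cookie value is shortened to a placeholder, and the result does not say which of the differences (if any) makes the server return an empty body.

```python
def header_names(raw_request):
    """Return the set of lower-cased header names in a raw HTTP request."""
    names = set()
    for line in raw_request.strip().splitlines()[1:]:  # skip the request line
        name, sep, _ = line.partition(":")
        if sep:
            names.add(name.strip().lower())
    return names


moving_pictures = """\
GET /hledani-filmu-hercu-reziseru-ve-filmove-databazi/?search=a%20bronx%20tale HTTP/1.1
Accept: text/xml,application/xml,application/xhtml+xml,text/html;q=0.9,text/plain;q=0.8,image/png,*/*;q=0.5
User-Agent: Mozilla/5.0 (Windows; U; MSIE 7.0; Windows NT 6.0; en-US)
Host: www.csfd.cz
Connection: Keep-Alive
"""

opera = """\
GET /hledani-filmu-hercu-reziseru-ve-filmove-databazi/?search=a+bronx+tale HTTP/1.0
User-Agent: Opera/9.80 (Windows NT 6.0; U; en-GB) Presto/2.5.22 Version/10.51
Host: www.csfd.cz
Accept: text/html, application/xml;q=0.9, application/xhtml+xml, image/png, image/jpeg, image/gif, image/x-xbitmap, */*;q=0.1
Accept-Language: cs-CZ,cs;q=0.9,en;q=0.8
Accept-Charset: iso-8859-1, utf-8, utf-16, *;q=0.1
Accept-Encoding: deflate, gzip, x-gzip, identity, *;q=0
Referer: http://www.csfd.cz/
Cookie: (value omitted here)
Cookie2: $Version=1
Connection: Keep-Alive
"""

only_in_opera = sorted(header_names(opera) - header_names(moving_pictures))
print(only_in_opera)
# ['accept-charset', 'accept-encoding', 'accept-language', 'cookie', 'cookie2', 'referer']
```

Besides the extra headers, the request lines also differ: `%20` versus `+` in the search string, and HTTP/1.1 versus HTTP/1.0.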

    Now normally I would blame it on userAgent and charset/encoding problems, but that doesn't explain why the same plugin+scraper combination works for a lot of other people.

    So it has to be something in your setup, perhaps a setting, a plugin, or a corrupted file in MediaPortal and/or MovingPictures causing this.

    I recommend you rename the MediaPortal program and data folders to "MediaPortal.old" and then reinstall it fresh. Don't add anything extra except MovingPictures and the scraper, and configure the import path to a folder containing 2 movies. See if it works then. If it does, you'll have your work cut out to debug which part is broken in your normal install.
     

    RoChess

    On a better note, I finished fixing all the RegExp code that needed changing to counteract the changes CSFD.CZ made to their website, so everything should work again for Title/Summary/etc.

    I even took the liberty of cleaning up a few minor issues to make things display better, such as trimming spaces on some values.

    Enjoy.
     

    Attachments

    • CSFD 0.1.7.xml
      30.6 KB
