Scraper request - www.csfd.cz [CZ]

Trottel

Portal Member
February 18, 2009
48
26
Liberec
Home Country
Czech Republic
[quote]For now, try the attached v0.1.4[/quote]

I tried your modified script and it doesn't work properly. There is a problem with the charset (see attached image). I was testing it on the movie Shutter Island (Prokletý ostrov).
I modified your modification :) I don't know if it is the correct way, but I tried many combinations of the file's charset and the charset in the retrieve tags, and this is the only combination that works for me.
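
For readers who cannot see the screenshot, here is a minimal sketch of the kind of garbling a charset mismatch produces. It is only an illustration, not a diagnosis: it assumes the page bytes are UTF-8 and the reading side treats them as Windows-1250 (cp1250), and that choice is a guess, not something confirmed in this thread.

```python
# Hypothetical illustration: the movie title from the post, encoded as UTF-8
# and then wrongly decoded as Windows-1250 (an assumed, not confirmed, charset).
title = "Prokletý ostrov"

utf8_bytes = title.encode("utf-8")
garbled = utf8_bytes.decode("cp1250")                 # wrong charset on the reading side
repaired = garbled.encode("cp1250").decode("utf-8")   # reverse the same mistake

# ascii() is used so the output prints safely on any console encoding.
print(ascii(garbled))
print(repaired == title)
```

The two-byte UTF-8 sequence for "ý" is read as two separate characters, which is why one accented letter turns into two odd symbols. Declaring the same charset on both sides avoids this.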
 

Attachments

  • si001.png
    164.8 KB
  • CSFD 0.1.5.xml
    30.6 KB

RoChess

Extension Developer
  • Premium Supporter
  • March 10, 2006
    4,434
    1,897
    Weird. The default encoding is UTF-8, so I took those encoding statements out, but I guess they are needed after all. I will check with the developers.

    Thanks for fixing :)

    PS: You left IRC too quickly for me to respond; we had just gotten off-track on an SSD discussion.
     

    no.diggity

    MP Donator
  • Premium Supporter
  • March 26, 2009
    21
    0
    Home Country
    Czech Republic
    I tried version 0.1.5, same results: it is not working and gets its info from imdb.com. My computer must be cursed or something :(.

    Before StreamedMP I used the plugin alone with RC1 and, as far as I can remember, it worked fine...

    I will try to get on the IRC channel.
     

    RoChess

    [quote="no.diggity"]I tried version 0.1.5, same results: it is not working and gets its info from imdb.com. My computer must be cursed or something :(.[/quote]

    Please open the CSFD.CZ link used by the scraper manually in your browser.

    Do you get any errors then?

    For a proper test you might have to use Firefox with the userAgent switcher plugin and set the userAgent value to "Mozilla/5.0 (Windows; U; MSIE 7.0; Windows NT 6.0; en-US)", because IMPAwards, for example, is now blocking that userAgent (a patch is being worked on).

    Maybe you have some software running that causes the wrong results to be fed back. The only way to track that down would be to install Fiddler2 and inspect all the HTTP traffic that is going on.

    Fiddler Web Debugger - A free web debugging tool
     

    Trottel

    [quote="no.diggity"]I tried version 0.1.5, same results: it is not working and gets its info from imdb.com.[/quote]
    Is all the info from IMDb? Even Genres, Certification and Languages? And what is in Primary source?

    Today I have some problems with my script too. It looks like they made some changes to the CSFD.cz pages, so for the 'Title' field my script retrieves only original titles and doesn't retrieve Czech titles. This shouldn't be difficult to fix, but I have a problem with the summary.
    The script doesn't retrieve the summary for most movies (when the Czech title is different from the original title). Could it be caused by redirection?
    [quote="RoChess"]Open CSFD.CZ link used by scraper manually in your browser please.[/quote]
    In this case the search result page links to the movie as www.csfd.cz/film/11970-bronx-tale-a/, so the script creates movie.site_id with the value "11970-bronx-tale-a/". But when you click on this link you are redirected to the page www.csfd.cz/film/11970-pribeh-z-bronxu/ and the summary isn't retrieved.

    EDIT: Sorry, the summary is retrieved for this movie, but it isn't for some other movies, e.g. Resident Evil: Apocalypse, Robin Hood: Prince of Thieves,...
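
To make the redirect observation concrete, here is a small sketch that compares the two links quoted above. The helper name film_id is hypothetical and is not part of the scraper script. It only shows that both slugs share the same numeric prefix; whether CSFD.cz would accept the number alone as a site_id is an assumption that has not been tested here.

```python
import re

# Both links are copied from the post above; film_id is a hypothetical helper.
SEARCH_LINK = "www.csfd.cz/film/11970-bronx-tale-a/"
REDIRECT_TARGET = "www.csfd.cz/film/11970-pribeh-z-bronxu/"


def film_id(url):
    """Return the numeric prefix of a /film/<id>-<slug>/ path, or None."""
    match = re.search(r"/film/(\d+)(?:-|/|$)", url)
    return match.group(1) if match else None


# Both links resolve to the same numeric id even though the slugs differ.
print(film_id(SEARCH_LINK), film_id(REDIRECT_TARGET))
```

If the slug part changes on redirect, comparing on the numeric part is one way to tell that the search result and the final page are the same movie.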
     

    no.diggity

    The address you posted for the test works OK, meaning I get a list of search results for "Bronx Tale", and at the top of the list is "Bronx Tale, A (1993)".

    I use the Opera browser and have never had any issues with the CSFD database.

    I am not aware of any new software that could affect the traffic, but if you think it would be helpful, I can certainly provide a log from Fiddler.

    To Trottel: yes, all the info came from imdb.com. The CSFD scraper is in the top position, and I tried both options: automatic retrieval of movie data (set to Czech language) and manually managed movie data sources (CSFD at top).
     

    RoChess

    Your log file shows that "nothing" gets obtained by the actual scraper node, so everything else in the scraper script fails.

    So this is a problem in your MovingPictures plugin, your MediaPortal setup, your OS or something else on your system, because it works for Trottel and other users. That means it is not a userAgent problem as with IMPAwards.

    Fiddler2 would allow you to see why you are not getting the HTML source code from the CSFD.CZ website. Does it perhaps take an extremely long time for that link I gave you to show up in Opera? MovingPictures eventually gives up after 5 seconds, so if it takes longer for you, that would explain things.

    Run Fiddler2 in the background and capture everything, save it as a log file in Fiddler2 and upload it to the drop.io website. That will show me exactly where your problem lies (router, OS, ISP, etc).
     

    RoChess

    [quote="no.diggity"]
    Here is the address for the Fiddler2 log, hope I did it right:
    drop.io ubs1pe7

    I also uploaded a Fiddler2 log made when using Opera and searching on Česko-Slovenská filmová databáze - CSFD.cz for the title "A Bronx Tale".
    drop.io 18mhsdh

    As for your question, the CSFD page pops up immediately, no delays at all.

    Thanks again for your time.
    [/quote]

    I'm actually stumped on this one.

    For the MovingPictures request, csfd.cz gives the following reply:

    [collapse]
    Request:

    Code:
    GET /hledani-filmu-hercu-reziseru-ve-filmove-databazi/?search=a%20bronx%20tale HTTP/1.1
    Accept: text/xml,application/xml,application/xhtml+xml,text/html;q=0.9,text/plain;q=0.8,image/png,*/*;q=0.5
    User-Agent: Mozilla/5.0 (Windows; U; MSIE 7.0; Windows NT 6.0; en-US)
    Host: www.csfd.cz
    Connection: Keep-Alive

    Response:

    Code:
    HTTP/1.1 200 OK
    Date: Sat, 24 Apr 2010 05:33:27 GMT
    Server: Apache
    X-Powered-By: PHP/4.4.4-8+etch6
    Content-Length: 0
    Connection: close
    Content-Type: text/html
    [/collapse]

    That means a blank page. But with Opera you get:

    [collapse]
    Request:

    Code:
    GET /hledani-filmu-hercu-reziseru-ve-filmove-databazi/?search=a+bronx+tale HTTP/1.0
    User-Agent: Opera/9.80 (Windows NT 6.0; U; en-GB) Presto/2.5.22 Version/10.51
    Host: www.csfd.cz
    Accept: text/html, application/xml;q=0.9, application/xhtml+xml, image/png, image/jpeg, image/gif, image/x-xbitmap, */*;q=0.1
    Accept-Language: cs-CZ,cs;q=0.9,en;q=0.8
    Accept-Charset: iso-8859-1, utf-8, utf-16, *;q=0.1
    Accept-Encoding: deflate, gzip, x-gzip, identity, *;q=0
    Referer: http://www.csfd.cz/
    Cookie: nugg_participated=1; __utmz=1.1264939893.231.17.utmcsr=lopuch.cz|utmccn=(referral)|utmcmd=referral|utmcct=/home.php; __utma=1.27574078.1258749025.1264975235.1265042228.235; uid=ff35bbd63c; __gemius_fp=1256158464481_485547315; __utmz=215963397.1270665006.367.36.utmcsr=google|utmccn=(organic)|utmcmd=organic|utmctr=angel%20heart; __utma=215963397.911869909257995900.1242841366.1272085792.1272088486.406; __utmc=215963397; __utmb=215963397.1.10.1272088486
    Cookie2: $Version=1
    Connection: Keep-Alive

    Response:

    Code:
    HTTP/1.1 200 OK
    Date: Sat, 24 Apr 2010 05:54:56 GMT
    Server: Apache
    X-Powered-By: PHP/4.4.4-8+etch6
    Connection: close
    Content-Type: text/html; charset="utf-8"
    
    <!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">
    (followed by the rest of the HTML source code)
    [/collapse]

    Now normally I would blame it on userAgent and Charset/Encoding problems, but that doesn't explain why the same plugin+scraper combination works for a lot of other people.
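
For anyone who wants to compare the two captures mechanically, here is a rough sketch. The request text is copied from the captures above, with the long Opera Cookie header left out for brevity; the helper names parse_request and search_term are made up for this example. It only lists what differs between the two requests and does not by itself explain the blank reply.

```python
from urllib.parse import urlsplit, parse_qs

# Request lines and headers copied from the two captures above
# (the long Opera Cookie header is omitted for brevity).
MOVINGPICTURES = """\
GET /hledani-filmu-hercu-reziseru-ve-filmove-databazi/?search=a%20bronx%20tale HTTP/1.1
Accept: text/xml,application/xml,application/xhtml+xml,text/html;q=0.9,text/plain;q=0.8,image/png,*/*;q=0.5
User-Agent: Mozilla/5.0 (Windows; U; MSIE 7.0; Windows NT 6.0; en-US)
Host: www.csfd.cz
Connection: Keep-Alive
"""

OPERA = """\
GET /hledani-filmu-hercu-reziseru-ve-filmove-databazi/?search=a+bronx+tale HTTP/1.0
User-Agent: Opera/9.80 (Windows NT 6.0; U; en-GB) Presto/2.5.22 Version/10.51
Host: www.csfd.cz
Accept: text/html, application/xml;q=0.9, application/xhtml+xml, image/png, image/jpeg, image/gif, image/x-xbitmap, */*;q=0.1
Accept-Language: cs-CZ,cs;q=0.9,en;q=0.8
Accept-Charset: iso-8859-1, utf-8, utf-16, *;q=0.1
Accept-Encoding: deflate, gzip, x-gzip, identity, *;q=0
Referer: http://www.csfd.cz/
Cookie2: $Version=1
Connection: Keep-Alive
"""


def parse_request(raw):
    """Split a raw HTTP request into (request_line, {lowercase_name: value})."""
    lines = [line for line in raw.splitlines() if line.strip()]
    headers = {}
    for line in lines[1:]:
        name, _, value = line.partition(":")  # split on the first colon only
        headers[name.strip().lower()] = value.strip()
    return lines[0], headers


def search_term(request_line):
    """Decode the 'search' query parameter from a request line."""
    target = request_line.split()[1]
    return parse_qs(urlsplit(target).query)["search"][0]


mp_line, mp_headers = parse_request(MOVINGPICTURES)
op_line, op_headers = parse_request(OPERA)

# Headers Opera sends that MovingPictures does not send at all.
only_in_opera = sorted(set(op_headers) - set(mp_headers))
# Headers both send, but with different values.
differing = sorted(
    name for name in set(mp_headers) & set(op_headers)
    if mp_headers[name] != op_headers[name]
)

print("Search terms:", search_term(mp_line), "|", search_term(op_line))
print("Only in Opera:", only_in_opera)
print("Different values:", differing)
```

Both "%20" and "+" decode to the same search term, so the query encoding is an unlikely culprit. That leaves the extra Opera headers (language, charset, encoding, referer, cookies) and the HTTP version in the request line as the visible differences.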

    So it has to be something in your setup, perhaps a setting, a plugin, or a corrupted file in MediaPortal and/or MovingPictures causing this.

    I recommend you rename the MediaPortal program and data folders to "MediaPortal.old" and then reinstall it fresh. Don't add anything extra except MovingPictures and the scraper, and configure the import path to a folder containing 2 movies. See if it works then. If it does, you'll have your work cut out to debug what part is broken on your normal install.
     

    RoChess

    On a better note, I finished fixing all the RegExp code that needed changing to counteract all the changes CSFD.CZ made to their website, so everything should work again for Title/Summary/etc.

    I even took the liberty of cleaning up a few minor issues to make things display better, such as trimming spaces on some values.

    Enjoy.
     

    Attachments

    • CSFD 0.1.7.xml
      30.6 KB
