<blockquote data-quote="JiRo" data-source="post: 710602" data-attributes="member: 91312">Re: CSFD scraper script 0.1.9 [CZ] - 100% succes hit (558 movies)<ul>
<li data-xf-list-type="ul"> 
First of all, Trottel, many thanks for your excellent work. But... 
 
When I first used your scraper script, I got about 40% successful hits. That was better than the previous version of the scraper, but still poor. A friend of mine gets 100% hits, but he uses English names for his movie files and the IMDB scraper. My goal was a 100% hit rate with Czech file names and the CSFD scraper too :D. I started reading your script and found the first small problem: 
 
&nbsp; &nbsp; &lt;set name=&quot;rx_search_results_block&quot;&gt; 
&nbsp; &nbsp; &nbsp; &lt;![CDATA[ 
&nbsp; &nbsp; &nbsp; &gt;v originálních názvech&lt;/td&gt;.+&lt;/body&gt; 
&nbsp; &nbsp; &nbsp; ]]&gt; 
&nbsp; &nbsp; &lt;/set&gt; 
 
This expression starts the results block at &quot;&gt;v originálních názvech&quot;, so the Czech movie names listed under &quot;&gt;v českých názvech&quot; are skipped. Therefore I replaced &quot;&gt;v originálních názvech&quot; with &quot;&gt;v českých názvech&quot;. The result was much better than before, but some Czech movies that had been matched before now got no hit. I then read your script more carefully and tested searches on the CSFD web page. There I found that some Czech movies are not listed in the &quot;&gt;v českých názvech&quot; section but in &quot;&gt;v originálních názvech&quot; :eek:, and in those cases the Czech section is absent altogether. 
Therefore I changed the regular expression part to: 
 
&nbsp; &nbsp; &lt;set name=&quot;rx_search_results_block&quot;&gt; 
&nbsp; &nbsp; &nbsp; &lt;![CDATA[ 
&nbsp; &nbsp; &nbsp; &gt;v českých názvech&lt;/td&gt;.+&lt;/body&gt; 
&nbsp; &nbsp; &nbsp; ]]&gt; 
&nbsp; &nbsp; &lt;/set&gt; 
 
&nbsp; &nbsp; &lt;set name=&quot;rx_search_results_block2&quot;&gt; 
&nbsp; &nbsp; &nbsp; &lt;![CDATA[ 
&nbsp; &nbsp; &nbsp; &gt;v originálních názvech&lt;/td&gt;.+&lt;/body&gt; 
&nbsp; &nbsp; &nbsp; ]]&gt; 
&nbsp; &nbsp; &lt;/set&gt; 
 
and the corresponding part of the code to: 
 
&nbsp; &nbsp; ... 
&nbsp; &nbsp; &lt;parse name=&quot;search_results_block&quot; input=&quot;${search_page}&quot; regex=&quot;${rx_search_results_block}&quot;/&gt; 
&nbsp; &nbsp; &lt;if test=&quot;${search_results_block}=&quot;&gt; 
&nbsp; &nbsp; &nbsp; &lt;parse name=&quot;search_results_block&quot; input=&quot;${search_page}&quot; regex=&quot;${rx_search_results_block2}&quot;/&gt; 
&nbsp; &nbsp; &lt;/if&gt; 
&nbsp; &nbsp; &lt;if test=&quot;${search_results_block}!=&quot;&gt; 
&nbsp; &nbsp; &nbsp; &lt;loop name=&quot;search_results_verified&quot; on=&quot;search_results_block&quot;&gt; 
&nbsp; &nbsp; &nbsp; ... 
 
My last change was to the number of searched movies, raised from the previous 20 to 100, because a few movies have a very long search result list... 
 
&nbsp; &nbsp; &nbsp; &nbsp;&nbsp; ... 
&nbsp; &nbsp; &nbsp; &nbsp;&nbsp; &lt;set name=&quot;movie[${counter}].details_url&quot; value=&quot;${site}film/${curr_details[0]}&quot;/&gt; 
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &lt;subtract name=&quot;movie[${counter}].popularity&quot; value1=&quot;100&quot; value2=&quot;${counter}&quot; /&gt; 
&nbsp; &nbsp; &nbsp; &nbsp; &lt;/loop&gt; 
&nbsp; &nbsp; &nbsp; &nbsp; ... 
 
Now I'm satisfied. The target of 100% hits is achieved! :p And your condition: 
<ul>
<li data-xf-list-type="ul"> 
Movie name should be in original or English language </li>
</ul> 
can be extended to: 
<ul>
<li data-xf-list-type="ul"> 
Movie name should be in Czech, original or English language </li>
</ul> 
Maybe we should find out whether there are movies with an English name only :eek: 
 
Currently I have a private 0.1.10 version of the CSFD scraper :oops:, but the official release is up to you. You are the author! 
 
JiRo.</li>
</ul></blockquote>
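
The two changes described in the quoted post can be sketched outside the scraper engine as follows. This is only an illustration in Python, not the scraper's own script language: the function names `find_results_block` and `popularity` are hypothetical, the two patterns are copied from the post, and the `re.DOTALL` flag is an assumption about how the engine lets `.+` span line breaks.

```python
import re

# Patterns copied from the quoted post. DOTALL is an assumption so that
# ".+" can run across line breaks in the downloaded search page.
RX_CZECH = re.compile(r">v českých názvech</td>.+</body>", re.DOTALL)
RX_ORIGINAL = re.compile(r">v originálních názvech</td>.+</body>", re.DOTALL)


def find_results_block(search_page):
    """Return the results block, trying the Czech-titles section first.

    Falls back to the original-titles section when the Czech section
    is absent, and returns an empty string when neither is found.
    """
    for pattern in (RX_CZECH, RX_ORIGINAL):
        match = pattern.search(search_page)
        if match:
            return match.group(0)
    return ""


def popularity(counter, base=100):
    """Score a result by its list position: earlier results score higher."""
    return base - counter
```

With `base=20`, a result at position 25 would score -5, whereas a base of 100 keeps positions 0 through 99 positive. Whether the scraper engine treats non-positive scores specially is not stated in the post.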
