The unsafe header turned out not to be the fix for me. It was confusing, because the actual fix (the site owner changing their encoding settings) happened at the same time I decided to try your fix. I'm still happy, because I only care about the end result.

I've been working on all these scraper issues the entire week. Fixing a regexp is easy, but encoding has turned out to be the real pain.

There are just so many factors involved: the HTTP header has to be correct, the HTML metadata has to be right, the HTML content has to match, etc. On top of everything, browsers tend to auto-correct problems as well, which makes debugging even harder because the page can display fine in there.

Scraper-script logging does not always help either, because you cannot be 100% sure it uses the same encoding: the log file could be written in a different encoding, with auto-conversion taking place along the way. It is probably right and uses UTF-8, but when the log results do not match what you see in Notepad from the source, confusion sets in.
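One way around the log-encoding doubt, if you can fetch the page yourself outside the scraper, is to log the raw bytes as ASCII escapes so that no conversion can alter them. This is only a sketch of that idea in Python, not something the scraper engine does; the function name `escape_for_log` is made up for illustration.

```python
def escape_for_log(raw: bytes) -> str:
    """Render bytes as pure ASCII; every non-ASCII byte becomes a \\xNN escape.

    The result looks the same no matter which encoding the log file
    or the viewer uses, so the bytes can be compared directly.
    """
    return raw.decode("ascii", errors="backslashreplace")


if __name__ == "__main__":
    # The same word encoded two ways gives visibly different escapes.
    print(escape_for_log("caf\u00e9".encode("iso-8859-1")))
    print(escape_for_log("caf\u00e9".encode("utf-8")))
```

A single escaped byte for an accented character points at a single-byte encoding such as ISO-8859-1, while two escaped bytes point at UTF-8.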

Notepad++ offers some options to see better what is happening and adjust stuff on the fly both on display side and file side -- [media=youtube]QbbmhYAhYYg[/media]

That leaves you with verifying that the HTTP header matches the actual encoding; for this, the F12 network tools in your browser or Fiddler2 should offer insight.
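If you would rather script that check than eyeball it, something along these lines could compare the three places the encoding shows up. It is a rough sketch in Python using only the standard library; the function names and the 4096-byte search window for the meta tag are my own assumptions, and you would feed it the Content-Type value and the raw body bytes you captured with the network tools or Fiddler2.

```python
import re
from email.message import Message

# Matches charset=... inside a <meta> tag (both the short and http-equiv forms).
META_RE = re.compile(rb'<meta[^>]+charset\s*=\s*["\']?([\w-]+)', re.IGNORECASE)


def header_charset(content_type):
    """Return the charset parameter of a Content-Type header value, or None."""
    msg = Message()
    msg["Content-Type"] = content_type
    charset = msg.get_param("charset")
    return charset.lower() if isinstance(charset, str) else None


def meta_charset(body):
    """Return the charset declared in the HTML metadata, or None."""
    match = META_RE.search(body[:4096])
    return match.group(1).decode("ascii").lower() if match else None


def decodes_as_utf8(body):
    """True if the raw bytes are valid UTF-8."""
    try:
        body.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


def diagnose(content_type, body):
    """Collect what the header claims, what the HTML claims, and what the bytes say."""
    return {
        "header": header_charset(content_type),
        "meta": meta_charset(body),
        "valid_utf8": decodes_as_utf8(body),
    }
```

When both "header" and "meta" say utf-8 but "valid_utf8" is False, the content itself is in some other encoding, which is the mismatch a browser would quietly paper over.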

So be wary when you see UTF-8 in the HTTP header and in the HTML metadata, but the actual content is in ISO-8859-1. That's one of the issues I was dealing with. It works fine in MovPic v1.2.x, but in v1.4.x the same thing failed for me. Overriding the encoding="..." part on <retrieve> did not fix it, which is where the site owner eventually came into play. Perhaps overriding the encoding will work for you, if that is even part of your problem.
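For comparison, this is roughly what an encoding override with a fallback looks like when you control the decoding yourself. It is a Python sketch of the general idea only, not how MovPic handles encoding="..." internally; the function name `decode_body` and its defaults are assumptions for illustration.

```python
def decode_body(body: bytes, declared: str = "utf-8", fallback: str = "iso-8859-1") -> str:
    """Decode with the declared encoding, falling back if the bytes do not match it.

    ISO-8859-1 maps every byte to a character, so the fallback never raises;
    it can still give wrong characters if the real encoding is something else.
    """
    try:
        return body.decode(declared)
    except (UnicodeDecodeError, LookupError):
        # LookupError covers a declared encoding name Python does not know.
        return body.decode(fallback)
```

With this, a page that claims UTF-8 but actually ships ISO-8859-1 bytes still comes out readable, while a genuine UTF-8 page is untouched.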
