NP_SpamBayes 1.1.0 done !

xiffy
Nucleus Guru
Posts: 1194
Joined: Wed Mar 27, 2002 6:37 pm
Location: Deventer

Postby xiffy » Thu Jan 25, 2007 4:02 pm

You could edit the source and change the catcode field to a length of 10 instead of 250; it only needs to contain 'ham' and 'spam'. That should solve your problem.
Look for catcode varchar(250) and change it to catcode varchar(10)
(found in the function 'install'). It looks like your host uses utf8 or some other Unicode character set on its tables by default, so you now have exactly 1000 bytes for the key, and at up to three bytes per utf8 character a varchar(250) column uses up most of that.
fishy
Posts: 21
Joined: Tue Nov 21, 2006 7:35 pm
Location: Beijing, China

Postby fishy » Thu Jan 25, 2007 4:18 pm

Thanks, it works! :D

And yes, I'm using utf8 as my db charset
Leng
Nucleus Guru
Posts: 2827
Joined: Sun Sep 19, 2004 2:34 am
Location: Australia

Postby Leng » Fri Jan 26, 2007 1:17 am

Just checking in with a couple more things I've noticed about SpamBayes.

There are some very obvious types of spam comments that are getting picked up as ham, such as

Code: Select all

Best site to buy, cheap, order Lipitor <a href="http://wolvas.org.uk/images/articles/online/www/h/2/index.html">Lipitor</a> <a href="http://wolvas.org.uk/images/articles/online/www/h/2/Lipitor-side-effects.html">Lipitor Side Effects</a> <a href="http://wolvas.org.uk/images/articles/online/www/h/2/Lipitor-vs-zocor.html">Lipitor Vs Zocor</a> <a href="http://wolvas.org.uk/images/articles/online/www/h/2/Lipitor-zocor.html">Lipitor Zocor</a> <a href="http://wolvas.org.uk/images/articles/online/www/h/2/Generic-lipitor.html">Generic Lipitor</a> <a href="http://wolvas.org.uk/images/articles/online/www/h/2/Lipitor-side-effects-muscle-pain.html">Lipitor Side Effects Muscle Pain</a> <a href="http://wolvas.org.uk/images/articles/online/www/h/2/Drug-lipitor.html">Drug Lipitor</a> <a href="http://wolvas.org.uk/images/articles/online/www/h/2/Lipitor-muscle-pain.html">Lipitor Muscle Pain</a> <a href="http://wolvas.org.uk/images/articles/online/www/h/2/Medication-side-effects-lipitor.html">Medication Side Effects Lipitor</a> <a href="http://wolvas.org.uk/images/articles/online/www/h/2/Lipitor-and-memory-loss.html">Lipitor And Memory Loss</a> <a href="http://wolvas.org.uk/images/articles/online/www/h/2/Lipitor-lawsuit.html">Lipitor Lawsuit</a> <a href="http://wolvas.org.uk/images/articles/online/www/h/2/Lipitor-and-alcohol.html">Lipitor And Alcohol</a> Sypp Sypp


Is there any way to make SpamBayes look at the frequency of words appearing in comments to decide if something is spam or not? For example, a real ham comment would not contain more than 3 to 5 links maximum in a comment (usually only 1, if any at all). However, a large number of spam will have upwards of 10 or more. Perhaps the user could set a "threshold" for a given word or phrase which SpamBayes will use as an additional guide to judge if something is ham or spam.
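The link-count idea is easy to sketch in PHP. The snippet below is only an illustration, assuming a hypothetical count_links() helper and a user-set threshold of 5; neither exists in NP_SpamBayes as far as this thread shows, and a real version would more likely feed the count to the filter as one extra signal than decide on its own.

```php
<?php
// Hypothetical helpers, not part of NP_SpamBayes: count the links in a
// comment and flag the comment when the count exceeds a chosen threshold.

/**
 * Count occurrences of "http://" or "https://" in the comment body.
 */
function count_links($comment)
{
    // preg_match_all() returns the number of full matches, or false on error.
    $n = preg_match_all('~https?://~i', $comment, $matches);
    return $n === false ? 0 : $n;
}

/**
 * Return true when the comment contains more links than $maxLinks.
 */
function exceeds_link_threshold($comment, $maxLinks = 5)
{
    return count_links($comment) > $maxLinks;
}

// One ordinary comment with a single link, one made-up spam body with 12.
$ham  = 'Nice post! I wrote about this too: http://example.com/my-post';
$spam = str_repeat('<a href="http://example.org/x.html">Lipitor</a> ', 12);

var_dump(exceeds_link_threshold($ham));   // bool(false)
var_dump(exceeds_link_threshold($spam));  // bool(true)
```

Counting the scheme prefix rather than parsing the HTML keeps the check cheap and also catches bare URLs that are not wrapped in an anchor tag.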

Another suggestion (and I have a feeling this one may be too hard to implement/not feasible) is if SpamBayes could do a check of words against a dictionary? While not all comments have impeccable spelling, a lot of spam words wouldn't be in a normal dictionary. So if something is badly spelled, chances are it is spam.

The last problem is that I have a lot of this type of spam coming through:

Code: Select all

While site keep Good work 81.177.0.6 appa[3!] appa[3!] 81.177.0.6


I'm at quite a loss as to how to keep them out. I keep training them as spam but SpamBayes just doesn't seem to be able to tell. :( This may be helped if we can filter out certain phrases as these bots seem to use the same ones over and over again.
deborahlau.com | To-Do List
Questions? See the FAQ, read the docs, or browse our plugins!!
matt_t_hat
Posts: 1123
Joined: Sun Aug 21, 2005 4:45 pm
Location: UK

Postby matt_t_hat » Fri Feb 02, 2007 2:25 pm

Leng wrote: Just checking in with a couple more things I've noticed about SpamBayes.

There are some very obvious types of spam comments that are getting picked up as ham, such as

Code: Select all

Best site to buy...clip...


Is there any way to make SpamBayes look at the frequency of words appearing in comments to decide if something is spam or not? For example, a real ham comment would not contain more than 3 to 5 links maximum in a comment (usually only 1, if any at all). However, a large number of spam will have upwards of 10 or more. Perhaps the user could set a "threshold" for a given word or phrase which SpamBayes will use as an additional guide to judge if something is ham or spam.

Another suggestion (and I have a feeling this one may be too hard to implement/not feasible) is if SpamBayes could do a check of words against a dictionary? While not all comments have impeccable spelling, a lot of spam words wouldn't be in a normal dictionary. So if something is badly spelled, chances are it is spam.

The last problem is that I have a lot of this type of spam coming through:

Code: Select all

While site keep Good work 81.177.0.6 appa[3!] appa[3!] 81.177.0.6


I'm at quite a loss as to how to keep them out. I keep training them as spam but SpamBayes just doesn't seem to be able to tell. :( This may be helped if we can filter out certain phrases as these bots seem to use the same ones over and over again.


These would pattern-match very well, so SpamCheck could handle them with something like SC.SpecialRules.php (say). There is a chance I'll be writing a few of those later this year, so keep an eye on me.

Postby matt_t_hat » Fri Feb 02, 2007 2:29 pm

PHYFS - http://wakka.xiffy.nl/phyfs would deal with the first case (too many http links), but I've not used it. Personally I think it should be an SC.plugin... I might do that.

Are you running a blacklist at all? I would have thought that these shorties would be bounced by the blacklist.

That said, if the plugin we are meant to be talking about can use "SEO words" from URLs too, that'd trip up the smarty-pants spammers.

Postby Leng » Sat Feb 03, 2007 5:41 am

Wow, I completely missed that plugin, thanks for pointing it out. I've installed it now so hopefully it will cut down on the amount.

No, I am not running any sort of blacklist actually, just relying on SpamBayes so far. :D
MacFrog
Posts: 113
Joined: Thu Aug 28, 2003 3:54 am

Postby MacFrog » Mon Feb 05, 2007 10:11 pm

Leng wrote: Just checking in with a couple more things I've noticed about SpamBayes.

There are some very obvious types of spam comments that are getting picked up as ham, such as


Is there any way to make SpamBayes look at the frequency of words appearing in comments to decide if something is spam or not? For example, a real ham comment would not contain more than 3 to 5 links maximum in a comment (usually only 1, if any at all). However, a large number of spam will have upwards of 10 or more. Perhaps the user could set a "threshold" for a given word or phrase which SpamBayes will use as an additional guide to judge if something is ham or spam.


Huh, my site recognizes those as spam. Then again, I've got 16k spam words in my DB and over 120k ham words. How many do you have?

I know from work experience that filters like these only become reliable once their reference set becomes very, very large. And even then they're not 100% accurate.
wgroleau
Posts: 402
Joined: Sat Jun 10, 2006 4:20 pm
Location: Indiana / USA

New options ??

Postby wgroleau » Wed Feb 14, 2007 10:18 pm

I downloaded SpamBayes 1.1.0

Compared the files and saw that there are some new options.

But after I installed it, I don't see those options in the list.

Do I have to uninstall first? I think I already asked that and someone said No.
Wes Groleau

Postby xiffy » Wed Feb 14, 2007 10:21 pm

When in doubt, just uninstall and install. The tables won't get deleted, so your training sessions won't be lost.
But which versions did you compare? I can't remember adding or deleting any options.

Postby MacFrog » Tue Feb 27, 2007 10:27 pm

Any word on version 1.2?

The ability to adjust words sounds neat!

Another suggestion:

Some sort of pre-loaded spam included with the database. My problem (found earlier in the thread) was due to having so many comments in my blog.

When I clicked on "train all ham" it produced so much ham that anything not included in the database was marked as spam ... Training more spam fixed that problem (I'm currently at .8 ham and .2 spam).

I don't mind submitting mine for inclusion (I assume just the spam portion) however some sort of universal seed would be nice.

Re: New options ??

Postby wgroleau » Thu Mar 01, 2007 1:18 am

wgroleau wrote: Compared the files and saw that there are some new options.
But after I installed it, I don't see those options in the list.


I found two of these options. Perhaps I was mistaken about the third.
I don't remember what it was--I just saw them by scrolling through the source.

Would be nice to have the opposite of "publish," i.e., when looking at the log,
if you see a false negative, click "FRY" and it is re-classified spam and removed
from the comments.
admun
Nucleus Guru
Posts: 4088
Joined: Mon Oct 20, 2003 2:57 am
Location: San Francisco, CA, USA

Postby admun » Mon May 07, 2007 3:43 pm

I am reviewing some plugin code and noticed that the plugin does not implement getTableList(), even though it creates a database table. I'd suggest adding it. :wink:
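For anyone wondering what that would look like, here is a rough sketch. As far as I understand the Nucleus plugin API, getTableList() returns the names of the tables a plugin created so that they can be included in backups. The class name and the two table names below are placeholders (the plugin's real table names are not given in this thread), and the stand-in NucleusPlugin class and sql_table() function are there only so the snippet runs outside Nucleus.

```php
<?php
// Stand-ins so the sketch runs on its own; inside Nucleus the real
// NucleusPlugin base class and sql_table() helper are already defined.
if (!function_exists('sql_table')) {
    function sql_table($name)
    {
        return 'nucleus_' . $name; // assumes the default table prefix
    }
}
if (!class_exists('NucleusPlugin')) {
    class NucleusPlugin
    {
        function getTableList()
        {
            return array(); // default: the plugin owns no tables
        }
    }
}

class NP_SpamBayesSketch extends NucleusPlugin
{
    // Report the tables this plugin created.
    // Both names are placeholders, not the plugin's real table names.
    function getTableList()
    {
        return array(
            sql_table('plug_sb_example_one'),
            sql_table('plug_sb_example_two'),
        );
    }
}

$plugin = new NP_SpamBayesSketch();
print_r($plugin->getTableList());
```

Going through sql_table() instead of hard-coding the full name keeps the list correct on installs that use a custom table prefix.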
cyblot
Nucleus Guru
Posts: 399
Joined: Tue Sep 16, 2003 8:49 pm
Location: Netherlands

Postby cyblot » Wed Nov 07, 2007 12:02 am

verbaljam wrote: 1. When not selecting any item and selecting training ham with the drop down at the bottom of the page, I get the error:

Code: Select all

Warning: Invalid argument supplied for foreach() in /home/virtual/site136/fst/var/www/html/nucleus/plugins/spambayes/index.php on line 261
--end of batch--



Using 1.1.0 on a Nucleus 3.31 installation, I also get this error message. It did not happen on my 3.24 install, but I can't say for sure if this started right after upgrading, or if it is a problem that turned up a couple of days after the upgrade.

The error shows whenever I try to use any of the "with selected" options. It does not show up when I just want to train a single item using the option to the right of the screen. Which makes sense, since only training one entry probably doesn't use a "foreach" statement.

Any ideas what this might be and how I can fix it?
Blots of Info
http://www.golb.org

Postby xiffy » Wed Nov 07, 2007 12:16 am

Strange and weird.
The problem can only occur inside the function sb_batch. There is a debug line which you could uncomment (remove the complete //debug: prefix);
it should tell you what is inside the logids variable.
Maybe this line in the function:

Code: Select all

      $logids = requestIntArray(batch);

is the culprit and should be replaced by

Code: Select all

      $logids = requestIntArray('batch');

That's all I can come up with ...
Could you copy the query string that gets posted? That is the part after the question mark. It should read something like this:

Code: Select all

batch%5B0%5D=803657&batch%5B1%5D=803656&batchaction=tspam&page=batch&ipp=10&filter=all&filtertype=all&keyword=

Postby cyblot » Wed Nov 07, 2007 8:43 am

xiffy wrote: Strange and weird.
The problem can only occur inside the function sb_batch. There is a debug line which you could uncomment (remove the complete //debug: prefix);
it should tell you what is inside the logids variable.


At the risk of sounding stupid ;) where do I find this function, in which file? I tried the Spambayes files, but could not find it, nor the 'logids' line you mentioned. I'll check all Nucleus files tonight, but I'm late for work.

Could you copy the query string that gets posted? That is the part after the question mark. It should read something like this:

Code: Select all

batch%5B0%5D=803657&batch%5B1%5D=803656&batchaction=tspam&page=batch&ipp=10&filter=all&filtertype=all&keyword=


It reads:

Code: Select all

batch%5B0%5D=100&batch%5B1%5D=98&batch%5B2%5D=99&batch%5B3%5D=97&batch%5B4%5D=96&batch%5B5%5D=95&batch%5B6%5D=93&batch%5B7%5D=94&batch%5B8%5D=91&batch%5B9%5D=92&batchaction=tspam&page=batch&ipp=10&filter=all&filtertype=all&keyword=


and decoded:

Code: Select all

batch[0]=100&batch[1]=98&batch[2]=99&batch[3]=97&batch[4]=96&batch[5]=95&batch[6]=93&batch[7]=94&batch[8]=91&batch[9]=92&batchaction=tspam&page=batch&ipp=10&filter=all&filtertype=all&keyword=
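To see what PHP itself makes of a query string like this, a small sketch with parse_str() may help. It uses only the first three ids from the string above, and int_array_from() is a hypothetical stand-in written for this example; it is not the Nucleus requestIntArray() implementation, which is not shown in this thread.

```php
<?php
// The posted query string (still URL-encoded), shortened to three ids.
$query = 'batch%5B0%5D=100&batch%5B1%5D=98&batch%5B2%5D=99'
       . '&batchaction=tspam&page=batch&ipp=10&filter=all&filtertype=all&keyword=';

// parse_str() decodes the string and applies PHP's bracket rules,
// so batch[0], batch[1], ... end up as one array under the key "batch".
parse_str($query, $vars);

/**
 * Hypothetical stand-in for a "request int array" helper: returns the named
 * value as an array of integers, or an empty array if it is missing or scalar.
 */
function int_array_from($vars, $name)
{
    if (!isset($vars[$name]) || !is_array($vars[$name])) {
        return array();
    }
    return array_map('intval', $vars[$name]);
}

$logids = int_array_from($vars, 'batch');
print_r($logids);                 // the three ids as integers
echo $vars['batchaction'], "\n";  // tspam
```

If the ids come out as a proper array here, the query string itself is fine and the problem has to sit in whatever reads the request on the server side.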



Thank you very much for your help. Could it be because of weird characters in one of the entries in the logfile? (not that I identified one yet, though)

Postby xiffy » Wed Nov 07, 2007 10:05 am

Ah sorry, I forgot to mention: the function is inside
spambayes/index.php (that is the admin file of the plugin).
The URL looks okay to me.
And weird characters could not be the problem at this point. It's the batch array that does not get parsed, and that happens before the event is trained.
Anyway, let's see what the debug output tells us tonight!

Postby cyblot » Wed Nov 07, 2007 8:42 pm

xiffy wrote: Strange and weird.
The problem can only occur inside the function sb_batch. There is a debug line which you could uncomment (remove the complete //debug: prefix);
it should tell you what is inside the logids variable.


When I remove the "//debug:" prefix, it prints "Array".

Maybe this line in the function:

Code: Select all

      $logids = requestIntArray(batch);

is the culprit and should be replaced by

Code: Select all

      $logids = requestIntArray('batch');



This did not work, sorry.

This is the sb_batch function in my index.php:

Code: Select all

function sb_batch() {
      global $oPluginAdmin;
      $logids = requestIntArray(batch);
      $action = requestVar('batchaction');
      //debug:   print_r ($logids);
      if ($logids) foreach ($logids as $id) {
         switch ($action) {
            case 'tspam':
            case 'tham':
               $ar = $oPluginAdmin->plugin->spambayes->nbs->getLogevent($id);
               $docid = $oPluginAdmin->plugin->spambayes->nbs->nextdocid();
               $cat = substr($action,1);
               $oPluginAdmin->plugin->spambayes->train($docid,$cat,$ar['content']);
               echo 'train '.$cat.': '.$id.'<br />';
               break;
            case 'delete':
               echo 'delete: '.$id.'<br />';
               $oPluginAdmin->plugin->spambayes->nbs->removeLogevent($id);
         }
      }

      echo '--end of batch--';
}

Postby xiffy » Wed Nov 07, 2007 8:56 pm

Sounds like requestIntArray is broken; it should have said array ([0] => firstvalue, ....)
I've not installed 3.31 yet, so I don't know if there is any difference between these functions in 3.24 and 3.31.
I'll check later; I need to paint the house first...
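For reference, a short sketch of what print_r() reports for a filled array, an empty array, and null, followed by one possible guard for the loop. Which of these cases the reported "Array" corresponds to is not clear from the thread, and ids_or_empty() is a hypothetical helper, not something that exists in the plugin.

```php
<?php
// print_r($value, true) returns the text instead of printing it;
// json_encode() is used here only to make the line breaks visible as \n.
$cases = array(
    'filled array' => array(100, 98),
    'empty array'  => array(),
    'null'         => null,
);

foreach ($cases as $label => $value) {
    echo $label, ': ', json_encode(print_r($value, true)), "\n";
}

// Hypothetical guard: hand foreach an empty array whenever the value is
// not an array, so the "Invalid argument supplied for foreach()" warning
// cannot be triggered by a bad return value.
function ids_or_empty($value)
{
    return is_array($value) ? $value : array();
}

foreach (ids_or_empty(null) as $id) {
    echo "never reached\n";
}
```

A guard like this only hides the symptom; it does not explain why the ids never arrive as an array in the first place.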

Postby cyblot » Wed Nov 07, 2007 10:13 pm

xiffy wrote: Sounds like requestIntArray is broken; it should have said array ([0] => firstvalue, ....)
I've not installed 3.31 yet, so I don't know if there is any difference between these functions in 3.24 and 3.31.
I'll check later; I need to paint the house first...


No problem, I'm patient :) Enjoy the painting. I hope the movers picked up the boxes or painting will be quite a challenge.

Postby xiffy » Wed Nov 07, 2007 10:45 pm

Okay, I've installed SpamBayes on the demo site. It shows the same errors, so now I can test and figure out what is happening.
Later (and no, I did not paint the hallway where all the boxes are piled against the ceiling)
