NP_SpamBayes 1.1.0 done !

xiffy
Nucleus Guru
Posts: 1194
Joined: Wed Mar 27, 2002 6:37 pm
Location: Deventer

Postby xiffy » Mon Sep 04, 2006 11:49 pm

All those interested in yet another spamfighting tool say: Aye!

I would like to announce NP_SpamBayes. This plugin introduces Bayesian filtering to your weblog, hooking into the major events that fire when comments or trackbacks are posted to your weblog.
The download link is now available and you should read more about this baby in the wiki: NP_SpamBayes. I started writing this plugin because Blacklist wasn't bulletproof anymore. I know there are other plugins, but I refuse to add captcha or javascript-powered plugins. The current spam messages are not easily stopped by adding keywords to a list. After 1 day of extensive testing and a good corpus of ham and spam messages I did not have to delete a single spam message today. SpamBayes missed 3 spams but they were caught by good old Blacklist.
So if you are interested, please read the wiki page and then consider whether this is the anti-spam plugin for you.
Warning: you should have a spam-free blog when you start training the plugin.

Expected time of arrival for the first zipfiles: Wednesday 6 Sept.
update
It's done. Go get your package: spambayes version 1.1.0, and remember: READ the wiki page! An untrained filter won't do you any good!
version 1.0.1 sees the light. No urgent need to upgrade if you've got version 1.0 installed. 1 small bug fixed and 1 convenience added in the log screen (totals per category in the title).
version 1.0.2 has been born. Lots of nice features added to the log facility so you can investigate spam and false positives efficiently.
version 1.0.3 has been born. This version solves a small bug with logging: all older versions have logging enabled whether you say yes or no to the logging option. Also added the option to train all as yet untrained comments. This way you can keep your ham filter fresh.
version 1.0.4 has been born. Version 1.0.3 disabled all logging in PHP version 4. This has been fixed by this release, nothing else added. So if version 1.0.3 works, just leave it where it is (if it ain't broke, don't fix it ...). If you run PHP version 4, you should upgrade (just uploading the new release will suffice, no uninstall / install needed for upgrading).
version 1.0.5 has been born. Update probabilities is now obsolete. The numbers are now calculated automatically after each training action. No other features are added (just uploading the new release will suffice, no uninstall / install needed for upgrading).
version 1.1.0 (beta) has been born. Logging overhaul: paging, number of items, explain option and promote to weblog. It's all there now.
Last edited by xiffy on Wed Jan 10, 2007 12:16 am, edited 13 times in total.
roel
Nucleus Guru
Posts: 4469
Joined: Tue Apr 16, 2002 12:41 am
Location: Rotterdam, The Netherlands

Postby roel » Tue Sep 05, 2006 8:45 am

This sounds good!
You of course need a weblog with quite some comments to get reliable results, so that may not be helpful to new bloggers.

However, I set up a Nucleus 3.3 beta site on http://roelg.nl with Rakaz' anti-spam plugins and just left it there. And I haven't seen any spam there yet.

Together with NP_CommentCensor and the text-based captcha plugin we are getting some good defenses against comment spam. :)

(Btw, will this work for trackbacks too? And do you plan to plug it into the spamcheck api that 3.3 will provide?)

Thanks for all the hard work, Xiffy!

Postby xiffy » Tue Sep 05, 2006 10:16 am

If NP_SpamCheck has the same interface Rakaz and I first developed for Blacklist, NP_Referrer and TrackBack, then the answer is yes, it works together with NP_SpamCheck. (I discovered this yesterday when I cleaned my referrer spam and SpamBayes started to delete referrers before Blacklist did ;-) )
And like I wrote in the wiki, it all depends on training. So the more comments the better it is; what is best with Spam Bayes is that eventually it becomes a filter for your site. No central repository. On my Dutch site English comments are rare and 99.9% of them are spam. So I can train with more English words than someone with an English blog...
It has been operational on my site for 2 days; it caught over 200 spam comments and only one came through. Luckily Blacklist caught that one. And with one click I could train SpamBayes to never let that kind of comment get through.

Thursday ... (must sleep)

Postby xiffy » Wed Sep 06, 2006 5:24 pm

Okay, I've been reading the extensive discussion started by Rakaz concerning SpamCheck in version 3.3.
At the moment this plugin is for Nucleus 3.23 and lower.
When 3.3 goes public, 1 code change would suffice to let the new spam API control the plugin. All that needs to be done is the removal of the preAddComment event and the validateForm event. They are needed now because the current Nucleus version hasn't got the SpamCheck event enabled in the core.

So yes, when 3.3 gets out, this plugin will have a 3.3 version as well.
Concerning Trackback: I've re-enabled trackback on my site and Spam Bayes started to filter those immediately as well. (That is, if you have the latest Trackback by Rakaz, or a self-modded Trackback like mine, which calls "SpamCheck" when a trackback is posted.)

So I think we are ready to bring spam fighting to a new level with all the anti-spam plugins available.
Leng
Nucleus Guru
Posts: 2827
Joined: Sun Sep 19, 2004 2:34 am
Location: Australia

Postby Leng » Sat Sep 09, 2006 2:51 am

Just installed the plugin! I've been getting lots of trackback spam recently, so here's to hoping it will cut down on that.

On a side note, when I use the "Spam Test" option, I get the following error message:

Warning: Division by zero in /home/lenglui/public_html/nucleus/plugins/spambayes/spambayes.php on line 72

Line 72 merely checks to see if the admin area is turned on? Even turning on the quickmenu option still gives this error.

Edit: Hrmm...trying to send a message to myself through the member contact form now gives this error when logged in:

Warning: Division by zero in /home/lenglui/public_html/nucleus/plugins/spambayes/spambayes.php on line 72

Warning: Cannot modify header information - headers already sent by (output started at /home/lenglui/public_html/nucleus/plugins/spambayes/spambayes.php:72) in /home/lenglui/public_html/nucleus/libs/globalfunctions.php on line 1175
deborahlau.com | To-Do List
Questions? See the FAQ, read the docs, or browse our plugins!!

Postby xiffy » Sat Sep 09, 2006 11:31 am

you did train spam bayes with some samples? You should see a wordcount greater than zero and a probability greater than zero for both ham and spam categories ...
Yep, line 72 in spambayes/spambayes.php says it all:
it's a very small probability which is divided by the number of words trained by the filter.

One side note for your consideration:
On my main blog I have a wordcount of:
Ham: 85960 Spam: 16100 and this filter is very effective (2 missed spams in a week, caught 6000 spams)
On another blog:
Ham: 696 Spam: 509 and to my amazement this one is even more effective. So you don't need a lot of data to get Spam Bayes running. This filter missed 0 spams and caught 333 spams (less traffic).
In the docs (see wiki) there is a lot of explanation about training the filter ...
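
A warning like the one Leng reported is what you would expect when one category has no trained words yet, since the division described above then has nothing to divide by. Below is a minimal Python sketch of guarding that step. It is not the plugin's PHP code: the function name `word_probability` and the constant `floor` are assumptions made up for illustration.

```python
def word_probability(word_count, category_total, floor=1e-4):
    """Estimate P(word | category) from raw counts.

    `floor` is a hypothetical small constant standing in for the
    "very small probability" used for words the filter has never seen.
    """
    if category_total == 0:
        # Nothing has been trained for this category yet, so there is
        # nothing to divide by. Fail with a readable message instead.
        raise ValueError("category has no trained words; train it first")
    if word_count == 0:
        # Unseen word: fall back to the small floor value.
        return floor / category_total
    return word_count / category_total
```

With a guard like this the admin page could show "please train at least one spam sample" instead of a PHP warning.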
cyblot
Nucleus Guru
Posts: 399
Joined: Tue Sep 16, 2003 8:49 pm
Location: Netherlands

Postby cyblot » Sat Sep 09, 2006 12:40 pm

xiffy wrote:[...]what is best with Spam Bayes is that eventually it becomes a filter for your site. No central repository.


Which is definitely the best way to approach spam, since we don't have to rely on anyone else maintaining a central file or service. As long as NP_SpamBayes itself keeps being updated to work with the latest Nucleus version of course :wink:

This sounds really good, I'm going to test it. One question while I do: does this mean comments won't show up until I have told Spam Bayes they are ham, or are they added to the site first, until I determine they are spam? I didn't see that info in your description, but maybe I just overlooked it.
Blots of Info
http://www.golb.org

Postby xiffy » Sat Sep 09, 2006 12:46 pm

Ah, I did not mention it because for me it was obvious (and that teaches me that not all things obvious to me will be obvious to the rest of the world). Anything that is considered 'ham' will show up on your weblog as a legit comment / trackback. However, these comments will also be logged in the Spam Bayes log, if you have logging turned on, so you can quickly train the filter to consider that particular comment as spam. (I did not add 'ham' logging to the SpamCheck event because the number of logged events could be overwhelming if you use Spam Bayes for referrer blocking as well.)

Postby Leng » Sat Sep 09, 2006 12:56 pm

xiffy wrote:you did train spam bayes with some samples? You should see a wordcount greater than zero and a probability greater than zero for both ham and spam categories ...
Yep, line 72 in spambayes/spambayes.php says it all:
it's a very small probability which is divided by the number of words trained by the filter.

Yup, I trained it with all the comments currently, but since there were no spam comments, there is a probability of 0 for spam.

Stuck in a couple of spam examples and now the error has disappeared. Yay! I'm now going to enable comments on my site without requiring registration to see how good SpamBayes is. For science!
Last edited by Leng on Sat Sep 09, 2006 1:08 pm, edited 1 time in total.

Postby xiffy » Sat Sep 09, 2006 1:00 pm

Yes, just copy-paste some spam trackbacks that you would like to stop into the train text area.
You don't need a lot, but at least 1 :-)
After that, add every spam comment / trackback that gets through to the filter, and after some time the spam will go away.
If you enable logging, training will be easier (the log will have links to train ham / spam).

Postby cyblot » Sun Sep 10, 2006 12:52 pm

I'm not sure at which point this should be taken from the announce-forum to another one, but ... I notice a small problem and have a question.

I notice that in the log, Clear oldest/newest 10 does not appear to do anything for me, but Clear complete log does work. With clear oldest/newest, I can press the button, but the items remain visible in the log page of my browser. The following error message is returned: "mySQL error with query delete from nucleus_plug_sb_log order by logtime limit 10: You have an error in your SQL syntax near 'order by logtime limit 10' at line 1"

And a question: I have tons of plugins installed, plenty of which try to block spam. Is there a "best order" for these? I currently have the following order: CommentControl, SpamCheck, Blacklist, TrackBack, SpamBayes. Is it better to put SpamBayes first or does that not matter at all?

Postby xiffy » Sun Sep 10, 2006 1:37 pm

The clear newest / oldest 10 buttons only seem to work on MySQL 4.0.4 and higher. On lower versions the query fails, so they have no effect. The only downside is that those buttons don't work.
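
For servers where `DELETE ... ORDER BY ... LIMIT` is rejected, a common workaround is to look up the ids of the oldest rows first and then delete them by id. The sketch below shows the idea in Python against an in-memory SQLite database so it can be run anywhere. The table name `nucleus_plug_sb_log` and the `logtime` column come from the error message above; the `id` primary key, the `content` column and the function names are assumptions, and this is not how the plugin itself is written.

```python
import sqlite3

def clear_oldest(conn, n=10):
    """Delete the n oldest log rows using only SELECT and DELETE ... WHERE id IN."""
    # Step 1: find the ids of the n oldest rows with a plain SELECT.
    rows = conn.execute(
        "SELECT id FROM nucleus_plug_sb_log ORDER BY logtime ASC, id ASC LIMIT ?",
        (n,),
    ).fetchall()
    ids = [row[0] for row in rows]
    if not ids:
        return 0
    # Step 2: delete exactly those rows by primary key.
    placeholders = ",".join("?" for _ in ids)
    conn.execute(
        f"DELETE FROM nucleus_plug_sb_log WHERE id IN ({placeholders})", ids
    )
    conn.commit()
    return len(ids)

def demo_clear_oldest():
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE nucleus_plug_sb_log ("
        "id INTEGER PRIMARY KEY, logtime INTEGER, content TEXT)"
    )
    # 25 fake log events with increasing timestamps 0..24.
    conn.executemany(
        "INSERT INTO nucleus_plug_sb_log (logtime, content) VALUES (?, ?)",
        [(t, f"event {t}") for t in range(25)],
    )
    deleted = clear_oldest(conn, 10)
    oldest_left, count_left = conn.execute(
        "SELECT MIN(logtime), COUNT(*) FROM nucleus_plug_sb_log"
    ).fetchone()
    return deleted, oldest_left, count_left
```

Clearing the newest 10 would be the same two steps with `ORDER BY logtime DESC`.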

The order is, I guess, a personal one. I have SpamBayes as the first line of defense at this moment, so it's the topmost anti-spam plugin. It depends on which plugin does the best job for you at the moment.

Postby xiffy » Fri Sep 15, 2006 3:10 pm

Updated the version to 1.0.2
15 sept 2006 : Small update (1.0.2)

All new functionality concerns the log facility of SpamBayes, so the upgrade is only relevant if you have logging enabled. There are now two types of filter that can be applied to your view: the known ‘ham’ and ‘spam’ filter, plus an event type selector added to the form to show only trackbacks, comments, referrers, mailtoafriends etc. The number of types depends on how many different plugins actually call SpamBayes. So if you have trackback installed, it will be added as a log type to the list. Same goes for mailtoafriend etc. Only types that have 1 or more logged events will show up in the list.
Delete logs is now reduced to two buttons: clear all, or clear the currently filtered logs, which will clear for example all trackback / referrer / comment spam from the logs while keeping all the other logs for further investigation.
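
To make the two filters and the "clear current filtered logs" button concrete, here is a small Python sketch of that behaviour on an in-memory list. The entry layout (`type`, `category`, `text`) and all function names are assumptions for illustration; the plugin stores its log in a database table and is written in PHP.

```python
def matches(entry, category=None, event_type=None):
    """True when a log entry passes both (optional) filters."""
    return ((category is None or entry["category"] == category)
            and (event_type is None or entry["type"] == event_type))

def available_types(log):
    """Event types with at least one logged event, in alphabetical order."""
    return sorted({entry["type"] for entry in log})

def clear_filtered(log, category=None, event_type=None):
    """Drop the entries the current filter shows; keep everything else."""
    return [e for e in log if not matches(e, category, event_type)]

sample_log = [
    {"type": "comment", "category": "spam", "text": "cheap pills"},
    {"type": "trackback", "category": "spam", "text": "casino links"},
    {"type": "comment", "category": "ham", "text": "nice post"},
]
```

Calling `clear_filtered` with no filter set behaves like "clear all", while passing `category="spam", event_type="trackback"` removes only the trackback spam and leaves the rest for further investigation.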

All this functionality has been added for my own benefit. I know I get a lot of spam (10,000 logged events on a weekly basis) and this way I can quickly scan all logged events to see if any false positives are inside one type or the other. The default log screen became unusable with over 200 events or so.


By the way, this version does solve the 'clear 10' problem (by removing those buttons 8) ) and gives you some other means to selectively erase some logged events ...

There is no upgrade hassle, no need to uninstall / install. Just unzip and upload to your server and the new functionality will be available to you.

Postby xiffy » Fri Sep 15, 2006 4:01 pm

A side note.
Maybe one of you, Leng or Roel, could update the FAQ about spam fighting? http://faq.nucleuscms.org/item/45
Remove Blacklist completely from the reference? Add Akismet by Rakaz (and I think some others as well) and SpamBayes?

Postby roel » Fri Sep 15, 2006 8:21 pm

xiffy wrote:A side note.
Maybe one of you, Leng or Roel, could update the FAQ about spam fighting? http://faq.nucleuscms.org/item/45
Remove Blacklist completely from the reference? Add Akismet by Rakaz (and I think some others as well) and SpamBayes?


SpamBayes and Akismet have been added. :)
pecanha
Posts: 3
Joined: Fri Jan 20, 2006 3:15 pm

Postby pecanha » Mon Sep 18, 2006 3:21 pm

When I try to Ham all the comments, I get lots of these errors:

mySQL error with query INSERT INTO nucleus_plug_sb_ref (ref, catcode, content) VALUES (6, 'ham','Peçanha é o cara mais paradoxal que existe. Ao mesmo tempo que ele é super mega over ocupado com 30 projetos diferentes ele é tão à toa que se dispõe a fazer adesivos para colar na parede da fafich e a fazer sites como esse. 201.19.186.119 201.19.186.119'): Duplicate entry '6' for key 1

And when I try to Manually Train it, it simply doesn't work, the probability and wordcount are always 0.

Did I do something wrong? When I installed it everything went fine!

Postby xiffy » Mon Sep 18, 2006 3:40 pm

Most probably training went all right and you tried it a second time. That could be the reason for the MySQL errors (the comment id is reused as the training id, so the second insert is a duplicate).
Manual training probably went well too; what you need to do after a training session is click 'update probabilities'. After that the wordcount is up to date again.
Last edited by xiffy on Tue Sep 26, 2006 10:46 pm, edited 1 time in total.
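
The "Duplicate entry" error above can be avoided by checking whether a reference has already been trained before inserting it. Here is a minimal Python sketch of that check, run against an in-memory SQLite database. The table and column names (`nucleus_plug_sb_ref`, `ref`, `catcode`, `content`) are taken from the error message; the primary key on `ref` is inferred from "Duplicate entry '6' for key 1", and the function names are hypothetical. It is not the plugin's own code.

```python
import sqlite3

def train_once(conn, ref, catcode, content):
    """Record a training reference; return False if it was already trained."""
    already = conn.execute(
        "SELECT 1 FROM nucleus_plug_sb_ref WHERE ref = ?", (ref,)
    ).fetchone()
    if already:
        # Second click on "ham all comments": skip instead of failing.
        return False
    conn.execute(
        "INSERT INTO nucleus_plug_sb_ref (ref, catcode, content) VALUES (?, ?, ?)",
        (ref, catcode, content),
    )
    conn.commit()
    return True

def demo_train_once():
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE nucleus_plug_sb_ref ("
        "ref INTEGER PRIMARY KEY, catcode TEXT, content TEXT)"
    )
    first = train_once(conn, 6, "ham", "a perfectly normal comment")
    second = train_once(conn, 6, "ham", "a perfectly normal comment")
    rows = conn.execute("SELECT COUNT(*) FROM nucleus_plug_sb_ref").fetchone()[0]
    return first, second, rows
```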

Postby pecanha » Mon Sep 18, 2006 3:56 pm

Well... it was basically that... ehehehe.... i feel a little stupid... ehehe

Thanks!
verbaljam
Posts: 666
Joined: Wed Jul 31, 2002 4:58 pm
Location: Amsterdam, The Netherlands

Postby verbaljam » Tue Sep 19, 2006 4:14 pm

Some questions and some feedback:

- Sometimes I get during working with the plugin an 'Access denied' error from Nucleus. Also the stylesheet is not loaded. When I go to the admin home again, the problem is solved.

- I got a 'Fatal error' message (something with 60 seconds exceeded) when training Spambayes for the first time with all the comments in my blog.
However, I think that it picked up the majority of the comments, because the wordcount is very high. Maybe my weblog is too large (almost five years with comments).

- Some readers reported that they ended up at the spam page with a normal comment. After I trained the filter with that particular message as 'ham', Spambayes still said it was spam. How come?

- Is it a problem to feed Spambayes with duplicate messages while training (With dozens of spam messages every day, it's almost impossible to remember which messages already have been used for training)?

- Is it also possible (and useful) to train the filter with email addresses from spammers (I noticed the NotifyMe plugin is abused by spammers that post email addresses)?

- Should the plugin be installed above all other anti-spam plugins?

Anyway, these were just some questions and remarks for feedback. My general opinion is that it is a very useful plugin. Great work, Xiffy! :!:

Postby xiffy » Tue Sep 19, 2006 5:56 pm

Note: SpamBayes works on words. A word is considered not to contain dots (.), colons (:) etc., so a URL like http://www.mydomain.com/ consists of the words http, www, mydomain and com.
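
As an illustration of that word splitting, here is a short Python sketch. The regular expression is an assumption about how such a tokenizer could be written, not the plugin's actual PHP code: it treats every run of letters and digits as a word and everything else (dots, colons, slashes, spaces) as a separator.

```python
import re

def tokenize(text):
    """Split text into lowercase words; punctuation acts as a separator."""
    # [^\W_]+ matches runs of letters and digits, including accented letters.
    return re.findall(r"[^\W_]+", text.lower())
```

For the URL from the note, `tokenize("http://www.mydomain.com/")` yields the four words http, www, mydomain and com.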
verbaljam wrote:- Sometimes I get during working with the plugin an 'Access denied' error from Nucleus. Also the stylesheet is not loaded. When I go to the admin home again, the problem is solved.

This happens when you feed SpamBayes words that are in your .htaccess (generated by Blacklist, for instance). The blocked words are in the URL, so loading of the stylesheet is blocked. That is also an indication that the next action you do will result in a 403 error. I usually hit 'back' in the browser at that point; that way you can continue working.
verbaljam wrote:- I got a 'Fatal error' message (something with 60 seconds exceeded) when training Spambayes for the first time with all the comments in my blog.
However, I think that it picked up the majority of the comments, because the wordcount is very high. Maybe my weblog is too large (almost five years with comments).

You nailed it exactly. With over 2000 comments or so, most servers would be executing for over 60 seconds. The first 1500 or so comments should be trained though. I might add an extra option to the menu: train untrained comments as ham. That way you can train all comments (just by clicking it 3 or 4 times).
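
The "click it 3 or 4 times" idea amounts to training in batches and remembering which comments are already done, so that each run stays under the time limit. A rough Python sketch, where the function name, the `batch_size` value and the comment layout (`id`, `body`) are all assumptions for illustration:

```python
def train_untrained(comments, trained_ids, train, batch_size=500):
    """Train up to batch_size not-yet-trained comments as ham.

    Returns how many untrained comments are still left afterwards, so the
    caller knows whether another click is needed.
    """
    pending = [c for c in comments if c["id"] not in trained_ids]
    for comment in pending[:batch_size]:
        train(comment["body"], "ham")      # hand the text to the filter
        trained_ids.add(comment["id"])     # remember it, so it is not trained twice
    return max(0, len(pending) - batch_size)
```

Each call does a bounded amount of work, so a blog with five years of comments is handled in a few short runs instead of one long one that hits the 60-second limit.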
verbaljam wrote:- Some readers reported that they ended up at the spam page with a normal comment. After I trained the filter with that particular message as 'ham', Spambayes still said it was spam. How come?

That's the nature of Bayesian filtering. It's not black / white listing; it gives each word a specific value. So if you trained your spam rules with a lot of .com domains, any post containing .com will have a high value for spam. The commenter needs to type a lot of ham words to get the value down. Adding the comment to the ham rules will lower the .com score a little, but it will not make it ham quickly. Which leads to your next question:
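
To see why a single round of ham training may not be enough, here is a toy naive Bayes classifier in Python. It is a sketch under stated assumptions, not the plugin's formula: the class name `TinyBayes` is made up, both categories get equal prior weight, and unseen words are handled with add-one (Laplace) smoothing.

```python
import math
from collections import Counter

class TinyBayes:
    """Toy word-count Bayes filter with two categories, 'ham' and 'spam'."""

    def __init__(self):
        self.counts = {"ham": Counter(), "spam": Counter()}

    def train(self, words, category):
        self.counts[category].update(words)

    def log_score(self, words, category):
        counts = self.counts[category]
        total = sum(counts.values())
        vocab = len(set(self.counts["ham"]) | set(self.counts["spam"]))
        # Add-one smoothing keeps unseen words from zeroing the score.
        return sum(math.log((counts[w] + 1) / (total + vocab)) for w in words)

    def classify(self, words):
        spam = self.log_score(words, "spam")
        ham = self.log_score(words, "ham")
        return "spam" if spam > ham else "ham"

def trainings_until_ham(bayes, words, limit=10):
    """Train `words` as ham repeatedly; return how many rounds it took."""
    for n in range(1, limit + 1):
        bayes.train(words, "ham")
        if bayes.classify(words) == "ham":
            return n
    return None

bayes = TinyBayes()
for _ in range(20):
    bayes.train(["cheap", "pills", "com"], "spam")
    bayes.train(["nice", "post", "thanks"], "ham")

legit_comment = ["see", "example", "com"]
before = bayes.classify(legit_comment)
rounds = trainings_until_ham(bayes, legit_comment)
```

In this toy data the word "com" has been seen twenty times as spam and never as ham, so the legit comment starts out classified as spam and one round of ham training only nudges the balance; `rounds` records how many rounds were needed before it flipped.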
verbaljam wrote:- Is it a problem to feed Spambayes with duplicate messages while training (With dozens of spam messages every day, it's almost impossible to remember which messages already have been used for training)?

No, it's not a problem. In fact it can be good to give some spam words a higher word value, and equally to train some ham words with a higher word value. I think it's hard to show on the test page which words produced the high score, but sometimes I wish it was visible. As a tip: I put all three major top-level domains, 'org', 'com' and 'net', in my ignore list because spammers try to push URLs. So the spam score for these three is gigantic, while it is the domain, not the top-level domain, that should be weighed. This solved a lot of false positives for me. But still, some legit comments needed three or four rounds of training to get a ham score.
verbaljam wrote:- Is it also possible (and useful) to train the filter with email addresses from spammers (I noticed the NotifyMe plugin is abused by spammers that post email addresses)?

By all means. But be aware that you should tell SpamBayes to ignore 'hotmail' and 'gmail', for instance; otherwise the spam score for these domains would be too high for legit users to beat as well.
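
An ignore list like the one described in these answers can be sketched in a few lines of Python. The set contents come from the posts above ('org', 'com', 'net', 'hotmail', 'gmail'); the names `IGNORED` and `significant_words` and the splitting rule are assumptions, not the plugin's implementation.

```python
import re

# Words that say nothing about spam vs. ham on their own.
IGNORED = {"org", "com", "net", "hotmail", "gmail"}

def significant_words(text, ignored=IGNORED):
    """Split text into lowercase words and drop the ignored ones."""
    words = re.findall(r"[^\W_]+", text.lower())
    return [w for w in words if w not in ignored]
```

With this in place a spammer's address still contributes its distinctive parts (the mailbox name, the pushed domain), while legit users with a hotmail or gmail address are not penalised for the provider name.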
