The original words of Phanes, tirelessly carved into a slab of "No'".

IRCTHULU ON THE HORIZON

I’m pleased to announce the Tenta client came together quite well, marking the completion (and activation) of the data backend for IRCTHULU.

For those of you who haven’t been following this project, this is how IRCTHULU works:

Tenta is a pair of services: the runner client, called tentacle, and the producer (or “transporter”) service, called nerve.  They run in the background on the systems of the people who collect the data, and are installable as an RPM.  Nerve pumps the messages captured by tentacle into a message queue I currently own.  Tentacle is configured by the runner and can point to any channel they wish.  I can recommend channels, but ultimately the decision is not mine.

From there, a multi-threaded service called synapse in the storage layer empties out the MQ, transforms the messages, and pushes them into a database.  An API then has read-only access to the databases to serve the information needed by a user interface on a remote site.  The MQ/DB/API combination is a layer called Data.
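To make the queue-to-database flow concrete, here is a minimal sketch of what a synapse-like worker pool could look like.  It is not the actual IRCTHULU code: the in-process `queue.Queue`, the in-memory SQLite database, the `events` table, and the function names are all stand-ins assumed for illustration.

```python
import json
import queue
import sqlite3
import threading

# Hypothetical column order for the illustrative "events" table.
COLUMNS = ("type", "nick", "channel", "message", "timestamp", "host", "ident")


def transform(raw: str) -> tuple:
    """Parse one JSON message and return its fields in column order."""
    rec = json.loads(raw)
    return tuple(rec[col] for col in COLUMNS)


def drain(mq: queue.Queue, db: sqlite3.Connection, lock: threading.Lock) -> None:
    """Worker loop: pull messages off the queue until a None sentinel arrives."""
    while True:
        raw = mq.get()
        if raw is None:
            break
        row = transform(raw)
        # One shared connection, so writes are serialized with a lock.
        with lock:
            db.execute("INSERT INTO events VALUES (?, ?, ?, ?, ?, ?, ?)", row)


def run_pipeline(messages, workers: int = 4) -> sqlite3.Connection:
    """Feed messages through the queue and return the populated database."""
    mq = queue.Queue()
    db = sqlite3.connect(":memory:", check_same_thread=False)
    db.execute("CREATE TABLE events (%s)" % ", ".join(COLUMNS))
    lock = threading.Lock()
    threads = [
        threading.Thread(target=drain, args=(mq, db, lock)) for _ in range(workers)
    ]
    for t in threads:
        t.start()
    for msg in messages:
        mq.put(msg)
    for _ in threads:
        mq.put(None)  # one sentinel per worker so every thread exits
    for t in threads:
        t.join()
    db.commit()
    return db
```

A real deployment would swap the in-process queue for the actual message queue and SQLite for the database cluster; the shape of the loop (get, transform, insert) is the part this sketch is meant to show.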

The UI will be called Presenta.

So that’s Tenta, Data, Presenta.

What does performance look like?

The total throughput of the system was last measured at 30,000 messages per second, which is enough to log a very large number of channels.

What kind of Data does it record?

Here’s a data sample:

{"type": "pubmsg", "nick": "Mihaita", "channel": "#freenode", "message": "Phanes @robot?", "timestamp": "2017-12-14 03:27:36.762231", "host": "166.ip-54-37-156.eu", "ident": "~SirNeo"}
{"type": "pubmsg", "nick": "Phanes", "channel": "#freenode", "message": "SirNeo-, yeah its part of a project called ircthulu, its all google indexed eventually", "timestamp": "2017-12-14 03:28:17.214279", "host": "surro/founder/phanes", "ident": "~Phanes"}
{"type": "pubmsg", "nick": "SirNeo-", "channel": "#freenode", "message": "i will google it about latter", "timestamp": "2017-12-14 03:29:09.333669", "host": "166.ip-54-37-156.eu", "ident": "~SirNeo"}
{"type": "pubmsg", "nick": "Phanes", "channel": "#freenode", "message": "SirNeo-, its not public yet", "timestamp": "2017-12-14 03:29:16.679930", "host": "surro/founder/phanes", "ident": "~Phanes"}
{"type": "join", "nick": "[skyline__O__]", "channel": "#freenode", "message": "", "timestamp": "2017-12-14 03:29:22.442250", "host": "1.180.74.58", "ident": "~Shen"}
{"type": "join", "nick": "blkshp", "channel": "#freenode", "message": "", "timestamp": "2017-12-14 03:29:43.017512", "host": "about/windows/staff/blkshp", "ident": "~blkshp"}
{"type": "join", "nick": "SirNeo-", "channel": "#freenode", "message": "", "timestamp": "2017-12-14 03:29:49.518446", "host": "unaffiliated/sirneo", "ident": "~SirNeo"}
{"type": "pubmsg", "nick": "Phanes", "channel": "#freenode", "message": "SirNeo-, it uses a \"guerilla\" pattern for data collection points from disparate, distributed hosts via a background system process installed via rpm, so there's like no way to tell who it is and it doesn't log any events that could be used to detect the bot through the logs, also waits a random amount of time after starting to begin logging", "timestamp": "2017-12-14 03:31:17.932267", "host": "surro/founder/phanes", "ident": "~Phanes"}
{"type": "pubmsg", "nick": "SirNeo-", "channel": "#freenode", "message": "Phanes, how do you know we don't use allready a VPS?", "timestamp": "2017-12-14 03:31:20.409542", "host": "unaffiliated/sirneo", "ident": "~SirNeo"}
{"type": "pubmsg", "nick": "Phanes", "channel": "#freenode", "message": "channel redundancy", "timestamp": "2017-12-14 03:31:52.678278", "host": "surro/founder/phanes", "ident": "~Phanes"}
{"type": "pubmsg", "nick": "Phanes", "channel": "#freenode", "message": "autorecover", "timestamp": "2017-12-14 03:32:05.244805", "host": "surro/founder/phanes", "ident": "~Phanes"}
{"type": "pubmsg", "nick": "Phanes", "channel": "#freenode", "message": "bug tested for \"gotchas\" to kill the bots", "timestamp": "2017-12-14 03:32:21.725860", "host": "surro/founder/phanes", "ident": "~Phanes"}
{"type": "pubmsg", "nick": "Phanes", "channel": "#freenode", "message": "the metadata separation is the cool part though", "timestamp": "2017-12-14 03:32:53.327834", "host": "surro/founder/phanes", "ident": "~Phanes"}
{"type": "join", "nick": "skyline__O__", "channel": "#freenode", "message": "", "timestamp": "2017-12-14 03:33:03.015222", "host": "1.180.74.58", "ident": "~Shen"}

It’s recording the message type (although some message types, like some CTCPs or DCC, are restricted from logging), the nick of the sender, the channel the message occurred on, the message itself if there is one, the timestamp of the message (server), the host of the user submitting the message (or cloak if they have one), and the ident value associated with that host at the time of the message submission.  Tonight I will also be adding an event handler for nick changes to capture that metadata.
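For anyone who wants to work with records in this shape, here is a small sketch of loading one line of the sample into a typed object.  The `Event` class and `parse_event` function are names made up for this example; only the field names and the timestamp format come from the sample above.

```python
import json
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class Event:
    """One logged IRC event, mirroring the fields in the data sample."""
    type: str
    nick: str
    channel: str
    message: str
    timestamp: datetime
    host: str
    ident: str


def parse_event(line: str) -> Event:
    """Parse one JSON line into an Event, converting the timestamp string."""
    rec = json.loads(line)
    rec["timestamp"] = datetime.strptime(rec["timestamp"], "%Y-%m-%d %H:%M:%S.%f")
    return Event(**rec)
```

Note that join events carry an empty message string rather than omitting the field, so every record can be parsed the same way.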

That’s oddly specific.  What purpose do these datapoints serve?

Great question.  This will allow the UI to serve as a search engine of sorts, allowing you to browse logs, identify sock puppets or trends, or even run statistical research or machine learning algorithms on the data to detect anomalies.  I’ll explain one example use case:

A user, FrankG, likes to use sock puppets to harass people online.  He normally connects as FrankG!dirtyfrank@224.223.221.33.  You suspect that he has goofed at some point and broken OPSEC by reusing the same ident or host on one of his sock puppets.  This is easy to check: find a log that captured Frank talking, either by searching for the username FrankG or by going to a channel and time where you know he was speaking.  From there you can click on his ident, dirtyfrank, and see all the users captured who used that ident.  Or you can click on his host, 224.223.221.33, and see all the users who have ever connected and posted a message in a logged channel from that host.  Or you can click on his username, FrankG, and see all the hosts and idents that have ever been used by that username and begin doing deeper analytics.  Obviously, the larger the dataset, the more effective it is.
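The three pivots in that example (by ident, by host, by username) boil down to three lookup tables.  Here is a rough sketch of the idea using plain dictionaries; the `PivotIndex` class and its attribute names are invented for illustration and say nothing about how the real API or UI will be built.

```python
from collections import defaultdict


class PivotIndex:
    """Illustrative lookup tables for pivoting between nick, ident, and host."""

    def __init__(self):
        self.nicks_by_ident = defaultdict(set)   # ident -> nicks seen using it
        self.nicks_by_host = defaultdict(set)    # host -> nicks seen from it
        self.origins_by_nick = defaultdict(set)  # nick -> (ident, host) pairs

    def add(self, nick: str, ident: str, host: str) -> None:
        """Record one observed nick!ident@host combination."""
        self.nicks_by_ident[ident].add(nick)
        self.nicks_by_host[host].add(nick)
        self.origins_by_nick[nick].add((ident, host))


def build_index(events) -> PivotIndex:
    """Build the index from an iterable of dicts shaped like the data sample."""
    index = PivotIndex()
    for ev in events:
        index.add(ev["nick"], ev["ident"], ev["host"])
    return index
```

If a second nick ever shows up with the ident dirtyfrank, or from 224.223.221.33, it lands in the same set as FrankG, and that is the whole trick.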

This is in addition to a more important community need being filled:  IRC is a lost trove of knowledge for people looking to learn more about a technical area.  These channels are like living textbooks for people solving problems, particularly on open source systems.  By providing this resource for as many channels as possible we are opening up those channels to google search indexing, giving someone looking for a solution to a problem a greater likelihood of finding their answer when googling.  Cool amirite?

What does the legal model look like?

Great question.

I own the database cluster that stores the feeds, and I own the UI.  The two are on different domains and can be decoupled at any time and even given to someone else.  Both the database layer and the UI publish only the feeds that are given to them, and all of the feeds that are given to them.  The feeds are public domain.

The runners are, of course, expected to comply with Terms of Use and EULA for any networks they connect to and I trust them not to create liabilities for themselves with it.  Ultimately this is a discussion between the runners and the network owners.

In terms of IP, I’m keeping it closed source for now, but obviously if it serves my interests I can make it open source and release it under the AGPL.  My only concern is that if I do that it will be abused and spawn many spinoffs that I cannot control.  And, obviously, if I have to transfer ownership of the DB or UI layers I’ll lose any ability to control content and will have to release the source.

What’s left to do before it goes public?

I’ve got some adjustments to make in the transformation layer of the synapse component to really polish it out and then it’s the API, UI, and SEO.

How do channels “opt out”?

Some channels or networks may desire not to be logged and this is certainly understandable.

They can simply email me at punches.chris@gmail.com and request it, stating why.

While a reasonable justification is required, the overwhelming majority of channels will be approved with no issue; some will be expected to make adjustments to qualify and some in particular should view those adjustments as sacrifices.

IRCThulu, ultimately, is a tentacle monster by design, it’s the size of a building, and it’s destroying reality on IRC.  Well, not all of reality, just a particular kind.

© 2021 Phanes' Canon

The Personal Blog of Chris Punches