Thursday, 1 September 2016

Polish usenet archive


[Update 18.11.2016: Polish characters in the descriptions of some groups have been fixed, and previously inaccessible messages have been made visible again.]

The links above lead to the most complete archive of Polish discussion groups (usenet, news). It can be read with a dedicated reader.

The archive was created in June and July 2016 from multiple sources. The oldest available messages date from 1996. Unfortunately, despite the use of many sources, some messages are still lost in the mists of time (or in Google Groups, which amounts to much the same thing).

The archive was processed with the tools that make up the Usenet Archive Toolkit:
  • No duplicate messages are stored.
  • All groups were run through a spam filter (only messages that started a thread and received no replies were checked).
  • Messages were converted to UTF-8, accounting for most problems caused by readers misapplying standards, wrong or missing character encoding declarations, and so on.
  • A precomputed graph of dependencies between messages (the threading structure) is available. Where possible, dependencies that follow directly from quotations are also included (when the relevant headers are missing). This is especially helpful for groups that were linked to mailing lists or to FidoNet.
  • Message search is also available.
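
The UTF-8 conversion step can be pictured with a small sketch. This is not how the toolkit does it; it is a deliberately naive heuristic, and the function name `to_utf8`, the order of attempts and the ISO-8859-2 fallback are all assumptions made for illustration (ISO-8859-2 was picked only because it is a plausible guess for Polish groups).

```python
from typing import Optional

def to_utf8(raw: bytes, declared: Optional[str] = None,
            fallback: str = "iso-8859-2") -> str:
    """Decode message bytes without trusting the declared charset too much.

    Order of attempts (an assumption, not the toolkit's logic):
    strict UTF-8, then the declared charset, then a fallback that
    accepts every byte value.
    """
    candidates = ["utf-8"]
    if declared:
        candidates.append(declared)
    for name in candidates:
        try:
            return raw.decode(name)
        except (UnicodeDecodeError, LookupError):
            # Wrong declaration or unknown charset name: try the next one.
            continue
    return raw.decode(fallback)

# A message that claims to be ASCII but is really ISO-8859-2.
text = to_utf8("zażółć".encode("iso-8859-2"), declared="us-ascii")
```

Strict UTF-8 goes first because text in a legacy 8-bit encoding almost never happens to be valid UTF-8, so a successful decode is a strong hint even when the header says otherwise.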

Usenet Archive Toolkit


The Usenet Archive Toolkit project aims to provide a set of tools to process various sources of usenet messages into a coherent, searchable archive.


Usenet is dead. You may believe it's not, but it really is.

People went away to various forums, facebooks and twitters and seem fine there. Meanwhile, the old discussions slowly rot away. Google Groups is a sad, unusable joke. Other available datasets, at least with regard to Polish usenet archives, are vastly incomplete. There is no easy way to get the data, browse it, or search it. So, maybe something needs to be done. How hard can it be anyway? (Not very: one month for a working prototype, another one for polish and bugfixing.)


Why use UAT? Why not use existing solutions, like Google Groups, existing archive dumps or NNTP servers with a long history?
  • UAT is designed for offline work. You don't need a network connection to access data in "the cloud". You don't need to wait for a reply to your query, or, god forbid, endure "web 2.0" interfaces.
  • UAT archives won't suddenly disappear. You have them on your disk. Google Groups is deteriorating with each new iteration of the interface. Also, Google is known for shutting down services it no longer considers viable: Google Reader, Google Code Search, Google Code, etc. Other, smaller services are one disk crash away from completely disappearing from the network.
  • UAT archive format is designed for fast access and efficient search. Each message is individually compressed, to facilitate instant access, but uses a whole-archive dictionary for better compression. Search is achieved through a database similar in design to the one described in Google's original paper. Total archive size is smaller than the uncompressed collection of messages.
  • Multiple message sources may be merged into a single UAT archive, without message duplication. This way you can fill blanks in source A (eg. NNTP archive server) with messages from source B (eg. much smaller dump). Archives created in such a way are the most complete collection of messages available.
  • UAT archives do not contain duplicate messages (which is common even on NNTP servers), nor stray messages from other groups (some collections contain many bogus messages).
  • Other usenet archives are littered with spam messages. UAT can filter out spam, making previously unreadable newsgroups a breeze to read. A properly trained spam database has very low false positive and false negative rates.
  • All messages are transcoded to UTF-8, so that dumb clients may be used for display. UAT tries very hard to properly decode broken and/or completely invalid headers, messages without specified encoding or with bad encoding. HTML parts of messages are removed. You also don't need to worry about parsing quoted-printable content (most likely malformed). And don't forget about search. Have fun grepping that base64 encoded message without UAT.
  • UAT archives contain precalculated message connectivity graph, which removes the need to parse "references" headers (often broken), sort messages by date, etc. UAT can also "restore" missing connectivity that is not indicated in message headers, through search for quoted text in other messages.
  • Access to archives is available through a trivial libuat interface.
  • UAT archives are mapped to memory and 100% disk backed. In high memory pressure situations archive pages may just be purged away and later reloaded on demand. No memory allocations are required during normal libuat operation, other than:
    • Small, static growing buffer used to decompress single message into.
    • std::vectors used during search operation.
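
To make the "individually compressed, but with a whole-archive dictionary" idea concrete, here is a minimal sketch. It swaps in zlib's preset dictionary support from the Python standard library instead of zstd, and the dictionary contents, the `pack`/`unpack` names and the sample messages are made up for illustration. A real dictionary would be trained on the whole archive rather than written by hand.

```python
import zlib

# Hypothetical shared dictionary: boilerplate that repeats in every message.
SHARED_DICT = (
    b"Path: news.example.invalid!not-for-mail\r\n"
    b"Newsgroups: pl.test\r\n"
    b"Content-Type: text/plain; charset=ISO-8859-2\r\n"
    b"Content-Transfer-Encoding: 8bit\r\n"
)

def pack(message: bytes, zdict: bytes = SHARED_DICT) -> bytes:
    """Compress one message on its own, primed with the shared dictionary."""
    compressor = zlib.compressobj(level=9, zdict=zdict)
    return compressor.compress(message) + compressor.flush()

def unpack(blob: bytes, zdict: bytes = SHARED_DICT) -> bytes:
    """Decompress one message; only the dictionary and this blob are needed."""
    decompressor = zlib.decompressobj(zdict=zdict)
    return decompressor.decompress(blob) + decompressor.flush()

messages = [
    SHARED_DICT + b"Subject: test %d\r\n\r\nHello, usenet!\r\n" % i
    for i in range(3)
]
archive = [pack(m) for m in messages]

# Any single message can be read back without touching its neighbours.
second = unpack(archive[1])
```

The point of the design is that random access stays cheap (one small blob per message), while the redundancy shared between messages is paid for once, in the dictionary.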

Toolkit description

UAT provides a multitude of utilities, each specialized for its own task. You can find a brief description of each one below.

Import Formats

Usenet messages may be retrieved from a number of different sources. Currently we support:
  • import-source-slrnpull --- Import from a directory where each file is a separate message (slrnpull was chosen because of extra-simple setup required to get it working).
  • import-source-slrnpull-7z --- Import from a slrnpull directory compressed into a single 7z compressed file.
  • import-source-mbox --- Import from a file in mbox format, in which all posts are merged into a single file (some collections of usenet messages are kept this way).
Imported messages are stored in a per-message LZ4 compressed meta+payload database.

Data Processing

Raw imported messages have to be processed to be of any use. We provide the following utilities:
  • extract-msgid --- Extracts unique identifier of each message and builds reference table for fast access to any message through its ID.
  • extract-msgmeta --- Extracts "From" and "Subject" fields, as a quick reference for archive browsers.
  • merge-raw --- Merges two imported data sets into one. Does not duplicate messages.
  • utf8ize --- Converts messages to a common character encoding, UTF-8.
  • connectivity --- Calculate connectivity graph of messages. Also parses "Date" field, as it's required for chronological sorting.
  • threadify --- Some messages do not have connectivity data embedded in headers. Eg. it's a common artifact of using news-email gateways. This tool parses top-level messages, looking for quotations, then it searches other messages for these quotes and creates (not restores! it was never there!) missing connectivity between children and parents.
  • repack-zstd --- Builds a common dictionary for all messages and recompresses them to a zstd meta+payload+dict database.
  • repack-lz4 --- Converts zstd database to LZ4 database.
  • package --- Packages all databases into a single file. Supports unpacking.
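
What the connectivity step has to compute can be sketched in a few lines. This is only an illustration built on Python's standard email module, not the toolkit's code: the function names and the choice of the nearest ancestor that is actually present in the archive as the parent are assumptions.

```python
import re
from email import message_from_string
from email.utils import mktime_tz, parsedate_tz

MSGID_RE = re.compile(r"<[^<>\s]+>")

def parse_date(value):
    """Return a Unix timestamp, or 0 when the Date header is missing or broken."""
    parsed = parsedate_tz(value or "")
    return mktime_tz(parsed) if parsed else 0

def build_connectivity(raw_messages):
    """Return (parent, children) maps keyed by Message-ID."""
    info = {}
    for raw in raw_messages:
        msg = message_from_string(raw)
        msgid = (msg.get("Message-ID") or "").strip()
        if msgid:
            refs = MSGID_RE.findall(msg.get("References") or "")
            info[msgid] = (refs, parse_date(msg.get("Date")))

    parent = {}
    children = {msgid: [] for msgid in info}
    for msgid, (refs, _) in info.items():
        # References lists ancestors oldest first, so walk it backwards and
        # take the closest ancestor that the archive actually contains.
        found = next((r for r in reversed(refs) if r in info and r != msgid), None)
        parent[msgid] = found
        if found is not None:
            children[found].append(msgid)

    # Replies are shown in chronological order.
    for kids in children.values():
        kids.sort(key=lambda k: info[k][1])
    return parent, children
```

Messages whose ancestors are all missing end up with no parent and become thread roots, which is exactly the case the quote-matching pass is meant to patch up.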

Data Filtering

Raw data right after import is highly unfit for direct use. Messages are duplicated, there's spam. These utilities help clean it up:
  • kill-duplicates --- Removes duplicate messages. It is relatively rare, but data sets from even a single NNTP server may contain the same message twice.
  • filter-newsgroups --- Some data sources (eg. the giganews collection) contain messages that were not sent to the collection's newsgroup. This utility will remove such bogus messages.
  • filter-spam --- Learns which messages look like spam and removes them.
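
As a rough sketch of the two simpler filters (and of merging without duplication), assuming messages are plain strings and the Message-ID header is the identity of a message. The function names and the decision to skip messages without a Message-ID are illustrative choices, not the toolkit's behaviour.

```python
from email import message_from_string

def merge_sources(*sources):
    """Merge message collections, keeping the first copy of each Message-ID."""
    merged = {}
    for source in sources:
        for raw in source:
            msgid = (message_from_string(raw).get("Message-ID") or "").strip()
            # Assumption: messages without an ID are skipped in this sketch.
            if msgid and msgid not in merged:
                merged[msgid] = raw
    return merged

def posted_to(raw, group):
    """True if the message's Newsgroups header lists the given group."""
    header = message_from_string(raw).get("Newsgroups") or ""
    return group in [name.strip() for name in header.split(",")]

source_a = [
    "Message-ID: <1@x>\nNewsgroups: pl.test\n\nfirst\n",
    "Message-ID: <2@x>\nNewsgroups: pl.test,pl.misc\n\nsecond\n",
]
source_b = [
    "Message-ID: <2@x>\nNewsgroups: pl.test,pl.misc\n\nsecond\n",
    "Message-ID: <3@x>\nNewsgroups: alt.other\n\nstray\n",
]

merged = merge_sources(source_a, source_b)
kept = {mid: raw for mid, raw in merged.items() if posted_to(raw, "pl.test")}
```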
Search in archive is performed with the help of a word lexicon. The following tools are used for its preparation:
  • lexicon --- Build a list of words and hit-tables for each word.
  • lexopt --- Optimize lexicon string database.
  • lexstats --- Display lexicon statistics.
  • lexdist --- Calculate distances between words (unused).
  • lexhash --- Prepare lexicon hash table.
  • lexsort --- Sort lexicon data.
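
The general shape of a word lexicon with per-word hit tables is a plain inverted index. The sketch below shows that shape only. The tokenizer, the hit counting and the ranking are assumptions picked for brevity and say nothing about the actual on-disk layout.

```python
import re
from collections import defaultdict

WORD_RE = re.compile(r"\w+")

def build_lexicon(messages):
    """Map each word to a hit table: {message index: number of occurrences}."""
    lexicon = defaultdict(dict)
    for index, text in enumerate(messages):
        for word in WORD_RE.findall(text.lower()):
            lexicon[word][index] = lexicon[word].get(index, 0) + 1
    return lexicon

def search(lexicon, query):
    """Return indices of messages containing every query word, best hits first."""
    words = WORD_RE.findall(query.lower())
    if not words:
        return []
    hits = [lexicon.get(word, {}) for word in words]
    common = set(hits[0]).intersection(*hits[1:])
    # Rank by total number of hits, then by message index for stable output.
    return sorted(common, key=lambda i: (-sum(h[i] for h in hits), i))

messages = [
    "ETC1 texture compression is fast",
    "texture texture filtering",
    "usenet archive",
]
lexicon = build_lexicon(messages)
```

A query never scans message bodies: it only intersects the hit tables of the words involved, which is why the lexicon has to be rebuilt whenever the message numbering changes.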

Data Access

These tools provide access to archive data:
  • query-raw --- Implements queries on LZ4 database. Requires results of extract-msgid utility. Supports:
    • Message count.
    • Listing of message identifiers.
    • Query message by identifier.
    • Query message by database record number.
  • libuat --- Archive access library. Operates on zstd database.
  • query --- Testbed for libuat. Exposes all provided functionality.

End-user Utilities

  • browser --- Graphical browser of archives.

Future work ideas

Here are some viable ideas that I'm not really planning to do any time soon, but which would be nice to have:
  • Implement messages extractor, for example in mbox format. Would need to properly encode headers and add content encoding information (UTF-8 everywhere).
  • Implement a read-only NNTP server. Would need to properly encode headers and add content encoding information. 7-bit cleanness probably would be nice, so also encode as quoted-printable. Some headers may need to be rewritten (eg. "Lines", which most probably won't be true, due to MIME processing). Message sorting by date may be necessary to put some sense into internal message numbers, which currently have no meaning at all.
  • Implement pan-group search mechanism.
  • Query google groups for missing messages present in "references" header.


Usenet Archive Toolkit operates on a couple of distinct databases. Each utility requires a specific set of these databases and produces its own database, or creates a completely new database indexing schema, which invalidates the rest of the databases.

slrnpull directory → import-source-slrnpull → produces: LZ4
slrnpull compressed → import-source-slrnpull-7z → produces: LZ4
mbox file → import-source-mbox → produces: LZ4
LZ4 → kill-duplicates → produces: LZ4
LZ4 → extract-msgid → adds: msgid
LZ4, msgid → connectivity → adds: conn
LZ4, conn → filter-newsgroups → produces: LZ4
LZ4, msgid, conn, str → filter-spam → produces: LZ4
LZ4 → extract-msgmeta → adds: str
(LZ4, msgid) + (LZ4, msgid) → merge-raw → produces: LZ4
LZ4 → utf8ize → produces: LZ4
LZ4 → repack-zstd → adds: zstd
zstd → repack-lz4 → adds: LZ4
LZ4, conn → lexicon → adds: lex
lex → lexopt → modifies: lex
lex → lexhash → adds: lexhash
lex → lexsort → modifies: lex
lex → lexdist → adds: lexdist (unused)
lex → lexstats → user interaction
LZ4, msgid → query-raw → user interaction
zstd, msgid, conn, str, lex, lexhash → libuat → user interaction
everything but LZ4 → package → one file archive
everything but LZ4 → threadify → modifies: conn, invalidates: lex, lexhash
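
The dependency list above can be read as a small planning problem: given the databases you already have, which utilities can run, and in what order, until libuat has everything it needs? The sketch below encodes only the rows that add a new database (the steps that rewrite the LZ4 data, such as kill-duplicates or utf8ize, are left out), and the `plan` function is a made-up helper, not part of the toolkit.

```python
# tool name -> (databases required, databases added), copied from the list above
STEPS = {
    "extract-msgid":   ({"LZ4"}, {"msgid"}),
    "extract-msgmeta": ({"LZ4"}, {"str"}),
    "connectivity":    ({"LZ4", "msgid"}, {"conn"}),
    "repack-zstd":     ({"LZ4"}, {"zstd"}),
    "lexicon":         ({"LZ4", "conn"}, {"lex"}),
    "lexhash":         ({"lex"}, {"lexhash"}),
}

LIBUAT_NEEDS = {"zstd", "msgid", "conn", "str", "lex", "lexhash"}

def plan(have, goal=LIBUAT_NEEDS):
    """Return an order of tools that reaches `goal`, or None if it is unreachable."""
    have = set(have)
    order = []
    progress = True
    while not goal <= have and progress:
        progress = False
        for tool, (needs, adds) in STEPS.items():
            # Run a tool once its inputs exist and it would add something new.
            if needs <= have and not adds <= have:
                have |= adds
                order.append(tool)
                progress = True
    return order if goal <= have else None

order = plan({"LZ4"})
```

Because filtering and merging create a new message numbering, they belong before any of these steps. Running them later means starting the plan over from a bare LZ4 database.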

Additional, optional information files, not created by any of the above utilities, but used in user-facing programs:
  • name --- Group name.
  • desc_short --- A short description about the purpose of the group (per 7.6.6 in RFC 3977).
  • desc_long --- Group charter. (Some newsgroups regularly post a description to the group that describes its intention. These descriptions are posted by the people involved with the newsgroup creation and/or administration. If the group has such a description, it almost always includes the word "charter", so you can quickly find it by searching the newsgroup for that word. A charter is the "set of rules and guidelines" which supposedly govern the users of that group.)


Be advised that some utilities (repack-zstd, lexicon) do require enormous amounts of memory. Processing large groups (eg. 2 million messages, 3 GB data) will swap heavily on a 16 GB machine.

utf8ize doesn't compile on MSVC. Either compile it on cygwin, or have fun banging glib and gmime into submission. Your choice.

UAT only works on 64 bit machines.



Wednesday, 27 January 2016

etcpak 0.5

etcpak strikes again, this time with version 0.5, which has the ability to calculate planar blocks from the ETC2 standard. Color gradients, which were a sore spot in the image quality previously, will now have a much smoother look. This new option is activated by passing -etc2 parameter and comes at a small time cost (152% of pure ETC1 mode, 77 ms vs 117 ms). Example compressed image:

Planar block count in this image is quite high, as can be seen on the following debug image, where blue color indicates planar mode:

It should be noted that AVX2 version of planar block compression does not produce the same results as scalar one. Keep that in mind on pre-Haswell machines.
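
For reference, this is what a planar block looks like from the decoder's side, following the interpolation formula in the ETC2 specification. It is a sketch under simplifying assumptions: the three base colors are taken as already expanded to 8 bits per channel (the RGB676 bit unpacking is skipped), and it says nothing about how etcpak chooses those colors when compressing.

```python
def decode_planar(origin, horizontal, vertical):
    """Decode one 4x4 ETC2 planar block into rows of (r, g, b) tuples.

    origin is the color at pixel (0, 0); horizontal and vertical are the
    colors the gradient would reach at x = 4 and y = 4, just outside the block.
    """
    def clamp(value):
        return max(0, min(255, value))

    block = []
    for y in range(4):
        row = []
        for x in range(4):
            row.append(tuple(
                clamp((x * (h - o) + y * (v - o) + 4 * o + 2) >> 2)
                for o, h, v in zip(origin, horizontal, vertical)
            ))
        block.append(row)
    return block

# Red ramps up along x, blue ramps up along y.
block = decode_planar((0, 0, 0), (255, 0, 0), (0, 0, 255))
```

Every pixel gets its own interpolated color instead of one of four shades of a base color, which is why smooth gradients benefit so much from this mode.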


Saturday, 2 January 2016

etcpak 0.4

New year, new etcpak. Previously etcpak was an order of magnitude faster than the competing ETC compressors. This new version is yet another order of magnitude faster.

Time to compress 8K image:
etcpak 0.3: 655 ms
etcpak 0.4: 77 ms

This is thanks to Daniel Jungmann, who submitted patches implementing SSE 4.1 and AVX2 instructions. SSE 4.1 is now default and required (supported by Core 2 and AMD Bulldozer CPUs) and AVX2 is detected at runtime (support added in Haswell CPUs).

Other new features:
  • ETC1 dissection mode, allowing in-depth inspection of compressed images.
  • Mip-maps will be calculated only if needed (doh!).
  • Alpha channel compression is now deterministic.
  • Minor changes in compression precision. This will change checksums of most compressed images, compared with previous versions.
  • Multithreaded job dispatch algorithm was improved.
  • zlib/PNG checksum validation was removed, resulting in 12% improvement in image load times.

Tuesday, 16 September 2014

Review of two bad joysticks

Logitech Attack 3

I don't even know why I have it. Doesn't have twist handle, nor any hat, so it's likely to be never actually used by me. Let's see some customer reviews:
So it's quite good, right?

Lol nope. Have fun trying to fly something with this absolute turd.

Madcatz and/or Saitek Cyborg F.L.Y. 5

Cheap(ish) and I think a long time ago the Internets said it was good. Let's see.

Definitely better than that shit above, but still not quite good. Discontinuities at centers of axes and a massive dead zone. As for the twist axis:

That's a constant rotary motion, without any pauses at center. When you combine it with bad build quality you get something you don't really want to have.

Saturday, 6 September 2014

Heavy Duty boots: what's inside?

I had a pair of boots. HD 863, I think. The sole cracked and they started taking in water. I don't remember after how many years of wear, but probably around 3-4. I needed a piece of good leather, and I don't know my way around second-hand shops, so I had to make do with what was at hand.

After unpicking the back strip. The heel counter stiffener is made of leather and glued in.

Reinforcement patch under the rivets. Stitched and glued.
The patches after unpicking.
The tongue after being unpicked from the quarter.
The quarter unpicked from the vamp.
Side view.
The tongue unpicked from the vamp.
The midsole is made of some kind of felt, completely soaked in glue. Stiff and hard.
View inside.
The quarter after being cut off.
Successive layers of the heel counter.
The vamp, cut off.
The toe cap, cut off.
The crack in the sole. The folded-over pieces of leather are tacked down with small nails and stitched to the midsole. Incidentally, you can see how much the joke screws in the sole and the joke stitching on the welt are worth.
After cutting through the sole and the stitching, and pulling out a nail.
Leather thickness.
The boot's second life.

Friday, 30 May 2014

etcpak 0.3

New major version, new features:
  • Ability to create mipmaps (only POT, not in benchmark mode).
  • Optional dithering of input image.
  • Small quality improvements at basically no cost.
Image minification algorithm used for generating mipmaps is stupid simple, but it already beats the implementation in PVRTexTool:
Left: PVRTexTool, Right: etcpak
Notice the high frequency artifacts present in the PVRTexTool image, particularly near the eyes of parrots. etcpak generates smoother and more natural look. Further refinements will be able to improve the image quality even more.

Dithering basically improves the appearance of gradients or smooth areas in photos:
Left: no dithering, Right: dithering enabled
It comes at a small cost however. Here are the timings for normal compression:
$ x64/Release/etcpak.exe 8192.png -b
Image load time: 1352.646 ms
Mean compression time for 50 runs: 630.855 ms
And this is the run with dithering enabled:

$ x64/Release/etcpak.exe 8192.png -b -d
Image load time: 1312.084 ms
Mean compression time for 50 runs: 744.394 ms