Projects & Learnings

Technical notes, experiments and updates

lanseg@proton.me - github.com/lanseg - plans and projects


A note on analyzing Epstein files

epstein
face recognition
pdf

So, the FBI released another batch of the Epstein files: hundreds of PDF documents with emails, flight logs, chat extracts and photos. Too much for me to handle manually, especially since I don't have a powerful GPU that can run advanced multimodal LLMs.

But I have plenty of disk space, and these preparation steps were really helpful:

  1. Split PDF documents into text and pictures.
  2. Detect faces: find photos with faces.
  3. Generate vectors for faces and cluster them.
Step             | Number of files                    | Duration
Splitting PDFs   | 346485 PDFs → 557466 PNGs          | ~10 minutes
Finding faces    | 557466 PNGs → 1022 PNGs with faces | ~16 hours
Clustering faces | 1022 PNGs → 92 clusters            | ~3 minutes

Both face detection and clustering are straightforward: face_recognition (from PyPI) handles the faces, and DBSCAN from scikit-learn does the clustering. The only tuning I did was adding n_jobs=-1 to let DBSCAN use all the CPU cores.
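The clustering step can be sketched like this. It assumes the 128-dimensional face vectors from face_recognition.face_encodings() were already computed; the eps value, min_samples and the grouping helper are illustrative guesses, not the exact code from the scripts:

```python
# Sketch of the clustering step, assuming 128-d face vectors were
# already produced by face_recognition.face_encodings().
# eps and min_samples here are illustrative values to tune.
import numpy as np
from sklearn.cluster import DBSCAN

def cluster_faces(encodings, sources, eps=0.4):
    """Group face vectors; returns {cluster_id: set of source files}."""
    labels = DBSCAN(eps=eps, min_samples=2, metric="euclidean",
                    n_jobs=-1).fit(np.asarray(encodings)).labels_
    clusters = {}
    for label, path in zip(labels, sources):
        if label != -1:  # -1 marks "noise": faces without a cluster
            clusters.setdefault(int(label), set()).add(path)
    return clusters

# Tiny synthetic demo: two tight groups of vectors far apart
rng = np.random.default_rng(0)
a = rng.normal(0.0, 0.01, size=(3, 128))
b = rng.normal(5.0, 0.01, size=(3, 128))
demo = cluster_faces(np.vstack([a, b]),
                     ["a1.png", "a2.png", "a3.png",
                      "b1.png", "b2.png", "b3.png"])
```

With real data, the interesting part is picking eps: too small and the same person splits into several clusters, too large and different people merge.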

Initial issues with the scripts

The first version of the script was not very good. Some of its shortcomings were deliberate, like copying files instead of symlinking them, and I had reasons to do it that way. But there were also things I simply didn't notice.

So now I have a better version: instead of copying files and writing the metadata to JSON files, it stores everything in an SQLite database, without copying or symlinking the files. I use the database to answer queries like "which people appear on the same photos as this person?"
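A co-occurrence query of that kind can be sketched with a self-join. The schema here (a single faces table linking a photo file to a person's cluster id) is my assumption for illustration, not necessarily what db.py actually creates:

```python
# Sketch of the "who appears together with person X" query.
# Assumed schema: faces(file, cluster), where cluster identifies a person.
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE faces (file TEXT, cluster INTEGER)")
con.executemany("INSERT INTO faces VALUES (?, ?)", [
    ("photo1.png", 1), ("photo1.png", 2),
    ("photo2.png", 1), ("photo2.png", 3),
    ("photo3.png", 2),
])

def appears_with(con, person):
    """Clusters that share at least one photo with the given cluster."""
    rows = con.execute("""
        SELECT DISTINCT other.cluster
        FROM faces AS person
        JOIN faces AS other
          ON other.file = person.file AND other.cluster != person.cluster
        WHERE person.cluster = ?
        ORDER BY other.cluster""", (person,)).fetchall()
    return [cluster for (cluster,) in rows]

print(appears_with(con, 1))  # person 1 shares photos with clusters 2 and 3
```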

The scripts: split_pdf.py, find_faces.py, db.py.

A note on integrating self-hosted AI models

ollama
firejail
docker
selfhosted

Prev: Note on self-hosted isolated AI

It's quite easy to isolate an Ollama instance: just run it inside a firejail sandbox with networking disabled. If you need a script that interacts with the model, you can run it in the same sandbox with the "--join" flag. But what if you want to make Ollama available to other applications like IDEs or agents, while keeping it isolated from the internet?

Docker

The simplest way to achieve this is to run Ollama inside a docker container with an internal network (that will block internet access by default) and create a socat gateway which can see both the internal and host networks.
networks:
  ollama-local:
    driver: bridge
    # internal: true
  external:
    driver: bridge

services:
  ollama:
    image: ollama/ollama:rocm
    environment:
      - OLLAMA_DEBUG=1
      - OLLAMA_NUM_THREADS=15  # nproc - 1
    volumes:
      - ./ollama_home:/root/.ollama
    devices:
      - /dev/kfd
      - /dev/dri
    networks:
      - ollama-local
  gateway:
    image: alpine/socat:latest
    command: "TCP-LISTEN:11434,fork,reuseaddr TCP:ollama:11434"
    depends_on:
      - ollama
    ports:
      - "11434:11434"
    networks:
      - ollama-local
      - external
docker-compose.yaml Docker compose to run an isolated Ollama instance.

Firejail with a limited network access

Less overhead, but more manual setup, and it also requires extra permissions to apply the firewall configuration. While the firejail profile stays simple, you will also need a netfilter ruleset and a script that sets up and removes a bridge interface.

OLLAMA_HOST=10.10.20.2 firejail --profile=./ollama.profile ./ollama serve
Starting ollama: same as before, but with the custom host

The script is not universal, so pay attention to what you paste and check if it interferes with your existing network.

#!/bin/bash
set -euo pipefail

IFNAME="enp1s0f0"
BRNAME="firebridge"
BRADDR="10.10.20.2"

if [[ "$1" == "up" ]]; then
  brctl addbr $BRNAME
  ip addr add 10.10.20.1/24 dev $BRNAME
  ip link set $BRNAME up
  iptables -t nat -A POSTROUTING -o $IFNAME -s 10.10.20.0/24 -j MASQUERADE
  iptables -t nat -A OUTPUT -m addrtype --src-type LOCAL --dst-type LOCAL -p tcp --dport 11434 -j DNAT --to-destination $BRADDR:11434
  iptables -t nat -A POSTROUTING -m addrtype --src-type LOCAL --dst-type UNICAST -p tcp -d $BRADDR --dport 11434 -j MASQUERADE
  sysctl -w net.ipv4.conf.all.route_localnet=1
elif [[ "$1" == "down" ]]; then
  iptables -t nat -D POSTROUTING -o $IFNAME -s 10.10.20.0/24 -j MASQUERADE
  iptables -t nat -D OUTPUT -m addrtype --src-type LOCAL --dst-type LOCAL -p tcp --dport 11434 -j DNAT --to-destination $BRADDR:11434 2>/dev/null
  iptables -t nat -D POSTROUTING -m addrtype --src-type LOCAL --dst-type UNICAST -p tcp -d $BRADDR --dport 11434 -j MASQUERADE 2>/dev/null
  sysctl -w net.ipv4.conf.all.route_localnet=0
  ip link set $BRNAME down
  brctl delbr $BRNAME
else
  echo "Usage: $0 {up|down}"
  exit 1
fi
bridge.sh Script to set up/clean up a bridge and packet forwarding.
A strict firejail profile: use a custom home folder and the custom network configuration:
name ollama
net firebridge
ip 10.10.20.2
netfilter ./ollama.netfilter
private /home/arusakov/devel/c2c/local-agent/local-agent-firejail/ollama_home
ollama.profile Firejail profile to run an isolated Ollama instance.
Netfilter ruleset: block everything except the default Ollama port:
*filter
:INPUT DROP [0:0]
:FORWARD DROP [0:0]
:OUTPUT DROP [0:0]
-A INPUT -i lo -j ACCEPT
-A INPUT -m conntrack --ctstate RELATED,ESTABLISHED -j ACCEPT
-A INPUT -s 10.10.20.1 -p tcp --dport 11434 -m conntrack --ctstate NEW -j ACCEPT
-A OUTPUT -o lo -j ACCEPT
-A OUTPUT -d 10.10.20.1 -j ACCEPT
-A OUTPUT -m conntrack --ctstate RELATED,ESTABLISHED -j ACCEPT
COMMIT
ollama.netfilter Netfilter ruleset for the firejail profile.

A note on self-hosted isolated AI models

ollama
firejail
selfhosted

Next: Note on self-hosted AI integration

It's always better to have a machine that is completely disconnected from the internet when you work on sensitive data. Configuring a firewall ruleset or a virtual machine with proper passthrough settings is also a good option. But if it's not top secret intelligence data, there is a simpler option with an acceptable overhead and privacy level: firejail with a shared network namespace.

Quick example: firejail with network disabled

The Ollama sandbox and the client share a network namespace with networking disabled, so they can only talk to each other:
# Let ollama save models into our home directory
export OLLAMA_MODELS=/path/to/our/ml-models/ollama/

# Installing models (if needed):
path/to/ollama pull llava3  # Or llama3.2-vision, gemma3, etc

# Start ollama in a sandbox with a custom sandbox name "ollama"
firejail --noprofile --net=none --name=ollama path/to/ollama serve

# Join the existing sandbox and send a command to ollama
firejail --noprofile --net=none --join=ollama \
  curl -X POST http://localhost:11434/api/chat \
  -H "Content-Type: application/json" \
  -d '{...}'
ollama-run.txt Installing ollama, no network restrictions here.

A note on deobfuscation

java
android
reverse engineering

Doing an autopsy on the MAX messenger is a kind of special thing now, and while the messenger itself may be boring and unoriginal, the process can still be educational.

Comparing two versions

The defpackage folder produced by jadx still contains small files with random garbage names, and you typically cannot do much about that. Identifying files or classes that are identical except for their names is simple, but the sheer volume makes a manual or semi-automated approach (diffing every pair of files) impractical. My approach, while still straightforward, was quite effective:

  1. Load all files from both versions into memory: it was still less than a few gigabytes. This step can be skipped if you have a fast enough SSD.
  2. Calculate a locality sensitive hash for each file and group candidates by hash similarity.
  3. Use a more precise similarity metric to confirm the matches.

Comparing the two defpackage folders with ~30000 files now takes roughly a minute; without the LSH filtering and parallelization, the same task takes more than an hour.
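The steps above can be sketched like this. As an illustration I use a simple banded MinHash over character shingles as the locality sensitive hash and difflib as the precise confirmation metric; the shingle length, band/row counts and threshold are made-up values, not the real script's parameters:

```python
# Sketch of steps 2-3: banded MinHash as the LSH candidate filter,
# difflib.SequenceMatcher as the precise confirmation metric.
import difflib
import hashlib

def minhash_bands(text, bands=8, rows=4):
    """MinHash signature split into bands; files sharing any band
    become candidates for the precise comparison."""
    shingles = {text[i:i + 16] for i in range(max(len(text) - 15, 1))}
    sig = []
    for seed in range(bands * rows):  # one seeded hash per signature slot
        sig.append(min(
            int.from_bytes(
                hashlib.blake2b(s.encode(), digest_size=8,
                                salt=seed.to_bytes(2, "big")).digest(),
                "big")
            for s in shingles))
    return [tuple(sig[b * rows:(b + 1) * rows]) for b in range(bands)]

def similar_pairs(files, threshold=0.9):
    """files: {name: content}. Returns confirmed near-duplicate pairs."""
    buckets = {}
    for name, text in files.items():
        for band in minhash_bands(text):
            buckets.setdefault(band, []).append(name)
    candidates = {(a, b)
                  for names in buckets.values()
                  for i, a in enumerate(names) for b in names[i + 1:]}
    # Step 3: confirm candidates with the precise similarity metric
    return sorted((a, b) for a, b in candidates
                  if difflib.SequenceMatcher(
                      None, files[a], files[b]).ratio() >= threshold)

demo = similar_pairs({
    "ab1.java": "public final class a { private final int b = 7; }",
    "xy2.java": "public final class a { private final int b = 7; }",
    "zz3.java": "interface Totally { void different(); }",
})
```

The point of the banding is that only files landing in the same bucket ever reach the expensive pairwise comparison, which is what turns hours into a minute.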

Hints on real class names

Obfuscators can shuffle the code and randomize the names, yet subtle clues remain that reveal the true class and field names:

public final String toString() {
    return "NetworkState(isConnected=" + this.a
        + ", isValidated=" + this.b
        + ", isMetered=" + this.c
        + ", isNotRoaming=" + this.d + ')';
}
toString method of the class "xn9" of Max 25.8.1

So "xn9.java" becomes "NetworkState.java", "a" becomes "isConnected", and so on. And don't forget about name collisions when classes from different packages end up in the same defpackage.
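Recovering names from such toString methods is easy to automate. A minimal sketch with regexes, tuned only for this Kotlin-data-class-style toString shape (the patterns and function are mine, not from any published tool):

```python
# Sketch: recover the real class and field names from a decompiled
# toString(), like the NetworkState example above.
import re

def recover_names(java_source):
    """Return (real class name, {obfuscated field: printed name})."""
    class_match = re.search(r'return "(\w+)\(', java_source)
    mapping = {}
    # Each '<label>=" + this.<field>' pair reveals one real field name
    for label, field in re.findall(r'(\w+)=" \+ this\.(\w+)', java_source):
        mapping[field] = label
    return (class_match.group(1) if class_match else None), mapping

src = """public final String toString() { return "NetworkState(isConnected=" + this.a + ", isValidated=" + this.b + ", isMetered=" + this.c + ", isNotRoaming=" + this.d + ')'; }"""
name, fields = recover_names(src)
print(name, fields)
```

Applied to the xn9 example this yields NetworkState and the a→isConnected, b→isValidated, c→isMetered, d→isNotRoaming mapping; real code would of course need more patterns for other toString shapes.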

Newer Python and yt-dlp for the Sailfish OS

Sailfish OS
Python
yt-dlp
Jolla

I moved to the Redeer C2, a community edition device with Sailfish OS. Not 100% voluntarily, but not against my will either: my phone's touch screen just broke, and I needed a replacement.

I still think that Sailfish OS is a nice system, and its user interface is the only mobile Linux UI that is consumer-ready. Unfortunately, the system lacks many common applications, such as a video player (the music player exists and is okay for my needs) or a YouTube client. So my plan was to download videos with yt-dlp and view them locally.

Part 1: yt-dlp Python

What could be simpler than downloading yt-dlp from GitHub or installing it from pip? But my phone only had Python 3.8 (python3-base-3.8.18), and yt-dlp dropped support for it almost a year ago (#1132).

I was not looking for an easy way, so I decided to cross-compile Python myself. But I was not looking for a dirty way either, so I decided to create an RPM package that could be installed independently of the original Python and removed later if needed.

Building and packaging

Let's assume you have already followed the Sailfish SDK installation instructions and have sfdk installed. Make sure to do everything within the SDK workspace, because the Sailfish SDK uses a VirtualBox VM or a Docker container as a build host and mounts the workspace directory there.

~/devel/sailfish $ sfdk tools list
SailfishOS-5.0.0.62 sdk-provided,latest
├── SailfishOS-5.0.0.62-aarch64 sdk-provided,latest
├── SailfishOS-5.0.0.62-armv7hl sdk-provided,latest
└── SailfishOS-5.0.0.62-i486 sdk-provided,latest
~/devel/sailfish $ sfdk config --push target SailfishOS-5.0.0.62-aarch64
~/devel/sailfish $ curl https://www.python.org/ftp/python/3.13.6/Python-3.13.6.tgz -O
~/devel/sailfish $ curl https://lanseg.github.io/2025-08-07/Python-3.13.6.spec -O
~/devel/sailfish $ sfdk --specfile ./Python-3.13.6.spec build --prepare
... Lots of output ...
~/devel/sailfish $ md5sum RPMS/*
4465b8ccfcd11838c93a2671d645f5e7  RPMS/Python-3.13.6-1.aarch64.rpm
ab2ace4188f8e5fe5d0d9a6406392abb  RPMS/Python-debuginfo-3.13.6-1.aarch64.rpm
b73d814fec8e3cb3af708126f5220534  RPMS/Python-debugsource-3.13.6-1.aarch64.rpm

That should be enough: a somewhat optimized, somewhat stripped, but complete Python distribution, ready for common tasks including downloading YouTube videos with yt-dlp.

Name: Python
Summary: Version 3.13 of the python interpreter
Version: 3.13.6
Release: 1
License: Python-2.0.1
Source0: %{name}-%{version}.tgz
URL: https://www.python.org
Requires: openssl readline libuuid xz
BuildRequires: openssl-devel readline-devel libuuid-devel xz-devel

%description
Python is an interpreted, interactive, object-oriented programming language.
This package contains the interpreter and most of the standard Python modules.

%prep
tar -xvf %{name}-%{version}.tgz

%build
mkdir build
cd build
../%{name}-%{version}/configure --prefix=/usr/local --enable-optimizations --with-openssl=/usr/
make %{?_smp_mflags}

%install
cd build
DESTDIR=$RPM_BUILD_ROOT make install

%files
%defattr(-,root,root,-)
/usr/local/
Python-3.13.6.spec The RPM spec that unpacks, compiles, and packages the distribution.

Part 2: yt-dlp

With the latest stable Python in place, installing yt-dlp from pip is enough:

defaultuser $ python3 -m pip install -U "yt-dlp[default]"

Since pip works fine here, there is no need to create a separate RPM package for yt-dlp itself.

Part 3: VLC

Next time - VLC, probably. It works, but requires some styling to look like other Sailfish apps.

Terminal with a long error log and vlc window without video
Terminal where I started VLC with lots of error logs and broken app window

Exploring MAX messenger

messengers
android
reverse engineering
MAX

MAX messenger is a government-forced Russian messenger that is planned to replace WhatsApp, Facebook Messenger and other Western propaganda-spreading machines. And Telegram too. While there is nothing interesting about the interface, there could be something interesting inside, so I installed the app on my Android emulator and started to explore it.

TL;DR

Nothing unexpected: yet another messenger with an unclear future. Not a KGB trojan surveillance app, but just as private as any other messenger that is not focused on protecting your data. It also needs a working phone number to register, so you can forget about anonymity.

The only somewhat interesting thing is that it has the TamTam messenger inside, so it's probably based on the TamTam codebase.

Package details

Version information
Origin RuStore
Package ru.oneme.app
Version 25.7.1
Package content
md5:d88c78a92d75f0319af1a95b59e3867e 23M base.apk
md5:7ef573467d338b6411ceb80cd278f9cb 2.8M split_config.mdpi.apk
md5:0d978dc58e071136cddc8a815311154f 209K split_config.ru.apk
md5:b3c75d266a1f1a55833cb9f052b9e075 26M split_config.x86_64.apk

Permissions

Like any other messenger, it can read and write media and storage, use the camera and microphone, prevent device lock, use vibration, show full screen intents, and update app badges and settings on different Android-based platforms.

I compared Max 25.7.1, Signal 7.45.3, Telegram 11.13.3, Threema 6.1.1, WeChat 8.0.68 and WhatsApp 2.25.21.4; the detailed comparison table is here: comparison.txt.

Dependencies

The messenger uses open source libraries, some of which are well known and widely used in the Java world (Apache Commons, Apache HTTP, FasterXML, org/JSON, LZ4-java, OkHttp3, WebRTC, etc). The list below contains the more specific or less famous ones:

Library Description Sources
Odnoklassniki (ru/ok)
android A somewhat lower-level code (api, http, compression, etc) N/A
messages Reused code from OK.ru messenger
onechat Utility classes for the reactions view
tamtam Some Russian messenger
tracer OK-Tech service for profiling and failure reporting, closed source.
util LZ4 compression support
Analytics
tracker.my.com Ads and analytics framework GitHub
Facebook
fresco System for displaying images in Android GitHub
System and GUI libraries
BoltsFramework Somewhat low-level async task management GitHub
Conductor BlueLine Labs' Framework for building View-based Android applications. GitHub
GPUImageNativeLibrary CATS OSS Something as similar to iOS GPUImage as possible. GitHub
FastScroll FutureMind's Scroll and section indexer for recycler view. GitHub
ProcessPhoenix Simplifies restarting application process GitHub
ShortcutBadger Show the count of unread messages as a badge on the app shortcut GitHub
libphonenumber Android port of Google's libphonenumber. GitHub
Common libraries
MessagePack Binary serialization format (like Google's protobuf) GitHub
ReactiveX Reactive programming for Java GitHub

Used service web links

Debugging

There is not much more I can say without debugging, so I will need to set up this messenger on my phone to try it out.

Making OpenConnect work on Sailfish OS, Part 2: Workaround

OpenConnect
Sailfish OS
VPN

At the very least, I need to make a note on how to use openconnect even without a GUI. Sailfish ships an openconnect client by default, but it doesn't include the default vpnc scripts:

# openconnect https://somehost.com/?somekey
...
/bin/sh: /etc/openconnect/vpnc-script: not found
Script '/etc/openconnect/vpnc-script' returned error 127
/bin/sh: /etc/openconnect/vpnc-script: not found
Script '/etc/openconnect/vpnc-script' returned error 127
...

But the ones from the OpenConnect git repository work well, and copying them to the /etc/openconnect directory solves the issue. Of course, if you don't want to put such files into system directories, you can set the script location explicitly:

# openconnect -s /path/to/scripts/vpnc-script https://somehost.com/?somekey

The good thing is that it works by itself; the bad thing is that it is not integrated with the UI, so I had to try other options.

Automatic cookie

I expected that it would fetch the cookie and certificate from the stored credentials, but it didn't work: I kept getting those getaddrinfo error messages.

Manual cookie

If connman cannot fetch the cookie, I should do it myself:

# openconnect --authenticate "https://somevpn.somedomain.com/?somesecretkey"
POST https://somevpn.somedomain.com/?somesecretkey
Connected to 23.32.165.3:443
SSL negotiation with somevpn.somedomain.com
Connected to HTTPS on somevpn.somedomain.com with ciphersuite TLSv1.2-ECKGB-ECFSB-AESMI6-WTF-OMG384
XML POST enabled
Please enter your username.
Username: POST https://somevpn.somedomain.com/auth
Please enter your password.
Password: POST https://somevpn.somedomain.com/auth
COOKIE='openconnect_strapkey=SGV5ISBBcmVudCB5b3Ugc25lYWt5IGJhc3RhcmQ/IEhlbGxvIHRoZXJlCg==; webvpn=cHdnZW4gLTEgb3Igc29tZXRoaW5nCg=='
HOST='147.237.7.27'
CONNECT_URL='https://somevpn.somedomain.com/'
FINGERPRINT='pin-sha256:WWVzLiBBIGZpbmdlcnByaW50Cg=='

And it worked! Just pay attention when copy-pasting text from fingerterm, as it adds spaces where the text was wrapped at the edge of the screen.

Sailfish OS OpenConnect configuration with cookie and certificate fingerprint
Sailfish OS OpenConnect configuration

Next time I'll try to automate the cookie update.

Making OpenConnect work on Sailfish OS

OpenConnect
Sailfish OS
VPN
connman

I'm one of the rare owners of a Sailfish OS device. I had an original Jolla, a Sony Xperia device with Sailfish, and now I use the Redeer C2. For me it is the only end-user-ready mobile Linux OS, but one thing kept bothering me: the OpenConnect (ocserv) connection to my home network.

The problem was unclear: the VPN connection kept flashing while trying to connect, but each attempt ended with a "Connection problem" error. No notifications, no detailed error messages. What is wrong with you? It worked on Android with the AnyConnect client, and it worked on generic Linux with the openconnect terminal client. The terminal client even worked on the same Sailfish OS device. What could be wrong?

Ok. Let's look in the logs:

[root@JollaC2 defaultuser]# journalctl -r | grep -i vpn
...
JollaC2 connman-vpnd[2317]: Failed to open HTTPS connection to my-vpn.somevds.ch/?somesecretkey
JollaC2 lipstick[2684]: [D] unknown:0 - VPN connection property changed: "State" QVariant(QString, "configuration") "/net/connman/vpn/connection/https___my_vpn_somevds_ch__somesecretkey_Sailfish OS_org" "Home"
JollaC2 connman-vpnd[2317]: getaddrinfo failed for host 'my-vpn.somevds.ch/?somesecretkey': Name or service not known
JollaC2 connman-vpnd[2317]: POST https://my-vpn.somevds.ch/?somesecretkey
...

The culprit! connman treats my whole URL, path and parameter included, as a hostname. A bit weird, because my OpenConnect server uses camouflage mode and I don't want to disable it, but connman doesn't know how to handle it. I'll try to find a workaround next time.

Configuring Green.ch DHCP on an OpenWRT router

OpenWRT
green.ch
DHCP

I bought a Banana Pi BPI-R3 router and installed the OpenWRT firmware, but couldn't get an IP address from my provider, green.ch. I tried using the same MAC address as the old router, but it didn't work, so I had to dig deeper. What is the simplest way to debug such an issue? Comparing packet dumps, of course. So I connected the WAN port of the working router to my laptop and looked at the DHCP packets, then did the same for my non-working router and saw a difference:

Wireshark output dump, working DHCP request
Working DHCP request
Wireshark output dump, DHCP request that was ignored by provider
Ignored DHCP request

So, the only difference was that the provider expected the packet to be sent from an 802.1Q VLAN with ID 10. I added it to the network config (/etc/config/network) and everything worked:

config device
    option type '8021q'
    option ifname 'wan'
    option vid '10'
    option name 'vlan10'

config interface 'wan'
    option proto 'dhcp'
    option device 'vlan10'
    option hostname '*'