## The Balanced Partition Problem

March 7, 2011

The Balanced Partition Problem

The other night, at about 11pm, a friend of mine decided to start asking me for help on his homework that was due that night.  I normally just try to provide general hints and not do the problems myself (it’s his homework), when I see an interesting problem I can play around with in Python, it’s tough to ignore.

The problem was essentially something like: ‘you have N buckets which each have a different weight.  give an algorithm that separates them into two piles as close to the same weight as possible’.  Which is one way to state ‘The Balanced Partition Problem’, and Python is a great language for quickly throwing together and testing algorithms.

First things first: build the different sized ‘buckets’:

buckets = []
for x in xrange(0,40):
y = random.randint(2,500)
while y in buckets:
y = random.randint(2,500)
buckets.append(y)

Attempt #1:

The first idea that occurred to me was to solve it like you would a simple root solver.  This was the code I came up with:

t = time.time()
# method 1
suba = list(buckets)
subb = list()
suma = sum(suba)
sumb = sum(subb)
diff = abs(suma-sumb)

it = 0
while min(suba)*2 < diff:
(v,i) = min(izip((abs(x*2-diff) for x in suba) ,count()))
v = suba[i]
suba.remove(v)
subb.append(v)
suma -= v
sumb += v
diff = suma-sumb

it += 1

print 'meth1: {0}s'.format(time.time()-t)
print '{0} iterations.'.format(it)
print 'suma: {0}\nsumb: {1}'.format(suma,sumb)
print ' diff: {0}, min ele: {1}'.format(abs(suma-sumb),min(buckets))

The code starts with all buckets in ‘set a’, and each iteration it looks for the element in ‘set a’ that will come closest to reducing the difference to zero and moving it to ‘set b’.  This stops when the difference is too small to be fixed by the smallest element in ‘set a’.

This algorithm is very fast (linear), and gives two sets that are pretty close to equal. Unfortunately this doesn’t always produce the best result, as there may be cases where the difference between a single element in one set and a subset of elements in the other set is smaller than the smallest element in either set.  This method doesn’t account for these different types of combinations, but for a simple and (very) quick method, it gets pretty close to equal.

Attempt #2: Brute Force

How close is method 1?  Well, I guess we have to find the best possible solution to figure that out, and if you can’t think of a clever way to do it, there is always the brute force method.  Since order doesn’t matter in a sum, we just need to iterate over every possible combination of subsets and pick the one which has the sum closest to half that of the sum of the full set. Since we’re using Python, we can use the combinations function (since 2.6), which will improve our speed a bit:

t = time.time()
##### method 2
best = sum(buckets)
target = best/2
for i in xrange(0, len(buckets)):
for s in combinations(buckets,i):
x = sum(s)
if abs(x-target) < abs(best-target):
best = x
bests = s

suba = sorted(bests)
subb = sorted(set(buckets)-set(bests))
suma = sum(suba)
sumb = sum(subb)
print 'meth2: {0}s'.format(time.time()-t)
print 'suba: {0}\nsubb: {1}\nsuma: {2}\nsumb: {3}'.format(sorted(suba),sorted(subb),suma,sumb)
print ' diff: {0}, min ele: {1}'.format(abs(suma-sumb),min(buckets))

Now I was originally using a set size of 20, which this method does in a couple seconds, but because of the complexity of the combination function, bumping that size up to just 30 takes method 2 over 12 minutes on my desktop (method 1 takes 1.5e-5s). Making the size any larger makes this method unfeasible.

Attempt #3: A Better Solution

Before asking me for help, my friend found this clever solution to the problem that’s much better than the brute force one:

http://people.csail.mit.edu/bdean/6.046/dp/dp_4.swf

Pretty nice, but then what do you expect from MIT?

Looking at his method, he says that checking $P(i,j)$ for all possible $i,j \in (0..n,0..nk)$ will take $O(n^2k)$ time.  This is true, but why check them all?  From the recursive equation he gives:

$P(i,j) = \max(P(i-1,j), P(i-1,j-A_i))$

and the initial condition:

$P(0, 0) = 1, P(0, j\neq0) = 0$

The recursive formula can be used to generate all the cases where $P(i,j) = 1$.  Additionally, because I am using Python, I decided to make P a dictionary (actually a dictionary of dictionaries) to keep track of the subsets.  Using this method, we need to generate considerably less than $n^2k$ subsets.  The code is:

t = time.time()
##### method 3
best = sum(buckets)
target = best/2
bests = ()
P = {}
P[-1] = {0:()}
for i in xrange(len(buckets)):
x = buckets[i]
P[i] = {}
for j in P[i-1].keys():
P[i][j] = P[i-1][j]
P[i][j+x] = P[i-1][j] + (x,)

if abs((j+x)-target)<abs(best-target):
best = j+x
bests = P[i][j+x]

suba = sorted(bests)
subb = sorted(set(buckets)-set(suba))
suma = sum(suba)
sumb = sum(subb)
print 'meth3: {0}s'.format(time.time()-t)
print 'suma: {0}\nsumb: {1}'.format(suma,sumb)
print ' diff: {0}, min ele: {1}'.format(abs(suma-sumb),min(buckets))

This method yields the best answer very fast, and the python code could probably be optimized to improve the speed.  On my desktop it takes ~0.1s for 30 ‘buckets’ (vs. 12+ minutes for method 2), and still only ~20s for 200 ‘buckets’. It is interesting to note that method 1 produces very close to the best result (usually single different differences for 200 buckets, if not exactly zero) and still takes 5e-3s for 200 buckets, so for an approximation it is quite good.

The final improvement can be made by realizing that since your target is half the sum of the full set, then one of the subsets must be equal to or less than this target, so we can skip computing and checking sums over this value.  This two line change cuts the execution time in half:

t = time.time()
##### method 3
best = sum(buckets)
target = best/2
bests = ()
P = {}
P[-1] = {0:()}
for i in xrange(len(buckets)):
x = buckets[i]
P[i] = {}
for j in P[i-1].keys():
P[i][j] = P[i-1][j]
if j+x > target:
continue
P[i][j+x] = P[i-1][j] + (x,)

if abs((j+x)-target)<abs(best-target):
best = j+x
bests = P[i][j+x]

suba = sorted(bests)
subb = sorted(set(buckets)-set(suba))
suma = sum(suba)
sumb = sum(subb)
print 'meth3: {0}s'.format(time.time()-t)
print 'suma: {0}\nsumb: {1}'.format(suma,sumb)
print ' diff: {0}, min ele: {1}'.format(abs(suma-sumb),min(buckets))

Ok, I know I said the final improvement, but I thought to myself, ‘what if I check if the target is reached early, then bail out’. This requires a little refactoring of the code into a function so that I can use return to break out of it early:

t = time.time()
##### method 3
best = sum(buckets)
bests = ()
def m3():
global best, bests
target = best/2
P = {}
P[-1] = {0:()}
for i in xrange(len(buckets)):
x = buckets[i]
P[i] = {}
for j in P[i-1].keys():
P[i][j] = P[i-1][j]
if j+x > target:
continue
P[i][j+x] = P[i-1][j] + (x,)

if abs((j+x)-target)<abs(best-target):
best = j+x
bests = P[i][j+x]
if best == target:
return

m3()

suba = sorted(bests)
subb = sorted(set(buckets)-set(suba))
suma = sum(suba)
sumb = sum(subb)
print 'meth3: {0}s'.format(time.time()-t)
print 'suma: {0}\nsumb: {1}'.format(suma,sumb)
print ' diff: {0}, min ele: {1}'.format(abs(suma-sumb),min(buckets))

As it happens, after running several tests, it seems that the function reaches the target early quite often, and this change cuts the runtime, on average, down to $1/3$ of what it was.

Thats all I’ve done with it so far.  It can probably be improved further, but for a just-for-fun little exercise with Python, it’s pretty interesting (to me at least).

## ecryptfs automount a (somewhat) easier way

May 3, 2010

So I’ve been messing with transparent file encryption lately because it’s a good way to easily add some protection to private files, and if you use an online backup service or cart around a laptop you’d be crazy not to encrypt your files.

For laptops, there are tons of options, and I’ve previously used full disk encryption via TrueCrypt on windows and dm-crypt on Ubuntu (via alternate install cd).

For online backup services like Ubuntu One/Wuala/Dropbox/etc, these aren’t as useful as per-file encryption. The main reason is that the encrypted disk is actually a single file, and any change to any file on the encrypted disk requires the entire block file to be re-uploaded to the online service.

Fortunately there are some very good (and open source) per-file encryption file systems. With these systems, when one file is changed, only that file needs to be re-transmitted to the backup service. The two I’ve tried are ecryptfs and encfs.

encfs:

encfs is an extremely simple to use transparent per-file encrypted (FUSE) filesystem. The website has more than enough information to get it setup.  There are also plenty of guides on using it, including this clever one that uses the GNOME keyring.

ecryptfs:

ecryptfs is effectively the same as encfs with a different implementation.  The most important difference (imo) is that ecryptfs is a kernel mode filesystem, where encfs is a FUSE filesystem.

The Goal:

I wanted to put some documents on Ubuntu One as an off-site backup and to make them accessible from a laptop on campus if needed. Additionally, I would like to have a separately encrypted directory (or mountpoint) that is not sync’d online. This will be used for application data folders that I’d like to encrypt (like ~/.mozilla).

Solution:

Using ubuntu, the first part is extremely easy.  I just followed the Ubuntu Guide, then created a symlink that put the encrypted data in the ‘Ubuntu One’ directory:

ln -s ~/Ubuntu\ One\private ~/.Private

The second part is a little more work.  Ubuntu’s automatic setup scripts are great for creating a single encrypted mountpoint, but the second one has to be created and setup manually.

This is easy with encfs, but I chose to do it with ecryptfs because:

• ecryptfs is already built into the kernel and the utilities are already installed (I’m already using it for ~/Private)
• encfs is a FUSE model, so I expect ecryptfs performance to be a little better.
• I wanted to use the same directory for both encrypted and unencrypted data (overlay) which encfs doesn’t seem to support.

I liked how the blog post that I keep linking used python+gnome keyring to implement automount, so I wanted to try and do the same with ecryptfs.

There are a few changes I had to make to the encfs guide to make it work with ecryptfs:

• Add an fstab entry to allow non-interactive, non-root mounting of the directory
• use pexpect module instead of subprocess module in automount script

Setup:

First things first, create the encrypted mountpoint:

mkdir ~/.appdata
sudo mount -t ecryptfs ~/.appdata ~/.appdata
Passphrase: <enter passphrase>

Then select appropriate options or use defaults (the only one I change is to turn on filename encryption).

Now, to make it mountable as non-root, I add an entry in fstab.  An easy way to see what should go in fstab, is to check mtab:

cat /etc/mtab |grep ecryptfs

Now I copy this to fstab and add the following options:

• users,noauto – standard mount options
• key=passphrase – not sure if this is necessary, might be the default
• no_sig_cache – don’t check the sig cache (needed for non-interactive mode)
• ecryptfs_passthrough=n – don’t know why this isn’t in mtab, not including it makes it prompt

/home/bjp/.appdata /home/bjp/.appdata ecryptfs users,noauto,rw,key=passphrase,no_sig_cache,ecryptfs_unlink_sigs,ecryptfs_fnek_sig=####,ecryptfs_key_bytes=16,ecryptfs_cipher=aes,ecryptfs_sig=#####,ecryptfs_passthrough=n 0 0

NOTE: Because of a bug with ecryptfs, you have to setuid mount.ecryptfs to do this as a normal user:

sudo chmod +s /sbin/mount.ecryptfs

Now we should be able to mount it as a user with “mount ~/.appdata” (after sudo umount ~/.appdata). To make it automount, we will modify the encfs python script.  I tried for a couple hours to try and get it to work with the subprocess module, but for some reason I couldn’t get ecryptfs to take the password from stdin.  I don’t know if it’s a bug or what, but none of the options (even the passwd=stdin option) would make it accept a passphrase from stdin.

To get around this, I used the pexpect module.

Also, don’t forget to add the passphrase to the gnome keyring (I used lp:gkeyring):

python gkeyring.py --set -n "Ecryptfs Passphrase" -p ecryptfs=appdata --keyring login

The modified script is:

#!/usr/bin/python

import os.path
import subprocess
import sys
import gtk
import pexpect
import gnomekeyring as gk

# paths constants:
PATH_FSTAB = os.path.expanduser("~/.appdata")

# get passphrase from keyring:
try:
items = gk.find_items_sync(gk.ITEM_GENERIC_SECRET, {"ecryptfs": "appdata"})
item = items[0] # clean up your keyring if this list has multiple items
except gk.NoMatchError:
print("no entry in keyring")
sys.exit(1)

# run encfs:
cmd = ['mount %s' % PATH_FSTAB]
try:
p = pexpect.spawn(cmd[0])
#p.logfile = sys.stdout
p.expect('Passphrase:')
p.sendline(item.secret)
p.expect(pexpect.EOF)
p.close()

except:
print 'Error mounting %s.' % cmd[0]
sys.exit(1)

print 'Mount successful'

NOTES:

• Using the same directory for both ‘source’ and ‘destination’ isn’t really necessary, and if you want access to the encrypted data (like for an online service) you should use different directories.
• Using a generated key (from ecryptfs-manager) is much more secure than using a passphrase, but if you’re going to save it on the hdd (kinda necessary for non-interactive mode), than it’s not good for laptops (whats the point if they steal the key with the ecrypted files) but fine for online services… as long as you don’t upload your key.

## A Blog?

February 7, 2010

So apparently I’ve jumped on the blogging bandwagon.

Not really.  I wanted a place to post random info on code/utilities/howtos that I find useful so that I don’t forget them the every next day, and I figured a blog is a good way to make the information accessible and somewhat persistent.

I’m not expecting anyone else to read this blog (i’m not that interesting), so if you’re not me, you’ve probably taken a wrong turn somewhere.