elective-stereophonic
elective-stereophonic
Arithmetic coders are cool  
Please login or register.

Login with username, password and session length
Advanced search  

News:

Latest Nxt Client: Nxt 1.11.15

Author Topic: Arithmetic coders are cool  (Read 2871 times)

jl777

  • Hero Member
  • *****
  • Karma: +718/-123
  • Offline Offline
  • Posts: 6170
    • View Profile
Arithmetic coders are cool
« on: March 16, 2015, 08:31:35 pm »

I needed to increase entropy of the packets before encrypting and I dont like zlib simply due to its size and complexity.
So I wrote a huffman coder, even made it automatically find the best dictionary and got it compressing 10% better than zip.

But then CfB mentioned "arithmetic coding" and when I realized it solved the granularity problem of huffman codes (I knew this before just didnt remember to think about it) http://www.hpl.hp.com/techreports/2004/HPL-2004-76.pdf page 6

well I was 99% done with debugging the recursive huffman coding, but I could no longer feel good about it. So I got a reference arithmetic coder and started fiddling with it. After removing all the distracting comments and simplifying the code, it ended up being 220 lines and uses <10kb of memory. Unfortunately it is quite particular in the size of the integers it uses and only works with a 16bit  raw probability.

In addition to being much smaller and theoretically more efficient I also got it to do a form of encryption simply by seeding the initial probability tables. Since it is an adaptive coder any difference from initial values will totally mess up the output, which is bad if you are trying to make a codec, but it gave me the idea to seed it with an sha256 hash, eg. 1 bit for each byte.

After I got that to work, I searched the internet and found: ftp://intranet.dei.polimi.it/users/Carlo.Piccardi/VarieCda/ArticoliStudenti/i15.pdf

so it seems to be something that is somewhat known, but relatively new. The important aspect appears to be:
"The algorithm encrypts the
same sequence of symbols differently each time (and also gives
different degrees of compression). This dynamic nature of the
algorithm makes it immune to known plaintext attack, chosen
plaintext attack and meet in the middle attack."

After reading this I added a timestamp to the seed for the sha256 hash, so every second the output will be totally different. Now this is not meant to be any standalone encryption, just a preprocessor to the real encryption. So making it immune to plaintext attacks is the goal

James
Logged
There are over 1000 people in SuperNET slack! http://slackinvite.supernet.org/ automatically sends you an invite

I am just a simple C programmer

jl777

  • Hero Member
  • *****
  • Karma: +718/-123
  • Offline Offline
  • Posts: 6170
    • View Profile
Re: Arithmetic coders are cool
« Reply #1 on: March 16, 2015, 08:49:27 pm »

http://michael.dipperstein.com/arithmetic/ this is what I based my coder on

some useful stuff on arithmetic coders:

http://www.arturocampos.com/ac_arithmetic.html

http://www.sable.mcgill.ca/~ebodde/pubs/sable-tr-2007-5.pdf

http://marknelson.us/1991/02/01/arithmetic-coding-statistical-modeling-data-compression/#part2

About huffman codes:
The problem with this scheme lies in the fact that Huffman codes have to be an integral number of bits long. For example, if the probability of a character is 1/3, the optimum number of bits to code that character is around 1.6. The Huffman coding scheme has to assign either 1 or 2 bits to the code, and either choice leads to a longer compressed message than is theoretically possible.

This non-optimal coding becomes a noticeable problem when the probability of a character becomes very high. If a statistical method can be developed that can assign a 90% probability to a given character, the optimal code size would be 0.15 bits. The Huffman coding system would probably assign a 1 bit code to the symbol, which is 6 times longer than is necessary.
Logged
There are over 1000 people in SuperNET slack! http://slackinvite.supernet.org/ automatically sends you an invite

I am just a simple C programmer

Come-from-Beyond

  • Hero Member
  • *****
  • Karma: +794/-671
  • Offline Offline
  • Posts: 4013
    • View Profile
Re: Arithmetic coders are cool
« Reply #2 on: March 16, 2015, 09:39:32 pm »

This non-optimal coding becomes a noticeable problem when the probability of a character becomes very high. If a statistical method can be developed that can assign a 90% probability to a given character, the optimal code size would be 0.15 bits. The Huffman coding system would probably assign a 1 bit code to the symbol, which is 6 times longer than is necessary.

Would using fractional radix help you?
Logged

jl777

  • Hero Member
  • *****
  • Karma: +718/-123
  • Offline Offline
  • Posts: 6170
    • View Profile
Re: Arithmetic coders are cool
« Reply #3 on: March 16, 2015, 09:48:17 pm »

This non-optimal coding becomes a noticeable problem when the probability of a character becomes very high. If a statistical method can be developed that can assign a 90% probability to a given character, the optimal code size would be 0.15 bits. The Huffman coding system would probably assign a 1 bit code to the symbol, which is 6 times longer than is necessary.

Would using fractional radix help you?
dont see how to do that with huffman output bits.
arithmetic coder solves the problem as it allows using fractional bits
really it is a simple idea of just mapping a continuous range of numbers to each input symbol and this range certainly doesnt have to use up half the space
Logged
There are over 1000 people in SuperNET slack! http://slackinvite.supernet.org/ automatically sends you an invite

I am just a simple C programmer

Come-from-Beyond

  • Hero Member
  • *****
  • Karma: +794/-671
  • Offline Offline
  • Posts: 4013
    • View Profile
Re: Arithmetic coders are cool
« Reply #4 on: March 16, 2015, 09:54:13 pm »

dont see how to do that with huffman output bits.

You assume that input characters are 16-bit ones. If you looked at the input as at a very big number and started moduloing it by, say, 777 then you would get a stream of 9.6-bit characters...

PS: I believe this is enough to push your imagination.
Logged

Daedelus

  • Hero Member
  • *****
  • Karma: +230/-12
  • Offline Offline
  • Posts: 3280
    • View Profile
Re: Arithmetic coders are cool
« Reply #5 on: March 16, 2015, 10:06:47 pm »

...moduloing it by...

Sometimes I think you guys are just trying to mess with us..
Logged
NXT: NXT-4CS7-S4N5-PTH5-A8R2Q

jl777

  • Hero Member
  • *****
  • Karma: +718/-123
  • Offline Offline
  • Posts: 6170
    • View Profile
Re: Arithmetic coders are cool
« Reply #6 on: March 17, 2015, 01:23:31 am »

...moduloing it by...

Sometimes I think you guys are just trying to mess with us..
not at all. the world is not binary, nor is it trinary. So that means there is no surprise things can be represented using any number (even fractional) of bits, for input or output

James
Logged
There are over 1000 people in SuperNET slack! http://slackinvite.supernet.org/ automatically sends you an invite

I am just a simple C programmer

jl777

  • Hero Member
  • *****
  • Karma: +718/-123
  • Offline Offline
  • Posts: 6170
    • View Profile
Re: Arithmetic coders are cool
« Reply #7 on: March 17, 2015, 05:32:14 pm »

Finally got a two stage encryption working which has some very nice properties. The following data is from sending "hello world" twice and "iello world" twice. I am using the unixtime as a plaintext nonce and automatically delay an identical packet from going out the same second. This assures that the same seeds wont be used.

the timestamp acts as a nonce and it is hashed with other unique values to create onetime values:
a) The arithmetic coder's probability tables are seeded with SHA256(timestamp + shared curve25519 secret)
b) A onetime XOR pad  http://en.wikipedia.org/wiki/One-time_pad is created using HASH(SHA256(nxt address + timestamp) + 384bit shared secret)

"In cryptography, a one-time pad (OTP) is an encryption technique that cannot be cracked if used correctly"
"Given perfect secrecy, in contrast to conventional symmetric encryption, OTP is immune even to brute-force attacks"

If you look at the output of the two stages, you will see that even with the identical string, the outputs values are dramatically different. Also just changing one bit "hello world" -> "iello world" also creates a large difference in output. The seeding of the arithmetic coder with a deterministic hash creates a form of encryption that is relatively new, but still unknown as to its strength. It might be possible to rely on just this, but its primary function is to prevent plaintext attacks by increasing the entropy level of the input data.

Now believe I am using the OTP method correctly and if I am correct then this is qualitatively better than conventional methods which are definitely threatened by Quantum Computers, or just a breakthrough in math:

"Conventional symmetric encryption algorithms use complex patterns of substitution and transpositions. For the best of these currently in use, it is not known whether there can be a cryptanalytic procedure which can reverse (or, usefully, partially reverse) these transformations without knowing the key used during encryption. Asymmetric encryption algorithms depend on mathematical problems that are thought to be difficult to solve, such as integer factorization and discrete logarithms. However there is no proof that these problems are hard, and a mathematical breakthrough could make existing systems vulnerable to attack."

"One-time pads are "information-theoretically secure" in that the encrypted message (i.e., the ciphertext) provides no information about the original message to a cryptanalyst (except the maximum possible length[15] of the message). This is a very strong notion of security first developed during WWII by Claude Shannon and proved, mathematically, to be true for the one-time pad by Shannon about the same time. His result was published in the Bell Labs Technical Journal in 1949.[16] Properly used one-time pads are secure in this sense even against adversaries with infinite computational power."

The maximum possible length of the message is 1392 bytes as it is limited by UDP packet size limits, but of course a message can be chained with multiple packets and now that I have the low level crypto primitives of hashing, tokenizing, authenticating, encrypting/decrypting, encoding/decoding debugged I can move to the next level of (more) secure key exchange and network defenses, along with validating high performance routing and generally getting the SuperNET network ready for production

The current key exchange uses the NXT curve25519 properties of shared secrets http://cr.yp.to/ecdh.html so its not like it is weak, but I want to make SuperNET privacy the best in the world

James

(6395019458037742701 9652907354821126831) packet.(hello world) timestamp.1426611462 [6f77206f6c6c6568] len.11 -> 9652907354821126831
set prevtimestamp.1426611462
06 5d 08 55 00 00 00 00 fb 81 26 18 7f 08 d8 02 c7 e9 5b ea 78 b0 88 e8 69 4b 93 fb 5d 0c cf 7a b5 5c 9b 1d 70 e7 37 06 97 ce ff a8 e4 81 32 05 77 2d 76 66 b7 6b da 01 00 00 00 00 00 60 41 b6 64 27 coding.66
06 5d 08 55 00 00 00 00 db 77 7e 75 50 f7 bb c5 9b f0 55 b3 0f 4a f3 f3 ff b2 29 0e 6b db 30 76 ef 18 da 21 cb 61 82 86 43 16 e9 12 e3 df ec 47 76 85 4f cd ae 30 44 19 e6 d1 57 2c dd 73 17 7a 9f 1d 9a 5c f7 1b 50 ef 9b cb e0 c7 b9 f7 b9 04 ac ab 58 6b 64 43 8c 5a 42 ba 2f 7d 81 84 b1 10 16 d3 02 65 ... encrypted.1392

START RECV leverage.5
(9652907354821126831 6395019458037742701) packet.(ENCRYPTED) timestamp.1426611462 [c5bbf750757e77db] len.1392 -> 9652907354821126831
AUTHENTICATED packet from 6395019458037742701 len.11 leverage.5 numrounds.3 | hit 93.45%
GOT PACKET! (hello world).11
. . . . (6395019458037742701 9652907354821126831) packet.(hello world) timestamp.1426611463 [6f77206f6c6c6568] len.11 -> 9652907354821126831
set prevtimestamp.1426611463
07 5d 08 55 00 00 00 00 16 e3 90 ac 0a 32 3a 3f 32 ea 76 9e ad af 8f 06 6f dd 60 c4 d4 9f 71 92 cd 12 2b 40 21 8d 38 e2 3f 81 56 e5 72 da ad 0d be ad 6c 96 fd 4c 8b 26 f9 07 00 00 00 00 00 18 ee dd 8a 2a 78 coding.69
07 5d 08 55 00 00 00 00 86 84 26 7b 86 42 52 7a 22 78 4d 9d 97 fe 7b 5e 74 cf a3 e0 7a 44 68 14 17 49 45 72 e5 81 9b f2 e2 3f 51 c4 40 28 eb d1 2e 15 ab aa e8 a3 3f c8 ba 96 8c 14 30 dc d2 1b e2 bc 22 53 06 ad e9 54 64 44 b4 49 1e d6 8e 96 e6 c0 48 29 10 6d 3e 85 51 40 69 65 2b 9c 02 22 7c 4c d7 4f ... encrypted.1392

START RECV leverage.5
(9652907354821126831 6395019458037742701) packet.(ENCRYPTED) timestamp.1426611463 [7a5242867b268486] len.1392 -> 9652907354821126831
AUTHENTICATED packet from 6395019458037742701 len.11 leverage.5 numrounds.3 | hit 17.25%
GOT PACKET! (hello world).11
(6395019458037742701 9652907354821126831) packet.(iello world) timestamp.1426611464 [6f77206f6c6c6569] len.11 -> 9652907354821126831
set prevtimestamp.1426611464
08 5d 08 55 00 00 00 00 3c 99 96 c2 3e 2a d7 1e 43 22 2e 77 aa 2b 65 09 b7 0f ec a4 3a a1 1e 9e 86 c4 c5 b8 39 94 2b b7 ac 89 3e 51 33 8a 0b 50 a5 8f 70 4c dc a7 55 c6 00 00 00 00 50 34 26 b6 90 coding.65
08 5d 08 55 00 00 00 00 14 9a 77 70 c9 5a 80 06 16 e5 f3 ad 05 88 ed 2f 0c d0 7e bb 79 3f 97 e4 82 d0 27 1c 95 9d cc d7 59 c9 ff b1 0b c5 01 ed 31 38 3d ce cc 54 9b 1d 46 37 1d f1 e8 d8 47 d3 dc d3 66 12 05 42 ab e7 cb ac 0e 5c 6c 75 47 00 be 61 cf 5f 85 2f 88 4f 8f 56 5a 9b 53 99 e9 18 34 fb a7 22 ... encrypted.1392

START RECV leverage.5
(9652907354821126831 6395019458037742701) packet.(ENCRYPTED) timestamp.1426611464 [6805ac970779a14] len.1392 -> 9652907354821126831
AUTHENTICATED packet from 6395019458037742701 len.11 leverage.5 numrounds.3 | hit 38.50%
GOT PACKET! (iello world).11
. . . . (6395019458037742701 9652907354821126831) packet.(iello world) timestamp.1426611465 [6f77206f6c6c6569] len.11 -> 9652907354821126831
set prevtimestamp.1426611465
09 5d 08 55 00 00 00 00 bf c5 64 25 a1 d8 ac 05 c5 42 ec 72 05 85 78 a0 de 1a de 5a 91 aa dd b3 ae a7 f6 bf 93 1a fd ec 85 7f 16 c3 80 79 aa 78 83 22 4b 2b af 13 83 45 0a 00 00 00 80 e3 db 71 f9 1a coding.66
09 5d 08 55 00 00 00 00 04 89 28 63 2d 92 77 98 63 b2 8f ec c1 64 d4 44 2e 0f 7d 9e fa 21 f2 01 36 88 d5 59 8d e7 0d 16 e8 8a 58 e9 35 be 5e 4c 1e 85 1c 80 16 58 d7 d7 65 50 34 be 6f 62 34 15 3f fd 3f 11 e3 2d 49 98 2c 10 e3 40 23 9f 6a 11 6f 7b e8 26 8c 73 3e 67 60 f1 96 ca 51 8c a2 b5 9d ea 46 0b ... encrypted.1392

START RECV leverage.5
(9652907354821126831 6395019458037742701) packet.(ENCRYPTED) timestamp.1426611465 [9877922d63288904] len.1392 -> 9652907354821126831
AUTHENTICATED packet from 6395019458037742701 len.11 leverage.5 numrounds.3 | hit 11.75%
GOT PACKET! (iello world).11
Logged
There are over 1000 people in SuperNET slack! http://slackinvite.supernet.org/ automatically sends you an invite

I am just a simple C programmer

jl777

  • Hero Member
  • *****
  • Karma: +718/-123
  • Offline Offline
  • Posts: 6170
    • View Profile
Re: Arithmetic coders are cool
« Reply #8 on: March 19, 2015, 12:50:56 pm »

small world networks are cool too!

#define MAXHOPS 6
#define MAXTESTPEERS 16
#define NETWORKSIZE 4096

finally convinced nanomsg to build a small world network topology, so I can test independent node algos in a single CPU that uses the same code as when using a real network. it is called middleware. and of course it is written in C

The problem with a randomly connected small world network is that it is making random connections, but this turns out to be the best as any sort of bias in how the nodes are connected reduces the coverage and increases the number of packets needed to be sent

I experimented with many approaches, but as usual the simplest is most effective and practical. At first it was doing almost 300 packets to get to a destination, but over time (~160K packets) it is down to average of 100 for the average search as it is learning the paths that work by storing the failures in bloom filters.

So that was my goal for this session, to get a properly simulated small world network using nanomsg. I even tested 16386 nodes and it has similar behaviors, but anything over that would not run locally. In the wild, you only have to run one node, but it is good to know that the actual overhead for this is so small.

Next step is to start adding the fancy encryption and various defenses to the simulated network in a way that I can merge it into the live network with minimal changes.

James

P.S. at 400K packets it is down to 60 packets per search, which is only 4 times the number of peers. 400K packets sounds like a lot, but across 4K nodes, that is just 100 packets per node, so this is quite rapid learning
Logged
There are over 1000 people in SuperNET slack! http://slackinvite.supernet.org/ automatically sends you an invite

I am just a simple C programmer
 

elective-stereophonic
elective-stereophonic
assembly
assembly