ISCID Forums



   
Author Topic: Shannon and normal distributions
Bruce Fast
Member
Member # 924

posted 11. April 2006 16:27
The "Does randomness ride on ignorance" thread has spawned a discussion, initiated by me, of Shannon's equation. To not dilute that meaningful thread, I have moved the Shannon conversation to its own thread.

If I understand correctly, Shannon differentiates information from randomness by analyzing compressibility. However, there are a lot of naturally occurring situations that produce a normal distribution (bell curve). The data plot to produce this common curve. The curve produces some information: median, mean, mode, standard deviation, skew. However, unlike with DNA and computer software, there is no information embodied in each individual data point.

Question for the experts: am I wrong in assuming that Shannon would view a normal distribution as being as information rich, or nearly so, as a sequence of DNA or a slice of a computer program? If so, does Shannon badly miss the point of what real information is?

David L. Hagen
Member
Member # 323

posted 12. April 2006 16:04
Bruce
Good to separate out discussion of Shannon information.

Please clarify/point out how/where:
quote:
Shannon differentiates information from randomness by analyzing compressibility.
I understood Shannon Information to be the information capacity of an information channel, and that this varied with the probability distribution of the "letters" of the particular code being used. Thus I understand "compressibility" being a measure of the reduction in information capacity due to the particular code used.

However, if we used a code with a uniform distribution of "letters," that would still give a Shannon Information measure of the communication capacity of that communication channel. As such I do not see how that would differentiate it from the measure of a sequence of random letters using the same code. Both would have the same "Shannon information." i.e., I thought "Shannon Information" was a measure of communication capacity not content. i.e., the envelope not the enclosed letter.

Thus, may I recommend differentiating between the issues of communication capacity and content, i.e., the envelope vs. the content, where content is looking at randomness vs. Complex Specified Information (CSI).

Albert Voie
Member
Member # 1941

posted 12. April 2006 18:11
From algorithmic information theory we have that
if H(X) is much less than X (in bits), X is ordered with little information content. If H(X) is not less than X, X is random with much information content.
That's all folks - it is impossible to distinguish a computer program from a random string of bits by these definitions.
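As a rough illustration of the definitions above: H(X) is not computable in general, so the sketch below uses Python's zlib as a crude stand-in for it. That substitution, the sample strings, and the helper name compression_ratio are assumptions of this example, not part of algorithmic information theory.

```python
import random
import zlib


def compression_ratio(data: bytes) -> float:
    """Compressed size divided by original size (zlib as a crude proxy for H(X)/|X|)."""
    return len(zlib.compress(data, 9)) / len(data)


rng = random.Random(0)  # fixed seed so the run is repeatable

ordered = b"AB" * 2048                                   # highly ordered string
noise = bytes(rng.getrandbits(8) for _ in range(4096))   # pseudo-random bytes

print(f"ordered: {compression_ratio(ordered):.3f}")
print(f"noise:   {compression_ratio(noise):.3f}")
```

The ordered string shrinks to a small fraction of its length while the noise does not shrink at all. A program that has already been compressed would score close to the noise, which is the sense in which these definitions alone cannot tell the two apart.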

The "content" would be meaning (semantics) and rules of inference (syntax). In life the rules of inference correspond to e.g. the mechanism of a molecular machine. If we had a way to quantify all amino acid sequences that returned a functional machine we could calculate, or estimate the information content in it (if we assume that information has something to do with probabilities). Semantics is worse because signs and symbols are completely arbitrary. The information content is infinite (if we still stick to probabilities). Further, semantics seems to presuppose semantics as I have discussed in my paper.

Bruce Fast
Member
Member # 924

posted 13. April 2006 14:17
My understanding from other sources is that Shannon's equation is considered by many biologists as the bellwether measure of information. If so, and if Shannon cannot tell the difference between a normal distribution and the precise ordering of information that exists in DNA and in computer code, then these biologists are hurtin' bad when it comes to truly grasping the significance of information.

I fail to understand this level of ignorance, however, because we use information so extensively. I, of course, produced a whole lot of information in this blurb, already exceeding UPB, I'm sure. Maybe we are just so close to information that we can't easily focus on it.

David L. Hagen
Member
Member # 323

posted 14. April 2006 10:07
Albert
quote:
If H(X) is not less than X, X is random with much information content. . . .it is impossible to distinguish a computer program from a random string of bits by these definitions.
I agree that neither Shannon information nor Kolmogorov complexity may be able to distinguish between a highly complex computer program and randomness. However, in your definition please distinguish between randomness and CSI.
i.e. recommend stating:

If H(X) is much less than X (in bits), X is highly ordered with little randomness and/or Complex Specified Information, and little Shannon information capacity or Kolmogorov complexity.

If H(X) is a little less than X, X has little order, with high randomness and/or Complex Specified Information, and X has a high Shannon information capacity or Kolmogorov complexity.

[ 20. August 2006, 20:39: Message edited by: David L. Hagen ]

secondclass
Member
Member # 1957

posted 24. April 2006 19:32
Regarding a highly complex computer program, I'm not sure which complexity metric is intended (there are many), but I submit that algorithmic information theory can distinguish a computer program, or any other meaningful information, from random strings.

Most programming languages, even machine language, have syntactical constraints. If we combine a valid program with the syntax rules of the language, the result will be more compressible than a string of gibberish combined with the same rules. (For languages with no syntax rules, detection is trivial, as all strings are valid programs.) If we want to detect complexity, throw a formal description of complexity into the mix and use the same method. To detect natural language text, just combine the string in question with a dictionary and compare its compressibility with a gibberish/dictionary combination.

In general, this method can detect any property that lends itself to formal description. If the information in our heads can be expressed formally, the possibilities become very interesting.
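The dictionary method described above can be sketched with an ordinary compressor. This is only an approximation under stated assumptions: zlib stands in for ideal compression, the sixteen-word "dictionary" and the sample sentence are hypothetical, and extra_bytes is a made-up helper name.

```python
import random
import zlib


def extra_bytes(context: bytes, candidate: bytes) -> int:
    """Bytes added to the compressed size when candidate is appended to context."""
    return len(zlib.compress(context + candidate, 9)) - len(zlib.compress(context, 9))


# Hypothetical miniature "dictionary"; a real test would use a full word list.
words = ["information", "compress", "measurement", "distribution", "sequence",
         "program", "random", "protein", "entropy", "channel", "language",
         "syntax", "describe", "contains", "without", "every"]
dictionary = (" ".join(words) + " ").encode()

text = (b"every program contains information without random syntax "
        b"every protein sequence contains information "
        b"every measurement contains entropy ")

# Gibberish of the same length, drawn from lowercase letters and spaces.
rng = random.Random(1)
alphabet = "abcdefghijklmnopqrstuvwxyz "
gibberish = "".join(rng.choice(alphabet) for _ in range(len(text))).encode()

print("text:     ", extra_bytes(dictionary, text))
print("gibberish:", extra_bytes(dictionary, gibberish))
```

The text built from dictionary words costs fewer extra bytes than the gibberish, because the compressor can point back at words it has already seen.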

Bruce Fast
Member
Member # 924

posted 24. April 2006 21:21
secondclass: "If we combine a valid program with the syntax rules of the language..."
Hmmm, two points.

1. What if we don't know the syntax rules of the language? The order of DNA is clearly a language, but do we "know the rules of the language" well enough to obtain a compression advantage? To some extent I bet we do, but only to some extent.

2. Once we know the "rules of the language", we still discover that we cannot completely compress a computer program. That which cannot be compressed, even after we apply the "rules of the language" that we know, is the actual information content of the program, yes?

secondclass
Member
Member # 1957

posted 25. April 2006 11:41
Bruce,

1. I'm pretty clueless when it comes to DNA, but if we don't know the rules of the language, we can use examples of valid strings instead, and the rules will be reflected in the way that the aggregate string is maximally compressed. (In practice, we need the rules to guide us in compression, so this is purely theoretical.) In general, if K(A+B) is less than K(A)+K(B), where the former plus sign means concatenation, we can conclude that A and B adhere, at least partially, to some common rules.

2. That's correct. When the syntax is separated out, the information that remains is what makes the program distinct from other programs.
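The K(A+B) versus K(A)+K(B) comparison in point 1 can be tried out with a real compressor, with the caveat that K is uncomputable and zlib is only a stand-in for it. The two code fragments, the function names c and shared, and the random comparison string are all assumptions of this sketch.

```python
import random
import zlib


def c(data: bytes) -> int:
    """Compressed length in bytes; a computable, imperfect stand-in for K."""
    return len(zlib.compress(data, 9))


def shared(a: bytes, b: bytes) -> int:
    """C(A) + C(B) - C(A+B): bytes saved by compressing A and B together."""
    return c(a) + c(b) - c(a + b)


a = b"for i in range(10):\n    print(i)\n    total = total + i\n"
b = b"for j in range(20):\n    print(j)\n    total = total + j\n"

rng = random.Random(2)
r = bytes(rng.getrandbits(8) for _ in range(len(b)))  # unrelated pseudo-random bytes

print("shared(a, b):", shared(a, b))
print("shared(a, r):", shared(a, r))
```

Because zlib adds a fixed overhead to every stream, a small saving shows up even for unrelated strings, so the meaningful signal is that the two fragments following common rules save noticeably more than the fragment paired with random bytes.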

Christopher D. Beling
Member
Member # 723

posted 06. May 2006 11:12
Bruce, you made the comment:
quote:
However, there are a lot of naturally occurring situations that produce a normal distribution (bell curve). The data plot to produce this common curve. The curve produces some information: median, mean, mode, standard deviation, skew. However, unlike with DNA and computer software, there is no information embodied in each individual data point.

and from the replies I have seen, no one has addressed the issue you raise. Please let me have a go - and in doing so one of my aims is to cast some doubt on that last statement, "there is no information embodied in each individual data point".
Let's focus in on a specific example (students measuring the length of something in a class) that would produce a "bell curve". [Mathematically such "bell curves" are referred to either as Gaussians or "normal distributions" (although because you bring in "skew" I see you are perhaps talking more generally).] The example I take is 100 students in a class being asked to measure the length of a metal rod in the laboratory with a meter rule. They are told to measure with an accuracy of 0.1 mm. Are they all going to come up with the same number? NO: there will be a scatter, and as all experimentalists know, it will have a Gaussian distribution (with a mean and standard deviation, the mean being the "best value" of the measurement and the standard deviation the "error"). Now here each student that takes a measurement is providing one piece of new information on the length of the rod. Do you agree? It can also easily be shown that with the 100 students presenting measurements, the error on the mean (which is also distributed as a Gaussian "bell" curve) is reduced by a factor of sqrt(100) = 10, so that accuracy has been increased (more information obtained) on the actual length of the rod by more measurement.
Let's take a made up sample of data:
Length (cm)   Bin    No. of students   Prob
50.10         0000    0                0
50.11         1000    0                0
50.12         0100    0                0
50.13         1100    0                0
50.14         0010    0                0
50.15         1010    0                0
50.16         0110    0                0
50.17         1110    0                0
50.18         0001   25                0.25
50.19         1001   25                0.25
50.20         0101   25                0.25
50.21         1101   25                0.25
50.22         0011    0                0
50.23         1011    0                0
50.24         0111    0                0
50.25         1111    0                0

Now this is not a "bell" (Gaussian) distribution but a square shape distribution. I choose it first to demonstrate a point - before moving to the more realistic "bell" case. The total number of "bins" is 16=2^4. That is, each student's measurement gives log2(16)=4 bits of information. In the words of David Hagen the envelope of possible observation events has a Shannon Entropy of 4 bits. Each student's measurement causes a collapse on to one specific event (bin) [the content in David's envelope] that is expressed as 4 binary digits - and which we can refer to as having a Shannon information content of 4 bits. [When we talk about the "envelope" we should use the word "entropy", and when we talk about specified event(s) the word "content"]

We would be wrong, though, after a single measurement to conclude that we had 4 bit accuracy on the length of the rod. The distribution over 4 bins tells us this. In fact we only need 2 binary digits to specify the length (i.e. the 3rd and the 4th) - so that the Shannon Information content is only 2 bits. In general we may use this formula for Shannon Information Content of a set of events:

I = log2{W} + Sum over i of [p(i)log2{p(i)}]

where W is the reference class size (16), and the second term is never positive [because p(i) never exceeds 1]. In the above case one gets:

I=4 - 2 = 2 bits
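For anyone wanting to check the arithmetic, here is a minimal sketch of the formula applied to the square distribution. The function name information_content and the convention of skipping empty bins (treating 0*log2(0) as 0) are choices of this sketch.

```python
from math import log2


def information_content(counts, reference_size):
    """I = log2(W) + sum_i p(i) * log2(p(i)), skipping empty bins (0 * log 0 := 0)."""
    total = sum(counts)
    probs = [n / total for n in counts if n > 0]
    return log2(reference_size) + sum(p * log2(p) for p in probs)


# Square distribution from the table above: 25 students in each of 4 of the 16 bins.
square_counts = [0] * 8 + [25, 25, 25, 25] + [0] * 4

print(information_content(square_counts, reference_size=16))  # 2.0
```

If all 100 students landed in a single bin, the same function would return the full 4 bits.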

Now we can proceed to the "bell" curve - with a possible example like this
Length (cm)   Bin    No. of students   p(i)   -p(i)log2p(i)
50.10         0000    0                0      0
50.11         1000    0                0      0
50.12         0100    0                0      0
50.13         1100    0                0      0
50.14         0010    1                0.01   0.0664
50.15         1010    2                0.02   0.1129
50.16         0110    4                0.04   0.1858
50.17         1110    7                0.07   0.2686
50.18         0001   17                0.17   0.4347
50.19         1001   20                0.20   0.4644
50.20         0101   20                0.20   0.4644
50.21         1101   16                0.16   0.4230
50.22         0011    7                0.07   0.2686
50.23         1011    3                0.03   0.1518
50.24         0111    2                0.02   0.1129
50.25         1111    1                0.01   0.0664
                                              SUM = 3.02 bits

Thus according to the formula the information content is:

I=4-3.02= 0.98 bit

which is quite a lot less than the square distribution - but of course the information content depends on the standard deviation. It also depends on the number of reference bins in the envelope. For example if we were to take bins from 0 cm to 100 cm in the same bin sizes (0.01 cm) we would get 10,000 bins and a Shannon Entropy of 13.3 bits. Now application of the equation gives us a Shannon information content of:

I= 13.3 - 3.02 = 10.28 bits.

This shows a much higher information content - and is perhaps more indicative of the specification to which we are measuring.
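The bell-curve figures can be reproduced the same way. The sketch below recomputes the entropy from the student counts in the table and then subtracts it from log2(W) for the two reference classes; the variable names are this sketch's own.

```python
from math import log2

# Student counts per 0.01 cm bin, 50.10 cm to 50.25 cm, copied from the table above.
bell_counts = [0, 0, 0, 0, 1, 2, 4, 7, 17, 20, 20, 16, 7, 3, 2, 1]

total = sum(bell_counts)
entropy = -sum((n / total) * log2(n / total) for n in bell_counts if n > 0)

for reference_size in (16, 10_000):
    info = log2(reference_size) - entropy
    print(f"W = {reference_size:>6}: H = {entropy:.2f} bits, I = {info:.2f} bits")
```

With W = 10,000 the result comes out a shade under the 10.28 bits quoted, only because log2(10000) is closer to 13.29 than to 13.3.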

[ 06. May 2006, 13:10: Message edited by: Christopher D. Beling ]

Christopher D. Beling
Member
Member # 723

posted 06. May 2006 13:27
APPLICATION TO PROTEINS
In the previous post the "bell curve" was analysed in terms of information content - over a certain reference class of possibilities - with regard to measurement. This same analysis can be applied directly to the AA (Amino Acid) alphabet in proteins. The tabulation above with its 16 entries can in this case be seen as 16 AAs (I know in reality there are 20 - but for sake of ease of argument we can consider there to be only 16). At one particular protein site-

(i) there could be a very high specificity - i.e. only one AA can make the protein function - in this case the Shannon information content for that site is

I= log2(16)-log2(1) = 4 bits.

(ii) there could be 4 AAs that would do an equally good job at making the protein function (i.e. the square distribution above) in which case

I= log2(16)-log2(4) = 2 bits

(iii) there could be the "bell curve" of AAs with the probability in some way indicating the effectiveness of the respective AA at the site. In this case

I= log2(16)- 3.02 = 0.98 bit, i.e. about 1 bit

So here we are seeing a varying degree of information content per site of the protein chain. You can see we are dealing with only 1-4 bits of Shannon information content per site, but for a protein of 100 AA units, this amounts to 200-300 bits of info. That's a lot! The specification of a protein is tight - much tighter than the students measuring a length with 13 bit accuracy. But the important point is that this is just a change in the degree (amount) of Shannon information content in the object we are specifying (here the total protein - in the previous post the length of the rod).
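The three per-site cases can be put side by side in a few lines. This is a sketch under the post's own simplification of a 16-letter alphabet; the function name site_information and the idea of multiplying by 100 identical sites are illustrative assumptions, not a model of any real protein.

```python
from math import log2

ALPHABET_SIZE = 16  # the post's simplification; real proteins use 20 amino acids


def site_information(weights):
    """Bits at one site: log2(alphabet size) minus the entropy of the acceptable residues."""
    total = sum(weights)
    entropy = -sum((w / total) * log2(w / total) for w in weights if w > 0)
    return log2(ALPHABET_SIZE) - entropy


cases = {
    "(i) one residue works": [1],
    "(ii) four equally good": [1, 1, 1, 1],
    "(iii) bell-shaped": [1, 2, 4, 7, 17, 20, 20, 16, 7, 3, 2, 1],
}

for label, weights in cases.items():
    bits = site_information(weights)
    print(f"{label}: {bits:.2f} bits/site, {100 * bits:.0f} bits for 100 identical sites")
```

A chain that mixes tightly and loosely constrained sites would land somewhere between these extremes.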

[ 06. May 2006, 13:31: Message edited by: Christopher D. Beling ]

David L. Hagen
Member
Member # 323

posted 06. May 2006 22:31
Thanks for exploring this Chris.

On your measurement example, if the measurements are normally distributed, I think of the distribution as having a mean and a standard deviation, e.g., a relative deviation (the standard deviation divided by the mean) or a resolution (the mean divided by the standard deviation).
(For a technical discussion see NIST Technical Note 1297 by B.N. Taylor and C.E. Kuyatt, "Guidelines for Evaluating and Expressing the Uncertainty of NIST Measurement Results".)

I can understand how the uncertainty and resolution improve with the square root of the number of measurements of the rod.

quote:
but of course the information content depends on the standard deviation. It also depends on the number of reference bins in the envelope.
In your example of going from 4 bins to 10,000 bins I can see how the "envelope" is increased as the Shannon entropy or capacity to transmit information. However, in both your examples, the "content" appears to be measured with the same resolution of 0.01 cm/100 cm or parts per 10,000 (though you have apparently increased the size of the "envelope" from 4 bins to 10,000). This confuses the example. Your 4 bins example does not seem to match the resolution of the data or vice versa, e.g., I would expect an example with a resolution of only 1/4.

The content of the message is the length of the rod with a mean and standard deviation. The same parts per 10,000 resolution information could be transmitted within a Shannon Entropy envelope ranging from 14, to 40, or to 400 bits long. Content <= Envelope.

[ 07. May 2006, 22:28: Message edited by: David L. Hagen ]

Christopher D. Beling
Member
Member # 723

posted 08. May 2006 12:23
David, I admit this is confusing. The same thing worries me - that the information transmitted on the length of the bar should not really depend on the reference class (envelope) of possibilities (i.e. the number of bins, which I had increased from 16 bins to 10,000 bins). As you rightly point out, one could keep on increasing the reference class size indefinitely and in so doing make the measurement look more and more accurate. The confusing thing is that the measurement is not getting more accurate.

I think one sees this kind of thing at play when one looks at "fine-tuning" in the cosmological setting. One defines a certain window (delta) in some physical parameter, but one needs to know the maximum extent (R) over which the parameter can vary in order to tie down a probability (delta/R) for falling within delta - where one has to assume a uniform chance hypothesis. The number of bits associated with the single astrophysical parameter will be -log2(delta/R). As with my example this will depend on the value of R (annoyingly!).
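The dependence on R is easy to see numerically. In the sketch below the window width and the three candidate extents are hypothetical values chosen only to show the trend, and window_bits is a made-up helper name.

```python
from math import log2


def window_bits(delta, extent):
    """Bits for landing in a window of width delta within a uniform range of width extent."""
    return -log2(delta / extent)


delta = 0.02  # hypothetical window width, same units as the extents below
for extent in (1.0, 100.0, 10_000.0):
    print(f"R = {extent:>8}: {window_bits(delta, extent):.2f} bits")
```

Every hundredfold increase in R adds log2(100), roughly 6.6 bits, even though nothing about the window itself has changed.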

One way out of this predicament is to take R to be the actual value of the observed parameter (xbar), giving an info content of -log2(delta/xbar) bits. I guess this gives a lower bound for the specification in bits of the parameter, and is thus a conservative estimate. In the metal rod example given in my previous post - one could use the same strategy by working with the relative (fractional) error.

I(1)=-log2[sigma/xbar]=-log2[0.02/50]=11bits

and no more!

One would also expect that the information content of the measurement (as given by the mean) would increase with number N of measurements. Perhaps there is a formula something like

I(N)=-log2[sigma/(sqrtN*xbar)]

At present I cannot see how to show this and I feel there must be a way using information theory. I'm sure somebody somewhere must have written on this subject? Chris

[ 08. May 2006, 12:37: Message edited by: Christopher D. Beling ]

David L. Hagen
Member
Member # 323

posted 08. May 2006 21:35
Thanks Chris - much more believable.

Suggest inverting the order to ensure that sqrtN refers just to N and not N*xbar in your compact notation.

I(N)=-log2[sigma/(xbar*sqrtN)]
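To see what this formula would imply, here is a small sketch that evaluates it for the rod figures used earlier in the thread (sigma = 0.02 cm, xbar = 50 cm). The formula is the conjecture under discussion, not an established result, and measurement_bits is a hypothetical helper name.

```python
from math import log2, sqrt


def measurement_bits(sigma, mean, n=1):
    """Conjectured I(N) = -log2(sigma / (mean * sqrt(N))); n=1 gives I(1)."""
    return -log2(sigma / (mean * sqrt(n)))


sigma, mean = 0.02, 50.0  # cm, the rod figures used in the thread

for n in (1, 100):
    print(f"N = {n:>3}: {measurement_bits(sigma, mean, n):.2f} bits")
```

Under this conjecture, 100 measurements would add log2(10), a little over 3 bits, to the single-measurement figure.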

Your latest observation on information is key -
the measurement and resolution as "information content" is less than or equal to the Shannon Entropy (or "Shannon information").

i.e. I understand information content to be independent of Shannon Entropy except that the Shannon Entropy is an upper bound to the information content (or Complex Specified Information CSI to be precise).
-------------------------------------------------

On your Application to Proteins, recommend reviewing
Hubert P. Yockey, "Information Theory, Evolution, and the Origin of Life", 2005, ISBN-13 978-0-521-80293-2
In Ch. 4 Yockey reviews Shannon's (1948) theory of sequences of length N drawn from an alphabet of A symbols, and the effect of differing probabilities across that alphabet.

Yockey further develops the Shannon-McMillan-Breiman Theorem for calculating the actual "information content", e.g. for a sequence of 113 codons as 371.4 bits or 1.4 * 10^111 when accounting for the variation in genetic codon probabilities (rather than 20^113 or 1.0 * 10^147 in a superficial calculation of genetic probabilities).

(With numbers, digital or binary "alphabets" have an equal probability for all symbols, so the Shannon-McMillan-Breiman theorem is not as significant as it is for proteins.)



All times are East Coast  

All content © ISCID and content contributor 2001-2003

The ISCID Forums are aimed at generating insight into the nature of complex systems (e.g. biological complexity, organizational complexity, etc.) and the ontological status of purpose, especially from the vantage point of various information- and design-theoretic models.
