floating-point basics

A single-precision floating-point number (a float in C) takes 32 bits of memory. For example, the bit pattern

01000000101000000000000000000000

represents the number 5.0f. The format that hardware must implement to support floating-point numbers is defined by the IEEE 754 standard. A floating-point number is decoded by breaking the 32 bits into 3 fields: sign, exponent, and mantissa. For example, the same number 5.0f can be seen the following way:

0 | 10000001 | 01000000000000000000000

where the first field (1 bit) is the sign (1 means the number is negative), the second field (8 bits) is the exponent, and the rest (23 bits) is the mantissa.
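As a rough illustration of this field layout, the following Python sketch uses the standard struct module to reinterpret 5.0 as a 32-bit unsigned integer and slice out the three fields. The helper name float_fields is made up for this example.

```python
import struct

def float_fields(x):
    """Split a single-precision float into (sign, exponent, mantissa) bit fields."""
    # Pack as a big-endian 32-bit float, then read the same 4 bytes as an unsigned int.
    (bits,) = struct.unpack(">I", struct.pack(">f", x))
    sign = bits >> 31                # top bit
    exponent = (bits >> 23) & 0xFF   # next 8 bits
    mantissa = bits & 0x7FFFFF       # low 23 bits
    return sign, exponent, mantissa

sign, exponent, mantissa = float_fields(5.0)
print(f"{sign:01b} {exponent:08b} {mantissa:023b}")
# prints: 0 10000001 01000000000000000000000
```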

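The formula to compute the real value of a normalized number is

$$ x = (-1)^{s} \times 2^{e - 127} \times (1 + m) $$

where s is the sign bit, e is the exponent field read as an unsigned integer, and m is the mantissa read as a binary fraction in [0, 1).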

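The mantissa represents the fractional part of the number: bit i of the mantissa, counted from the left starting at 1, is divided by 2^i. For example, our number 5.0f has sign 0, exponent 10000001 (129 in decimal), and a mantissa in which only the second bit is set, so it is calculated the following way:

$$ (-1)^{0} \times 2^{129 - 127} \times \left(1 + \frac{0}{2^{1}} + \frac{1}{2^{2}} + \frac{0}{2^{3}} + \cdots + \frac{0}{2^{23}}\right) = 1 \times 4 \times 1.25 = 5.0 $$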
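The same calculation can be sketched in Python. This is only an illustration for normalized numbers (it ignores zeros, denormals, infinities, and NaN), and the function name decode_single is an assumption of this example.

```python
def decode_single(sign, exponent, mantissa):
    """Evaluate (-1)^sign * 2^(exponent - 127) * (1 + fraction) for a normalized number."""
    fraction = 0.0
    for i in range(1, 24):
        # Bit i, counted from the left of the 23-bit mantissa field.
        bit = (mantissa >> (23 - i)) & 1
        fraction += bit / 2**i
    return (-1) ** sign * 2.0 ** (exponent - 127) * (1 + fraction)

# Fields of 5.0f: sign 0, exponent 129, only the second mantissa bit set.
print(decode_single(0, 129, 1 << 21))  # 5.0
```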

Double-precision numbers are encoded the same way, but they have twice as many bits as a single-precision number (64 bits): the sign has 1 bit, the exponent 11 bits, and the mantissa the remaining 52 bits. Instead of the exponent bias of 127 used in single precision, the double-precision bias is 1023. The formula looks like this:

$$ x = (-1)^{s} \times 2^{e - 1023} \times (1 + m) $$
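A minimal Python sketch of the double-precision case follows, again limited to normalized numbers. The helper names double_fields and decode_double are hypothetical; Python's own float is a 64-bit double on common platforms, so the round trip is exact here.

```python
import struct

def double_fields(x):
    """Split a double into (sign, exponent, mantissa) bit fields."""
    (bits,) = struct.unpack(">Q", struct.pack(">d", x))
    sign = bits >> 63                  # top bit
    exponent = (bits >> 52) & 0x7FF    # next 11 bits
    mantissa = bits & ((1 << 52) - 1)  # low 52 bits
    return sign, exponent, mantissa

def decode_double(sign, exponent, mantissa):
    """Evaluate (-1)^sign * 2^(exponent - 1023) * (1 + mantissa / 2^52)."""
    return (-1) ** sign * 2.0 ** (exponent - 1023) * (1 + mantissa / 2**52)

s, e, m = double_fields(5.0)
print(s, e, m / 2**52)         # 0 1025 0.25
print(decode_double(s, e, m))  # 5.0
```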
