“Heart Ache or Heart Attack” – Betting Your Life on Internet Voice Transmission

Animated Explanation of CODECs, Delay, Jitter, Echo & More

This tutorial is explained in a number of animations found at http://www.techtionary.com.

Jitter, stutter, echo and delay are among the major reasons for “road kill” in SIP-Session Initiation Protocol or VoIP-Voice over Internet Protocol.  Understanding the underlying causes is critical to designing a SIP network that works and stays working.  There are many emerging stories on how VoIP is not sucking all the bandwidth out of networks and support is not the nightmare that users imagined.  Yet, many of the users surveyed are just beginning to explore the bandwidth “hogs” that VoIP offers such as multi-media conferencing, IM-Instant Messaging, collaboration screen sharing, animation, video and others.  It’s not that the networks can’t handle it, but can they provide a level of quality that engages rather than disengages users?  Let’s do some math.  A simple 30-second video clip of 200×200 pixels consumes only approximately 1.2M of bandwidth.  A high-resolution 30-second video clip of 1280×800 screen resolution needs more than 30 megabits of transmission.  Even if the transmission is compressed, 20 users in a high-resolution conference would need 20 × 30M or 600M of real-time bandwidth, which could not only crash the LAN but clog or even knock out the WAN.
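The conference arithmetic above can be sketched in a few lines of Python.  This is only an illustration: the 30 megabits per stream and the 20 users come from the text, while the function names and the two LAN capacities are assumptions chosen for the example.

```python
# Rough aggregate-bandwidth check for simultaneous conference streams.
# The per-stream figure (30 Mbps) and user count (20) come from the article;
# the link capacities below are illustrative assumptions.

def aggregate_mbps(users: int, per_stream_mbps: float) -> float:
    """Total real-time bandwidth if every user sends one stream."""
    return users * per_stream_mbps


def utilization(demand_mbps: float, capacity_mbps: float) -> float:
    """Fraction of link capacity consumed; values above 1.0 mean overload."""
    return demand_mbps / capacity_mbps


if __name__ == "__main__":
    demand = aggregate_mbps(users=20, per_stream_mbps=30)
    print(f"Demand: {demand:.0f} Mbps")
    for name, capacity in [("100 Mbps LAN", 100), ("1 Gbps LAN", 1000)]:
        print(f"{name}: {utilization(demand, capacity):.0%} of capacity")
```

A utilization above 100% on any link in the path is the “crash or clog” condition described above.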

The purpose of this article and associated animations is to explain how to design a network for SIP now and for the future to avoid network catastrophes. Here are some but certainly not all of the various types of delays (queuing) that can occur in a SIP network.

Coding-CODEC-compression/decompression/decoding – DSP-Digital Signal processing – compression and analog-to-digital processing

A CODEC-COder-DECoder (also known as an encoder-decoder and, when used in video systems, a COmpression-DECompression system) is a computer chip (semiconductor) processing system designed to convert analog (human) signals to digital transmission and back to analog.  Source codecs are designed specifically for speech, whereas Waveform codecs work well with any type of sound.  That is, the audio or voice application drives the selection of a Source or Waveform CODEC.  Coding-encoding is the process of sampling voice, music or other sounds and quantizing the samples into digital values.  The Nyquist-Shannon sampling theorem states that the sampling frequency must be at least twice as high as the highest input frequency for the result to closely resemble the original signal.  A 4,000 Hz-Hertz voice pattern would therefore be sampled 8,000 times per second.  With each sample coded in 8 bits, one voice channel needs 64,000 BPS-Bits Per Second.  Organizing 23 or 24 separate voice channels (North American standards, or 30 or 32 for E-1 European standards) into frames of 24 8-bit samples, with a spacing bit (called a Framing bit) separating each frame, and sending 8,000 frames per second produces a T-1-Transmission Level One transmission circuit of 1,544,000 BPS.  For example, in MP3 (MPEG-Moving Picture Experts Group Audio Layer 3) different compression bit rates are needed for different music quality levels such as 128 KBPS – CD quality (twice normal bandwidth), 96 KBPS – near-CD quality and 64 KBPS – FM radio quality.  Analog-to-digital conversion is performed by a special microprocessor and memory device called a DSP.  The function of the DSP-Digital Signal Processor/Processing is to convert analog signals into digital values (1s-ones & 0s-zeros) for transmission.  A DSP may also perform other functions such as encapsulating digital transmission (1s/0s) into packets such as RTP-Real-time Transport Protocol, Ethernet, ATM-Asynchronous Transfer Mode or other data formats.
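The sampling and T-1 numbers above can be checked with a short calculation.  The sketch below assumes 8 bits per sample and one framing bit per 24-channel frame, which is the standard T-1 layout; the constant and function names are made up for the example.

```python
# Worked arithmetic for Nyquist sampling and the T-1 line rate.
# Assumptions: 8-bit PCM samples and 1 framing bit per 24-channel frame.

BITS_PER_SAMPLE = 8
T1_CHANNELS = 24
FRAMING_BITS_PER_FRAME = 1


def nyquist_rate_hz(max_frequency_hz: int) -> int:
    """Minimum sampling rate: twice the highest input frequency."""
    return 2 * max_frequency_hz


def channel_bps(sample_rate_hz: int) -> int:
    """Bit rate of one voice channel."""
    return sample_rate_hz * BITS_PER_SAMPLE


def t1_bps(sample_rate_hz: int) -> int:
    """One frame (24 samples plus framing bit) is sent per sampling interval."""
    bits_per_frame = T1_CHANNELS * BITS_PER_SAMPLE + FRAMING_BITS_PER_FRAME
    return bits_per_frame * sample_rate_hz


if __name__ == "__main__":
    rate = nyquist_rate_hz(4_000)
    print(rate, "samples per second")
    print(channel_bps(rate), "BPS per voice channel")
    print(t1_bps(rate), "BPS for a T-1")
```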

Voice Frame Size in MSEC, or voice sample size, forces the SIP (using RTP-Real-time Transport Protocol) designer to balance voice quality with potential network delays which could cause jitter or voice packet loss.  Various CODECs sample at 1 msec (millisecond – 1/1000 of a second), 10, 20 and 30 msec.  The larger the voice sample, the better the quality; however, the greater the potential delay (called latency) from other network voice and/or data packets, causing the voice to jitter.  SIP is the signaling protocol for real-time communication such as VoIP-Voice over IP.  SIP has been expanded to support video and instant-messaging applications.  SIP is designed to perform basic call-control tasks, such as session call set up and tear down and signaling for features such as call hold, caller ID, conferencing and call transferring.  However, with SIP, the intelligence for call setup and features resides on the SIP device or user agent, such as an IP phone or a PC with voice or instant-messaging software.  In contrast, traditional telephony or H.323-based telephony uses a model of intelligent, centralized phone switches with dumb phones, with SS7-Signaling System 7 in PSTN-Public Switched Telephone Network telephone switching and H.323 or Media Gateway Control Protocol in IP telephony providing call control/routing.  While there is a “modern” SIP standard, it is not finished.  In the past SIP has been proprietary among vendors.  Today, there is a recognized need for a new, comprehensive SIP standard to allow users to accomplish what they want to do, as opposed to vendors trying to force users to buy their proprietary products and work as the vendors want them to work.  For more go to www.sipforum.org
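The frame size trade-off can be made concrete by computing how many payload bytes each RTP packet carries and how much bandwidth the headers add.  This is a minimal sketch that assumes the standard header sizes of 12 bytes for RTP, 8 for UDP and 20 for IPv4, and it ignores Layer 2 overhead; the function names are illustrative.

```python
# Packetization cost versus voice frame size.
# Assumed header sizes: RTP 12 + UDP 8 + IPv4 20 = 40 bytes per packet.
# Layer 2 framing (Ethernet, PPP and so on) is deliberately left out.

HEADER_BYTES = 12 + 8 + 20


def payload_bytes(codec_bps: int, frame_ms: int) -> float:
    """Voice bytes carried in one packet of frame_ms milliseconds."""
    return codec_bps * frame_ms / 8000


def ip_bandwidth_bps(codec_bps: int, frame_ms: int) -> float:
    """Bandwidth at the IP layer, including per-packet header overhead."""
    packets_per_second = 1000 / frame_ms
    return (payload_bytes(codec_bps, frame_ms) + HEADER_BYTES) * 8 * packets_per_second


if __name__ == "__main__":
    for frame_ms in (10, 20, 30):
        bps = ip_bandwidth_bps(64_000, frame_ms)
        print(f"G.711 at {frame_ms} msec frames: {bps / 1000:.1f} KBPS on the wire")
```

Smaller frames add less packetization delay but spend more of the link on headers, which is the balance the designer has to strike.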

Here are some of the ITU-International Telecommunication Union www.itu.org standards and others used for voice compression.  Experience with international SIP/VoIP applications suggests that lower quality voice is preferred where bandwidth is very expensive or unavailable.

Standard     Algorithm         Bit Rates (KBPS)        Delay (msec)   Quality
G.711        PCM               48, 56, 64              <<1            High
G.723        MPE/ACELP         5.3-6.3                 70-100         Low
G.728        LD-CELP           16                      <<2            High
G.729        CS-ACELP          8                       25-35          Medium
G.729A       CS-ACELP          8                       25-35          Medium
G.722        Sub-band ADPCM    48, 56, 64              <<2            High
G.726        ADPCM             16, 24, 32, 40          60             Medium
G.727        Embedded ADPCM    16, 24, 32, 40          60             Medium
TrueVoice    Proprietary       2.4, 4                                 Medium
MS Audio     LD-CELP           5, 10, 16, 22, 32, 40                  Medium-High

CS-ACELP – conjugate-structure algebraic-code excited linear prediction

ADPCM – adaptive differential pulse code modulation

LD-CELP – low-delay code excited linear prediction

Compression algorithms operate by sampling voice and quantizing the analog sound into digital values.  G.711 is based on the traditional Nyquist-Shannon sampling theorem that the sampling rate must be at least twice as high as the highest input frequency for the result to closely resemble the original signal.  A 4,000 Hz-Hertz voice pattern would be sampled at a rate of 8,000 samples per second which, at 8 bits per sample, yields 64,000 BPS-Bits Per Second.  In G.711, there are two companding (COMpression-EXPANDing) standards: µ-Law (µ is the Greek letter mu) and A-Law.  G.711 µ-Law compresses 14-bit linear (additive) PCM-Pulse Code Modulation samples into 8-bit logarithmic (multiplicative) PCM code words.  G.711 A-Law compresses 13-bit linear (additive) PCM samples into 8-bit logarithmic (multiplicative) PCM code words.  The 13-to-8-bit A-Law transformation is used in Europe; the 14-to-8-bit µ-Law transformation is used in North America and Japan.  The difference between the two scales is that steps increase in an additive fashion on a linear scale (1, 2, 3, 4) and in a multiplicative fashion on a logarithmic scale (1, 2, 4, 8), so a logarithmic code spends more of its values on quiet sounds and fewer on loud ones.
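The logarithmic idea behind µ-Law can be illustrated with the textbook continuous companding formula, y = sign(x) · ln(1 + µ|x|) / ln(1 + µ) with µ = 255.  Note the assumption: this is the smooth curve that G.711 approximates, not the bit-exact segmented encoder defined in the standard, and the function names are invented for the sketch.

```python
import math

# Continuous mu-law companding curve (mu = 255).
# This is the textbook formula that G.711 approximates with segments;
# it is NOT a bit-exact G.711 encoder.

MU = 255


def mu_law_compress(x: float) -> float:
    """Map a linear sample in [-1, 1] onto a logarithmic scale in [-1, 1]."""
    return math.copysign(math.log1p(MU * abs(x)) / math.log1p(MU), x)


def mu_law_expand(y: float) -> float:
    """Inverse of mu_law_compress."""
    return math.copysign(math.expm1(abs(y) * math.log1p(MU)) / MU, y)


if __name__ == "__main__":
    for x in (0.01, 0.1, 0.5, 1.0):
        y = mu_law_compress(x)
        print(f"linear {x:5.2f} -> companded {y:.3f} -> restored {mu_law_expand(y):.3f}")
```

A quiet input at 1% of full scale lands more than a fifth of the way up the companded range, which is why 8 logarithmic bits can stand in for 13 or 14 linear ones.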

Avoid Transcoding

Changing the number of bits sampled and quantized can dramatically impact the voice quality.  However, LAN-Local Area Network and WAN-Wide Area Network bandwidth limitations may have an equal or greater impact.  Echo can also occur as a result of Asynchronous Transcoding.  Transcoding is the process of converting voice from one encoding to another, typically between circuit-switched (PSTN-Public Switched Telephone Network) and packet-switched networks such as Frame Relay, Internet and ATM-Asynchronous Transfer Mode.  Asynchronous Transcoding, however, is to be avoided.  According to Intel, “The term “asynchronous transcoding” refers to a situation when, for example, one endpoint is talking G.711 to another endpoint talking G.723 (two different encodings).”  The G.726 AD-PCM-Adaptive Differential Pulse Code Modulation codec processes voice in 2, 3, 4 (shown here), or 5 bits.  G.711 uses 8 bits per sample, or 64,000 BPS-Bits Per Second.  If the CODEC samples voice in 4 bits, half the number of bits needs only half the bandwidth: 32,000 BPS.  Instead of 24 voice channels, 4-bit samples would double the number of channels available on a T-1 to 48.  If the codec uses 2 bits, AD-PCM requires only 16 KBPS of bandwidth, but the speech sounds distorted, resulting in a low voice quality MOS-Mean Opinion Score or R-Rating.  G.726 is used when a low MOS or R-Rating is acceptable or when only low-speed circuits are available.  By transmitting the difference or change (or delta – Greek for change) in the value, instead of transmitting a full value, AD-PCM provides high-quality speech at sub-PCM bit rates as low as 16 KBPS-Kilo Bits Per Second.  Since most people’s voices do not change much in 125 microseconds (the interval between samples at 8,000 samples per second), or even over longer sample intervals, this method works well.
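The bit-rate arithmetic and the “send the difference” idea can both be sketched briefly.  Two assumptions are worth stating: the T-1 payload is taken as 24 × 64,000 BPS, and the delta coder below is plain differential coding, without the adaptive step size and predictor that make G.726 “adaptive.”  All names are illustrative.

```python
# Bit rate per codec and a plain (non-adaptive) delta coder.
# Assumptions: 8,000 samples per second; T-1 payload of 24 x 64,000 BPS.
# Real G.726 adds an adaptive quantizer and predictor, omitted here.

SAMPLE_RATE_HZ = 8_000
T1_PAYLOAD_BPS = 24 * 64_000


def codec_bps(bits_per_sample: int) -> int:
    """Bit rate of one voice channel at the given sample width."""
    return bits_per_sample * SAMPLE_RATE_HZ


def channels_on_t1(bits_per_sample: int) -> int:
    """How many voice channels fit in the T-1 payload."""
    return T1_PAYLOAD_BPS // codec_bps(bits_per_sample)


def delta_encode(samples: list[int]) -> list[int]:
    """Send the first value, then only the change from the previous sample."""
    deltas, previous = [], 0
    for sample in samples:
        deltas.append(sample - previous)
        previous = sample
    return deltas


def delta_decode(deltas: list[int]) -> list[int]:
    """Rebuild the samples by accumulating the changes."""
    samples, total = [], 0
    for delta in deltas:
        total += delta
        samples.append(total)
    return samples


if __name__ == "__main__":
    for bits in (8, 5, 4, 3, 2):
        print(f"{bits} bits: {codec_bps(bits)} BPS, {channels_on_t1(bits)} channels per T-1")
    print(delta_encode([100, 102, 101, 105]))
```

Because consecutive voice samples are close together, the deltas are small numbers that fit in fewer bits than the full sample values.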

Network Propagation Delay

Propagation or packet delay is simply the distance divided by the speed of light.  For example, it takes approximately 60 milliseconds to cross the continental U.S. plus multi-hop (router) delays.  Without any other delay, the RTT-Round Trip Time for a voice or data transmission between Boston and San Diego, a distance of 3,000 miles, would be ~32 msecs (milliseconds).  Transmission line delay is typically 1 msec per 100 miles of cable.  Serial port interface delay (telephone set) can range from 1 msec for a 128-byte packet on a 2 MBPS line to 130 msecs for a 1,024-byte packet on a 64 KBPS line.  SIP gateway delay can be 50-100 msecs and decompression delay is typically 10 msecs or less.  In other words, test the delays prior to implementation.  Delay is also influenced by human factors such as mouth-to-ear delay.
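These delay figures can be reproduced with a rough calculator.  The sketch assumes the speed of light in a vacuum (about 186,282 miles per second) for the best-case round trip and assumes the packet sizes quoted above are in bytes; the function names are hypothetical.

```python
# Back-of-envelope delay calculator.
# Assumptions: vacuum speed of light for the best case; packet sizes in bytes.

SPEED_OF_LIGHT_MILES_PER_SEC = 186_282


def propagation_ms(distance_miles: float) -> float:
    """One-way delay: distance divided by the speed of light."""
    return distance_miles / SPEED_OF_LIGHT_MILES_PER_SEC * 1000


def round_trip_ms(distance_miles: float) -> float:
    """Best-case RTT with no queuing, routing or processing delay."""
    return 2 * propagation_ms(distance_miles)


def cable_delay_ms(distance_miles: float, ms_per_100_miles: float = 1.0) -> float:
    """Rule of thumb from the text: about 1 msec per 100 miles of cable."""
    return distance_miles / 100 * ms_per_100_miles


def serialization_ms(packet_bytes: int, link_bps: int) -> float:
    """Time to clock one packet onto a serial line."""
    return packet_bytes * 8 * 1000 / link_bps


if __name__ == "__main__":
    print(f"Boston to San Diego RTT: {round_trip_ms(3000):.1f} msec")
    print(f"Cable rule of thumb, one way: {cable_delay_ms(3000):.0f} msec")
    print(f"128 bytes on 2 MBPS: {serialization_ms(128, 2_000_000):.2f} msec")
    print(f"1,024 bytes on 64 KBPS: {serialization_ms(1024, 64_000):.0f} msec")
```

Adding gateway and decompression delays on top of these gives a first estimate of the delay budget to verify by testing.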


Echo is caused by three principal factors: talker echo, listener echo and loss of interaction (human and cultural influences).  Echo is the reflection of the original signal back to the sender.  There are many causes for echo and it can occur in many types of network including SIP.  SIP-Session Initiation Protocol is a signaling protocol for internet conferencing, telephony, presence, events notification (emergency calling) and instant messaging.  Talker Echo disturbs the speaker who hears an attenuated and delayed echo of his/her voice.  This is caused by a reflection on the distant end.  EL-Echo Loss is defined as the ratio of the power (voltage) of the arriving voice signal to the power of the reflected echo signal expressed in dB (deciBels).  If there is no echo, the loss is infinity.  Listener Echo also influences the speaker who hears the signal from the other party followed by an attenuated echo of the signal.  Listener Echo is caused both by a reflection close to the speaker and a reflection on the distant end.  Loss of interaction is frustration by either party from the echo on the line.

Hybrid Echo is caused by an impedance mismatch within the 2-wire to 4-wire (T-1) conversion called the Hybrid.  Impedance is a measure (expressed in ohms) of the opposition (resistance) to the flow of electricity and changes value as the frequency of the electricity changes.  Capacitive reactance causes impedance to rise as the signal frequency decreases, whereas inductive reactance causes impedance to rise as the frequency increases.  The higher the impedance, the lower the current.  A typical POTS-Plain Old Telephone Service analog line is 600 ohms.

The bottom line is that echo drives people to use higher cost services because echo can destroy the ability of a person to process a conversation.  For example, there is a big difference between a police shooter hearing the command “Shoot!”, squeezing the trigger and taking the shot, and that same shooter hearing the rest of the sentence, “only on my command.”  Or consider hearing someone say “I am having heart ache” because the voice is garbled by delay, jitter and echo, when what was really said is “I am having a heart attack.”  In other words, design the SIP/VoIP network as carefully as the words you choose to say.
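The Echo Loss definition given earlier in this section can be written out as a small calculation.  This sketch assumes the usual decibel conventions: 10·log10 of a power ratio, or 20·log10 of a voltage ratio when both voltages are measured across the same impedance.  The function names are illustrative.

```python
import math

# Echo Loss in dB from signal and echo levels.
# Assumption: standard dB conventions (10*log10 for power, 20*log10 for
# voltage measured across the same impedance).


def echo_loss_db(signal_power: float, echo_power: float) -> float:
    """Ratio of arriving signal power to reflected echo power, in dB."""
    if echo_power == 0:
        return math.inf  # no echo at all: the loss is infinite
    return 10 * math.log10(signal_power / echo_power)


def echo_loss_db_from_voltage(signal_volts: float, echo_volts: float) -> float:
    """Same ratio expressed from voltages across equal impedances."""
    if echo_volts == 0:
        return math.inf
    return 20 * math.log10(signal_volts / echo_volts)


if __name__ == "__main__":
    print(f"{echo_loss_db(1.0, 0.001):.1f} dB")
    print(f"{echo_loss_db_from_voltage(1.0, 0.1):.1f} dB")
    print(echo_loss_db(1.0, 0))
```

The larger the number, the weaker the echo relative to the voice signal.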

In a SIP/VoIP network, the End, Short or Echo Tail is defined as the RTT-Round Trip Time from the gateway to the hybrid (2-wire-to-4-wire converter) and back to the gateway.  The time duration of the echo tail is referred to as the Tail Length.  However, a hybrid circuit does not create a brick-wall echo.  A Brick Wall echo is an echo where the response to a far end signal would be an echoed copy of that signal.  The EC-Echo Canceller needs only to compensate for the circuit-switched (TDM-Time Division Multiplexing) segment of the call.  A Line Echo Canceller is used with Short Tail problems.  A Network Echo Canceller is used with Long Tail problems.  A T-1 is a 4-wire circuit – 2-wire incoming and 2-wire outgoing transmission.  A PBX trunk is a 2-wire circuit – incoming/outgoing on the 2 wires.  Telephone line hybrids are four-winding (one primary and one secondary each for incoming and outgoing) transformers that provide a voltage-to-voltage conversion and internal isolation from the external telephone line.  For example, DC-Direct Current audio transformers are used to convert DC voltage to audio or speech through a hybrid transformer called a speaker.  Hybrids also provide electrical (voltage-lightning) isolation protection for the customer.  Normally the voltages on a telephone line are on the order of -48 V-Volts, but in some special cases there can be higher voltages present on the telephone line or between the equipment and telephone line, so the hybrid transformer must withstand quite high voltages to be safe in such circumstances.  The function of the hybrid is to couple a 2-wire line with the transmit and receive pairs of the 4-wire line found in a T-1 circuit.  Hybrid transformers come in two types: “wet” and “dry,” referring to whether they are designed to pass DC – wet transformers withstand DC currents without saturating, dry transformers do not.  Most modem-modulator-demodulator circuits use dry transformers.

Other Usual and Unusual Suspects

There are a few other places to investigate to determine the cause of problems.  Here are a few things to check:

Router tables can tell you how various types of packets and TCP-Transmission Control Protocol ports are processed.  TCP ports determine whether email, web surfing and other packets are processed, if at all, or blocked, such as by a firewall.  In addition, the network administrator can set the router to give priorities or queue packets for transmission.  Some of the more obscure reasons for SIP/VoIP problems result from delays in the telephone sets “rebooting” (some take 10 minutes to come back online).  Serialization is another unusual suspect often caused by various computer processing delays which could be in any device.  One of the more usual suspects is jitter/dejitter caused by buffer overload (router memory storage) or out-of-order packet processing/reprocessing as packets are processed out of arrival sequence.  Problems can also arise from the way packets are encapsulated or packetized.  Ethernet frames carrying IP packets, while often similar, come in variable sizes.  That is, encapsulation of Layer 5 RTP/SRTP over Layer 4 UDP over Layer 3 IP over Layer 2 Ethernet/Frame Relay/ATM/PPP/ISDN over Layer 1 Dialup/T-1/SONET is not performed the same way on different networks.  In addition, interior routing protocols such as OSPF, iBGP, IS-IS, RIP and RIPv2 can be a cause of voice quality problems.  Lastly, since voice is not the only packet type that can impact performance, continual network monitoring is critical.
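Jitter and out-of-order arrival can both be measured from packet timestamps and sequence numbers.  The sketch below uses the running interarrival jitter estimate defined for RTP in RFC 3550, J = J + (|D| - J)/16, where D is the change in transit time between consecutive packets.  The packet representation, the millisecond units and the function names are assumptions for illustration, and sequence-number wraparound is ignored.

```python
# Jitter estimate (RFC 3550 style) and a simple resequencing step.
# Assumptions: times in milliseconds, packets as (sequence, payload) tuples,
# no sequence-number wraparound handling.


def interarrival_jitter(send_ms: list[float], arrival_ms: list[float]) -> float:
    """Running estimate: J += (|D| - J) / 16 for each consecutive packet pair."""
    jitter = 0.0
    for i in range(1, len(send_ms)):
        transit_previous = arrival_ms[i - 1] - send_ms[i - 1]
        transit_current = arrival_ms[i] - send_ms[i]
        difference = abs(transit_current - transit_previous)
        jitter += (difference - jitter) / 16
    return jitter


def resequence(packets: list[tuple[int, str]]) -> list[tuple[int, str]]:
    """What a dejitter buffer does before playout: restore sending order."""
    return sorted(packets, key=lambda packet: packet[0])


if __name__ == "__main__":
    send = [0, 20, 40, 60, 80]
    arrive = [50, 70, 98, 110, 133]
    print(f"Estimated jitter: {interarrival_jitter(send, arrive):.2f} msec")
    print(resequence([(3, "c"), (1, "a"), (2, "b")]))
```

Tracking this estimate over time as part of network monitoring helps show whether the dejitter buffer is sized for the delay variation the network actually produces.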

These are some of the “gotcha” hazards resulting from “never thought about that.”  It is beyond the scope of this article to discuss vishing, DOS-Denial Of Service, SPIT-SPam over Internet Telephony and other “intentional” efforts by humans.  Designing a high performance, resilient and expandable LAN-Local Area Network and WAN-Wide Area Network with sufficient quality of service, choosing high performance hardware (telephone sets, switches, routers, cabling and other devices) for SIP/VoIP, and continually testing these NE-Network Elements are critical to successful implementations of SIP/VoIP.





