Thursday, May 2, 2013

The State of Mobile VOIP, part 2: SIP

I'm going to start this series discussing the granddaddy of videoconferencing - SIP. SIP stands for Session Initiation Protocol, and strictly speaking it only does call setup, but the acronym has come to refer to the entire protocol family. The family includes SDP - Session Description Protocol - which describes the media formats (G.711, Speex, etc.) that each client supports - and RTP - Real-time Transport Protocol - which carries the media itself over UDP and adds a sequence number, a timestamp, and a payload type to each packet.

SIP was designed by data comm guys so it "fits" quite well in the Internet world. It is extensible, unlike its proprietary counterparts. It is fairly simple (despite its reputation). In fact, the messages are plain text, laid out much like HTTP requests and responses. SIP typically runs over UDP port 5060, although TCP is allowed as well.
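To show how readable the format is, here is a minimal Python sketch that builds a bare-bones INVITE and parses it back. The function names, the addresses, and the tag, branch, and Call-ID values are all made up for illustration, and a real stack would add an SDP body and generate unique identifiers per call.

```python
def build_invite(caller, callee, local_ip, local_port=5060):
    """Return a minimal SIP INVITE as bytes (illustrative, no SDP body)."""
    user = caller.split("@")[0]
    lines = [
        f"INVITE sip:{callee} SIP/2.0",
        # The branch, tag and Call-ID values below are fixed placeholders;
        # a real stack generates fresh ones for every transaction.
        f"Via: SIP/2.0/UDP {local_ip}:{local_port};branch=z9hG4bK776asdhds",
        "Max-Forwards: 70",
        f"From: <sip:{caller}>;tag=1928301774",
        f"To: <sip:{callee}>",
        f"Call-ID: a84b4c76e66710@{local_ip}",
        "CSeq: 1 INVITE",
        f"Contact: <sip:{user}@{local_ip}:{local_port}>",
        "Content-Length: 0",
    ]
    # Header lines end with CRLF; an empty line ends the header section.
    return ("\r\n".join(lines) + "\r\n\r\n").encode("ascii")


def parse_message(raw):
    """Split a SIP message into its start line and a dict of headers."""
    head = raw.decode("ascii").split("\r\n\r\n", 1)[0]
    start_line, *header_lines = head.split("\r\n")
    headers = {}
    for line in header_lines:
        name, _, value = line.partition(":")
        headers[name.strip()] = value.strip()
    return start_line, headers


if __name__ == "__main__":
    message = build_invite("alice@example.com", "bob@example.org", "192.0.2.10")
    print(message.decode("ascii"))
```

Printing the message shows that everything needed to debug a call setup can be read straight off the wire with a packet sniffer.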

Because SIP was designed before NAT became prevalent, and because it advertises its own addresses inside its messages, it has to play games in order to work through a NAT router. When a SIP stack sends out its initial INVITE, the message carries the IP address and port on the originating host that media packets should be sent back to. A NAT router rewrites the source address and port of outgoing packets, so the values the host knows locally are not the ones the outside world sees. The SIP stack must first figure out which address and port are visible externally.

There are various protocols for discovering this. A commonly used one is STUN, originally "Simple Traversal of UDP through NATs" and later renamed "Session Traversal Utilities for NAT". First, the stack sends a packet from the media port to an external STUN server, typically on UDP port 3478. The server replies with the address and port that it saw the packet come from, i.e. the address and port that the NAT router translated the packet to.

When the SIP stack gets this reply, it can put the externally reachable address and port into the SIP initiation packet. This allows the called host to send media packets back to the originating host.
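The STUN exchange described above can be sketched in a few lines of Python. This is a rough illustration rather than a complete client: it follows the message layout of the newer STUN specification (RFC 5389), handles only IPv4, and does no retransmission. The function names are mine, and any server address you pass in is up to you.

```python
import os
import socket
import struct

MAGIC_COOKIE = 0x2112A442          # fixed value defined by RFC 5389
BINDING_REQUEST = 0x0001
BINDING_SUCCESS = 0x0101
ATTR_MAPPED_ADDRESS = 0x0001
ATTR_XOR_MAPPED_ADDRESS = 0x0020


def build_binding_request(transaction_id):
    """20-byte STUN header: type, body length (0), cookie, 12-byte ID."""
    return struct.pack("!HHI", BINDING_REQUEST, 0, MAGIC_COOKIE) + transaction_id


def parse_binding_response(data, transaction_id):
    """Return the (ip, port) the server saw our packet arrive from."""
    msg_type, length, cookie = struct.unpack("!HHI", data[:8])
    if (msg_type != BINDING_SUCCESS or cookie != MAGIC_COOKIE
            or data[8:20] != transaction_id):
        raise ValueError("not a matching STUN Binding success response")
    pos, end = 20, 20 + length
    while pos + 4 <= end:
        attr_type, attr_len = struct.unpack("!HH", data[pos:pos + 4])
        value = data[pos + 4:pos + 4 + attr_len]
        is_address = attr_type in (ATTR_XOR_MAPPED_ADDRESS, ATTR_MAPPED_ADDRESS)
        if is_address and len(value) >= 8 and value[1] == 0x01:  # 0x01 = IPv4
            port, addr = struct.unpack("!HI", value[2:8])
            if attr_type == ATTR_XOR_MAPPED_ADDRESS:
                # The XOR form hides the address from NATs that rewrite payloads.
                port ^= MAGIC_COOKIE >> 16
                addr ^= MAGIC_COOKIE
            return socket.inet_ntoa(struct.pack("!I", addr)), port
        pos += 4 + attr_len + (-attr_len % 4)  # attributes are 4-byte aligned
    raise ValueError("no IPv4 mapped address attribute found")


def discover_public_endpoint(sock, server, timeout=2.0):
    """Send one Binding request from `sock` and return the mapped (ip, port)."""
    transaction_id = os.urandom(12)
    sock.settimeout(timeout)
    sock.sendto(build_binding_request(transaction_id), server)
    data, _ = sock.recvfrom(2048)
    return parse_binding_response(data, transaction_id)
```

The important detail is that `sock` must be the same UDP socket, bound to the same local port, that will later carry the media. Only then does the mapping the server reports correspond to the port the SIP stack is about to advertise.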

[ pictures! ]

As mentioned before, the SIP INVITE contains an SDP payload. This payload lists all the codecs that the caller supports. If the called host accepts the call, it replies with the codec it prefers out of those the two sides have in common.
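Here is a small Python sketch of that negotiation, assuming a single audio stream. The SDP text, the function names, and the preference list are invented for illustration. Real SDP has many more attributes, and a real answer is itself a full SDP document rather than a single codec name.

```python
SAMPLE_OFFER = """\
v=0
o=alice 2890844526 2890844526 IN IP4 203.0.113.7
s=-
c=IN IP4 203.0.113.7
t=0 0
m=audio 40000 RTP/AVP 0 8 97
a=rtpmap:0 PCMU/8000
a=rtpmap:8 PCMA/8000
a=rtpmap:97 speex/8000
"""


def parse_audio_offer(sdp):
    """Return (media_port, [(payload_type, codec_name), ...]) in offer order."""
    port = None
    payload_types = []
    names = {}
    for line in sdp.splitlines():
        line = line.strip()
        if line.startswith("m=audio "):
            # m=audio <port> <transport> <payload type> <payload type> ...
            fields = line.split()
            port = int(fields[1])
            payload_types = fields[3:]
        elif line.startswith("a=rtpmap:"):
            # a=rtpmap:<payload type> <codec name>/<clock rate>
            payload_type, encoding = line[len("a=rtpmap:"):].split(None, 1)
            names[payload_type] = encoding.split("/")[0].upper()
    return port, [(pt, names.get(pt, pt)) for pt in payload_types]


def pick_codec(offered, preferences):
    """Pick the callee's most preferred codec that the caller also offered."""
    by_name = {name: pt for pt, name in offered}
    for name in preferences:
        if name.upper() in by_name:
            return by_name[name.upper()], name.upper()
    return None  # no codec in common, so the call cannot be accepted


if __name__ == "__main__":
    port, offered = parse_audio_offer(SAMPLE_OFFER)
    print("caller wants media on port", port)
    print("chosen codec:", pick_codec(offered, ["speex", "PCMU"]))
```

Note that the `m=audio` line is also where the media port appears, which is why the NAT discovery has to happen before the INVITE is sent.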

Once the call is established, media packets travel directly between the two endpoints. This is important, and different from protocols like Skype and RedPhone, where the packets first travel to a central server and then out to the recipient.

SIP has a variety of ways that it can encode the sound of voice into data. The classic way is to just do what the phone company does: G.711, which is 8-bit logarithmic samples at 8 kHz, or 64 kbit/s. More advanced methods can produce much better quality at a lower bitrate. They work by modelling the human vocal tract, a technique known as Linear Predictive Coding (LPC). These include the granddaddy GSM, and newer codecs like Speex (which is based on CELP) and SILK.
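To make "8-bit logarithmic samples" concrete, here is a Python sketch of the mu-law flavour of G.711 (the variant used in North America and Japan; A-law is the other). It follows the widely used segment-based encoding algorithm. The function names are mine, and real code would normally use a tested library rather than this.

```python
BIAS = 0x84      # 132, added so every magnitude has a bit set at position 7+
CLIP = 32635     # largest magnitude that still fits after adding BIAS

SAMPLE_RATE_HZ = 8000
BITS_PER_SAMPLE = 8
G711_BITRATE = SAMPLE_RATE_HZ * BITS_PER_SAMPLE   # 64000 bit/s


def linear_to_ulaw(sample):
    """Compress one signed 16-bit PCM sample into one mu-law byte."""
    sign = 0x80 if sample < 0 else 0x00
    magnitude = min(abs(sample), CLIP) + BIAS
    # The exponent (0..7) is the position of the highest set bit, counted
    # from bit 7. Each step up doubles the quantization step size, which
    # is what makes the scale logarithmic.
    exponent = magnitude.bit_length() - 8
    mantissa = (magnitude >> (exponent + 3)) & 0x0F
    # G.711 transmits the bits inverted.
    return ~(sign | (exponent << 4) | mantissa) & 0xFF


def ulaw_to_linear(byte):
    """Expand one mu-law byte back into a signed PCM sample."""
    byte = ~byte & 0xFF
    sign = byte & 0x80
    exponent = (byte >> 4) & 0x07
    mantissa = byte & 0x0F
    magnitude = (((mantissa << 3) + BIAS) << exponent) - BIAS
    return -magnitude if sign else magnitude


if __name__ == "__main__":
    for pcm in (0, 100, 1000, 10000, -10000):
        code = linear_to_ulaw(pcm)
        print(pcm, "->", hex(code), "->", ulaw_to_linear(code))
    print("bitrate:", G711_BITRATE, "bit/s")
```

Quiet samples are stored almost exactly while loud ones are rounded more coarsely, which matches how we hear and is why 8 bits are enough for telephone-quality speech.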

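Calls may be encrypted with SRTP (Secure RTP), and there are at least two common ways to agree on its keys. In the simplest, the keys are carried in the SDP itself, which in turn relies on the SIP signalling being protected by TLS. The other is ZRTP, which performs a Diffie-Hellman key exchange directly between the two endpoints and lets the callers verify it by reading a short authentication string to each other. No surprise it was invented by Phil Zimmermann, the creator of PGP.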

Next time: SIP clients for Android.
