  • Applicability (or "Who cares about news?")

    • Content delivery is difficult thanks to the Bandwidth Crunch

    • Any distributed content delivery service is likely to be more efficient but requires significant thought and planning

    • Servers must continue to be economically feasible in order for providers to consider the service

    • Overall scalability, fault-tolerance, and distributed service techniques are equally applicable to other services such as Web caching

    • DSL/Cable customers who eat your news server bandwidth are not eating expensive transit bandwidth

  • History: the rationale behind abandoning the traditional paradigm

    • INN: the advantages

      • Traditional design, many problems already solved even for a cluster configuration (e.g., XREPLIC)

      • Immense level of server independence in the face of individual reader machine failure

      • Plenty of documentation available

      • Probably the most popular news server available

      • Passes the test of time

      • Development (now) continuing at a steady pace; many significant issues, such as the storage API and overview performance, have been addressed

      • Significant legacy code base of my own to exploit

    • INN: the disadvantages

      • Full disk pile on each reader machine

        • Difficult to upgrade, requires readers to be removed from service for a long period to recover sufficient content

        • Expensive to expand, have to add equal numbers of disks on each machine

        • Low transaction utilization of binaries drives on each machine

      • Lots of bandwidth required to keep each reader machine in sync, cluster configurations are fine but distributed environments are wasteful

      • Monolithic, single-threaded design of innd can be a performance bottleneck

      • Individual process per reader can be taxing on the server

      • Somewhat dated

    • INN: attempts at future directions

      • First attempt: building a dedicated binaries fileserver

        • Complications controlling expire policy

        • Devil is in the details

      • Second attempt: abstracting the reader system

        • Too many legacy assumptions about the spool and storage

        • Matt Dillon begins work on dreaderd, which embodies most of what was needed to abstract the reader system as needed

  • Comparable works: what are other people doing to solve the problem?

    • NFS and INN: a good way to do a cluster?

      • Many sites deploy a NetApp and NFS-mounted reader servers to serve readers

      • The advantages:

        • Scales very well for a smaller, single site

        • Scales well at larger sites, until the NFS server begins to overload, at which time performance across the system as a whole plummets

        • Easy to implement, essentially the traditional model with a minor twist

        • Good resource sharing (binaries space benefits everyone)

      • The disadvantages:

        • NFS server is a single point of failure; clients cannot mount multiple spools, both because the INN software doesn't understand how to do that and because NFS tends to lock clients up when the server goes away

        • NFS transaction load can only scale so far, server can only scale so far

        • History file transaction load can drive the NFS server to its knees, and since clients need at least read-only access to the history, it must live on NFS

        • NFS server can only scale to so many disks

        • Once that limit is exceeded, back to designing a multi-spool cluster

        • NFS is very chatty and inefficient network-wise, meaning that remote reader machines are not feasible across WAN links

        • NFS operations are much more expensive than local disk operations, capping performance even on a local LAN

        • That dictates a cluster design where all readers are close to the NFS server, meaning that the reader machines may not be close to the client

    • Typhoon: a good way to provide news service?

      • The advantages:

        • Low maintenance, easy to design and set up

        • Very high performance on a single machine

        • Many operational parameters can be applied on a per host/network/user basis

        • Very reliable in a single server situation

      • The disadvantages:

        • Deemed to be too expensive due to per-connection licensing

        • Closed-source

        • May not be as optimal in a distributed environment

        • Chaining does not offer any kind of resilience if the back-end master server goes away or fails

        • Limited platform availability

  • Other options

    • No other obvious options that allow near-linear scaling of an operation

  • Back to the drawing board: ideal system requirements

    • Distributed: server should transparently reside near the end user, while not eating up unnecessary bandwidth

    • Fault-tolerant: system should be able to provide continuous service through either a planned or unplanned outage of any component

    • Inexpensive: build redundantly using less expensive FreeBSD-based PC architecture, yet total cost for servers should remain less than that of traditional non-Intel based UNIX server hardware

    • Scalable: should be able to handle additional users by the addition of more front-end reader machines, upgrading to larger reader machines, or in the case of the storage subsystem, the simple addition of disks

    • Very large scale: should be able to scale into the multi-terabyte range at a reasonable cost

    • Other: should be able to handle high-bandwidth users, including the high demands of DSL and cable subscribers, with ease

  • The paradigm shifts involved

    • Spool: move away from a single, local spool, to a remote NNTP model

      • Reader no longer has the responsibility of filing and storing individual articles while at the same time taking care of clients trying to read news

      • Allows for redundancy and topological distribution of spools and front-end reader servers

      • Allows for clever engineering, such as having one central spool with really long retention, and multiple smaller distributed spools: spend additional bandwidth to reduce the cost (and capacity) of remote spools, or vice versa

      • Use efficient NNTP for transit instead of NFS, retrieving by Message-ID: instead of by /news/spool/path, eliminating the need to have the same software at each level of the new distributed network

      • Spool server becomes conceptually trivial, being simply a network appliance that knows how to store and retrieve Message-IDs, which can be done with INN, Typhoon, or other packages in addition to/instead of Diablo

      • Spool becomes the major big ticket item in this model, and may be shared among many reader machines (even across WAN links)

      • Downside: history lookup becomes a potential bottleneck, as all article operations now involve a history lookup, although new techniques are substantially less demanding than traditional history mechanisms
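
      The sketch below illustrates the spool-side idea just described: the spool server's whole job reduces to mapping a Message-ID to a stored location. This is a toy in-memory index in Python, not Diablo's actual dhistory format; the file name and sample article are made up.

      ```python
      # Toy illustration of a spool server's core job: map a Message-ID to the
      # location of the stored article.  Diablo's real history is a hashed
      # on-disk file; a dict stands in for it here to show why lookups are cheap.
      import os

      class ToySpool:
          def __init__(self, path="spool.dat"):           # hypothetical spool file
              self.path = path
              self.index = {}                             # Message-ID -> (offset, length)
              open(self.path, "ab").close()               # make sure the file exists

          def store(self, msgid, article_bytes):
              offset = os.path.getsize(self.path)         # append position
              with open(self.path, "ab") as f:
                  f.write(article_bytes)
              self.index[msgid] = (offset, len(article_bytes))

          def retrieve(self, msgid):
              loc = self.index.get(msgid)                 # constant-time history lookup
              if loc is None:
                  return None                             # would become "430 no such article"
              offset, length = loc
              with open(self.path, "rb") as f:
                  f.seek(offset)
                  return f.read(length)

      spool = ToySpool()
      spool.store("<example.1@news.example.net>", b"Path: ...\r\n\r\nbody\r\n")
      assert spool.retrieve("<example.1@news.example.net>") is not None
      ```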

    • Reader: move towards maintaining overview and handling end-user

      • Reader specializes in handling just the overviews, which simplifies code and reduces the complexity/cost of I/O subsystem

      • Reader retrieves articles from the spool servers by taking the Message-ID from the overview and iterating through the spool servers looking for it (see the retrieval sketch after this list)

      • Reader optionally caches articles, potentially increasing performance for second-hit while lowering actual bandwidth requirements

      • Reader only requires a "header-only" feed to populate overviews, plus access to a spool server: an insignificant amount of bandwidth for baseline operations, plus at most whatever the client would have spent to fetch the article from a remote server under other design models

      • By transferring an article onto the reader once and caching it, further bandwidth savings are possible, since many clients may download the same articles over and over again

      • Cost to implement a reader plummets, allowing for more readers, and cost to maintain a reader (bandwidth) drops, allowing them to be placed closer to end-user

      • This also improves perceived performance for the end user, since the reader server is positioned close to the end user
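
      A minimal sketch of the retrieval loop referenced above, assuming plain NNTP over raw sockets and made-up spool host names: the reader takes a Message-ID from its overview data and asks each spool server in turn with ARTICLE <message-id>, accepting the first 220 response.

      ```python
      # Sketch of a reader fetching an article by Message-ID from a list of spool
      # servers.  Host names are placeholders; error handling is deliberately minimal.
      import socket

      SPOOL_SERVERS = ["spool1.example.net", "spool2.example.net"]   # nearest first

      def _line(reader):
          return reader.readline().decode("latin-1").rstrip("\r\n")

      def fetch_article(msgid, servers=SPOOL_SERVERS, port=119):
          for host in servers:
              try:
                  with socket.create_connection((host, port), timeout=10) as sock:
                      r = sock.makefile("rb")
                      _line(r)                                       # greeting (200/201)
                      sock.sendall(f"ARTICLE {msgid}\r\n".encode("ascii"))
                      if not _line(r).startswith("220"):             # e.g. 430: not on this spool
                          continue
                      lines = []
                      while True:
                          line = _line(r)
                          if line == ".":                            # end of multi-line response
                              break
                          lines.append(line[1:] if line.startswith("..") else line)
                      return "\r\n".join(lines)
              except OSError:
                  continue                                           # dead spool, try the next one
          return None                                                # not available anywhere
      ```

      A caching reader would consult its local article cache, keyed by the same Message-ID, before falling through to this loop (see the cache sketch under "Caching" below).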

  • High level overview of the implementation

    • Build transit servers up at strategic network points with large quantities of bandwidth available

    • Build spool servers at these same strategic points

        Design tradeoffs:

      • Build redundancy at the server level, rather than trying to leverage RAID5 and take a performance hit

      • Multiple spool servers less likely to fail than a single RAID5

    • Data flows from Usenet to transit servers, from there back to Usenet and also to spool servers, with the transit servers separating the content by classification (text vs. binaries) and feeding to the appropriate spool servers

    • Build a centralized "infeed" system to handle article numbering and spam filter policy (see the numbering sketch after this list)

    • Data flows from spool servers to infeed system (feeds from remote spool clusters are delayed)

        Design tradeoffs:

      • Less data transited around the WAN; however, Xref: data is not present on the spools

      • Infeed system non-redundant due to requirement that all articles be numbered in a monotonically increasing fashion (area for future improvement)

    • Build and distribute individual reader machines

    • Headers flow from infeed system to individual reader machines

    • Outbound posts return to one of two outbound post processing servers for logging and spam-filtering, and then to transit servers, and onto Usenet itself
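
    As referenced above, a sketch of what the centralized infeed's numbering job amounts to: per-group counters that only ever increase, emitted as an Xref: header. The host name is hypothetical, and a real numberer would also have to persist its counters across restarts, which is omitted here.

    ```python
    # Sketch of centralized article numbering: each newsgroup gets a strictly
    # increasing article number, recorded in the Xref: header.  This single
    # point of numbering is what makes the infeed non-redundant in this design.
    from collections import defaultdict

    class Numberer:
        def __init__(self, hostname="infeed.example.net"):    # hypothetical host name
            self.hostname = hostname
            self.counters = defaultdict(int)                   # group -> last number assigned

        def assign(self, newsgroups):
            pairs = []
            for group in newsgroups:
                self.counters[group] += 1                      # monotonically increasing
                pairs.append(f"{group}:{self.counters[group]}")
            return f"Xref: {self.hostname} " + " ".join(pairs)

    n = Numberer()
    print(n.assign(["alt.test", "misc.test"]))   # Xref: infeed.example.net alt.test:1 misc.test:1
    print(n.assign(["alt.test"]))                # Xref: infeed.example.net alt.test:2
    ```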

  • Lower level details of the implementation

    • Centralized configuration management

      • A way to update config files without logging in on dozens of machines

      • Setting up per-machine variables for items such as the closest spool server
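
      One plausible shape for this, sketched below with invented file names and variables: keep a single template plus a small per-machine table, generate each host's file centrally, and push it out with whatever copy mechanism the site already uses.

      ```python
      # Sketch of centralized config generation: one template, one table of
      # per-machine variables, one generated file per host.  Names and settings
      # are illustrative, not Diablo's actual configuration syntax.
      from string import Template

      TEMPLATE = Template("""\
      # generated centrally -- do not edit on the server
      hostname    $host
      spoolserver $nearest_spool
      maxperuser  $max_conns
      """)

      PER_HOST = {
          "reader1.example.net": {"nearest_spool": "spool1.example.net", "max_conns": 4},
          "reader2.example.net": {"nearest_spool": "spool2.example.net", "max_conns": 4},
      }

      for host, settings in PER_HOST.items():
          rendered = TEMPLATE.substitute(host=host,
                                         nearest_spool=settings["nearest_spool"],
                                         max_conns=settings["max_conns"])
          with open(f"{host}.conf", "w") as out:    # then push with rdist/rsync/scp
              out.write(rendered)
      ```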

    • Load balancing

      • DNS used rather than a protocol-level redirector product

      • Coarse load balancing possible with minor changes to Diablo to report current utilization statistics, combined with a nameserver that algorithmically generates server lists based on source IP address
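
      A sketch of the selection logic such a nameserver might apply (addresses, loads, and the closeness rule are invented): each reader reports a utilization figure, and the answer list is ordered by load with a bias toward servers near the client's source address.

      ```python
      # Sketch of coarse, DNS-driven load balancing: order the candidate readers
      # by reported utilization, biased toward the client's part of the network.
      # A real implementation would plug this into the nameserver's answers.
      READERS = [
          # (name, current utilization 0.0-1.0, network prefix it is "close" to)
          ("reader1.example.net", 0.35, "10.1."),
          ("reader2.example.net", 0.80, "10.2."),
          ("reader3.example.net", 0.20, "10.2."),
      ]

      def ordered_answer(client_ip):
          def score(entry):
              _, load, prefix = entry
              nearby = client_ip.startswith(prefix)
              return load - (0.5 if nearby else 0.0)     # lower score sorts first
          return [name for name, _, _ in sorted(READERS, key=score)]

      print(ordered_answer("10.2.3.4"))    # nearby, lightly loaded readers listed first
      ```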

    • Private networking

      • Use private Ethernet or, better, ATM for communications within the server system

        • ATM: large (9K) packet size, allows routing independent of IP network and avoids loading down IP routers

        • Lowers the load on the server IP stack: a 1MB article at a 1500-byte MTU is roughly 699 packets, versus roughly 114 at 9180 bytes

        • ATM cell tax is sort of a downside, but for an ISP whose backbone is already ATM, no worse than the routed IP network scenario

    • Caching

      • Readers capable of caching articles if desired

      • Mid-level caches deployable at strategic spots in lieu of a full set of spool servers
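
      A toy version of the cache such a reader or mid-level box might keep, with an arbitrary byte budget; dreaderd's real cache layout is different, this just shows the keep-by-Message-ID, evict-oldest idea.

      ```python
      # Toy article cache keyed by Message-ID with a byte budget and least-
      # recently-used eviction.  Budget and policy are illustrative only.
      from collections import OrderedDict

      class ArticleCache:
          def __init__(self, max_bytes=512 * 1024 * 1024):
              self.max_bytes = max_bytes
              self.used = 0
              self.entries = OrderedDict()                 # Message-ID -> article bytes

          def get(self, msgid):
              data = self.entries.get(msgid)
              if data is not None:
                  self.entries.move_to_end(msgid)          # mark as recently used
              return data

          def put(self, msgid, data):
              if msgid in self.entries:
                  return
              self.entries[msgid] = data
              self.used += len(data)
              while self.used > self.max_bytes:            # evict least recently used
                  _, evicted = self.entries.popitem(last=False)
                  self.used -= len(evicted)
      ```

      A reader would consult get() before falling back to the spool-fetch loop sketched earlier, and put() whatever it retrieves.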

  • Server design

    • Standardized server platforms using rack-mount PC cases, a small number of base platform types, and swappable drive modules

      • All hardware of same type, simplifying OS build process, minimal individual customization of machines

      • Rapid replacement of broken system and/or upgradeability of a too-slow system via chassis swap

      • OS encapsulated on one drive, data on remainder, allows for rapid update of OS by module replacement in the field

    • Reader, cache, and infeed machines are mid-level servers with a 9GB boot drive and two fast 18GB data drives, striped for /news

    • Spool server machines are high-end servers with external disk shelves (2 shelves x 9 drives x 18GB for text, 4 x 9 x 50GB for binaries, 1 SCSI bus per shelf)

      • One minor concession to on-machine redundancy: since the text spool can retain ~180 days of text, losing the history would be a pain, so the /news partition is mirrored

      • Use Diablo spooldir patch to create multiple spool drives, so that a loss of one drive does not wreck the entire spool, but rather only a portion of it

      • Take small portion of each data drive, stripe and mirror, to create a relatively small but very fast /news partition optimized for history lookups

      • History lookups are not a serious issue, and are quite fast compared to the traditional Usenet storage model

      • Lower the number of inodes, and use techniques on the transit servers to minimize how many inodes are actually required

        • Faster crash recovery

        • Faster newfs too
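
      The arithmetic behind lowering the inode count, with purely illustrative densities and drive size (the real numbers depend on the newfs parameters chosen): fewer inodes is what makes both newfs and post-crash recovery faster.

      ```python
      # Rough arithmetic: raising the bytes-per-inode density cuts the inode
      # count on a large spool drive by roughly an order of magnitude.
      drive_bytes = 50 * 10**9                     # one 50GB binaries spool drive
      for bytes_per_inode in (8192, 65536):
          inodes = drive_bytes // bytes_per_inode
          print(f"{bytes_per_inode:6d} bytes/inode -> {inodes:,} inodes")
      # output:
      #   8192 bytes/inode -> 6,103,515 inodes
      #  65536 bytes/inode -> 762,939 inodes
      ```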

  • Problems that came up

    • Diablo refused to serve articles during the weekly history rebuild; clients would receive "Article not found"

      • Thanks to redundancy, simply stagger the days on which history is rebuilt

      • Newer solution involves marking a feed object as read-only in dnewsfeeds

    • Network problems between readers and spool servers would tend to decimate performance

      • Duplex issues on Ethernet

      • Cell loss on ATM with Cisco equipment

      • Routing policy changes between network portions that are not directly connected, but rather go over the public net

      • Make sure the network works!

      • Use ATM directly where possible to avoid Cisco ATM issues

    • Binaries - rapidly decreasing expire times

      • Must keep readers up-to-date with the available retention, or clients get "Article not found" for the oldest articles listed on spool

    • Open server - a fantastic resource for system testing

    • alt.* disappears

      • Software defaults assumed that dreaderd would be running in tandem with diablo; automated maintenance meant for dreaderd nuked all of alt.* on my master numbering server because diablo doesn't tweak LMTS

  • Possible improvements to diablo

    • Redesigned spool storage mechanism - per-group/type storage

    • Redesigned dreaderd caching system - possibly use diablo format

    • Access control improvements - fetching per-user/class options, additional ACL types such as DB methods

    • Provide dreaderd with hints as to which spool(s) are most likely to contain an article

    • Better detection of dead or misbehaving spool servers

  • Other distributed news architectures

    • Matt's model - ISP buys caching dreaderd and gets head feed

    • Resource sharing model - ISPs of similar size share resources for redundancy

    • Alternative fetch model - ISP uses some other source for the binaries articles - maybe an outsourcer for older articles

  • Future news directions

    • Thoughts on a caching news proxy

  • Questions

    Clever possibilities: get a Supernews account and use Supernews as your long-term binaries backing store, thanks to the MID retrieval methodology. Or share with another ISP. Or leech off your connectivity provider.