-
Applicability (or "Who cares about news?")
-
Content delivery is difficult thanks to the Bandwidth Crunch
-
Any distributed content delivery service is likely to be more
efficient but requires significant thought and planning
-
Servers must continue to be economically feasible in order for
providers to consider the service
-
Overall scalability, fault-tolerance, and distributed service
techniques are equally applicable to other services such as
Web caching
-
DSL/Cable customers who eat your news server bandwidth are not
eating expensive transit bandwidth
-
History: the rationale behind abandoning the traditional paradigm
-
INN: the advantages
-
Traditional design, many problems already solved even for
a cluster configuration (e.g., XREPLIC)
-
Immense level of server independence in the face of
individual reader machine failure
-
Plenty of documentation available
-
Probably the most popular news server available
-
Passes the test of time
-
Development (now) continuing forward at a steady pace
and many significant issues such as storage API and
overview performance have been addressed
-
Significant legacy code base of my own to exploit
-
INN: the disadvantages
-
Full disk spool required on each reader machine
-
Difficult to upgrade, requires readers to be removed
from service for a long period to recover sufficient
content
-
Expensive to expand, have to add equal numbers of
disks on each machine
-
Low transaction utilization of binaries drives on
each machine
-
Lots of bandwidth required to keep each reader machine
in sync, cluster configurations are fine but distributed
environments are wasteful
-
Monolithic, single-threaded design of innd can be a
performance bottleneck
-
Individual process per reader can be taxing on the server
-
Somewhat dated
-
INN: attempts at future directions
-
First attempt: building a dedicated binaries fileserver
-
Complications controlling expire policy
-
Devil is in the details
-
Second attempt: abstracting the reader system
-
Too many legacy assumptions about the spool and
storage
-
Matt Dillon begins work on dreaderd, which embodies
most of what was needed to abstract the reader
system as needed
-
Comparable works: what are other people doing to solve the problem?
-
NFS and INN: a good way to do a cluster?
-
Many sites deploy a NetApp filer with the spool NFS-mounted
on the reader machines that serve clients
-
The advantages:
-
Scales very well for a smaller, single site
-
Scales well at larger sites, until the NFS server
begins to overload, at which time performance across
the system as a whole plummets
-
Easy to implement, essentially the traditional
model with a minor twist
-
Good resource sharing (binaries space benefits
everyone)
-
The disadvantages:
-
The NFS server is a single point of failure; clients
cannot mount multiple spools, both because the INN
software doesn't understand how to do that and
because NFS tends to cause clients to lock up when
the server goes away
-
NFS transaction load, and therefore the server, can only
scale so far
-
History file transaction load can drive the NFS
server to its knees, and since clients need at least
read-only access to the history file, it must live on NFS
-
NFS server can only scale to so many disks
-
Once that limit is exceeded, back to designing a
multi-spool cluster
-
NFS is very chatty and inefficient network-wise,
meaning that remote reader machines are not feasible
across WAN links
-
NFS operations are much more expensive than local disk
operations, capping performance even on a local LAN
-
That dictates a cluster design where all readers are
close to the NFS server, meaning that the reader
machines may not be close to the client
-
Typhoon: a good way to provide news service?
-
The advantages:
-
Low maintenance, easy to design and set up
-
Very high performance on a single machine
-
Many operational parameters can be applied on a per
host/network/user basis
-
Very reliable in a single server situation
-
The disadvantages:
-
Deemed to be too expensive, per-connect licensing
-
Closed-source
-
May not be as well suited to a distributed environment
-
Chaining does not offer any kind of resilience if the
back-end master server goes away or fails
-
Limited platform availability
-
Other options
-
No other obvious options that allow near-linear scaling of an
operation
-
Back to the drawing board: ideal system requirements
-
Distributed: server should transparently reside near the end user,
while not eating up unnecessary bandwidth
-
Fault-tolerant: system should be able to provide continuous service
through either a planned or unplanned outage of any component
-
Inexpensive: build redundantly using less expensive FreeBSD-based PC
architecture, yet total cost for servers should remain less than that
of traditional non-Intel based UNIX server hardware
-
Scalable: should be able to handle additional users by the addition
of more front-end reader machines, upgrading to larger reader
machines, or in the case of the storage subsystem, the simple addition
of disks
-
Very large scale: should be able to scale into the multi-terabyte
range at a reasonable cost
-
Other: should be able to handle high-bandwidth users, including the
high demands of DSL and cable subscribers, with ease
-
The paradigm shifts involved
-
Spool: move away from a single, local spool, to a remote NNTP model
-
Reader no longer has the responsibility of filing and
storing individual articles while at the same time taking
care of clients trying to read news
-
Allows for redundancy and topological distribution of spools
and front-end reader servers
-
Allows for clever engineering, such as having one central
spool with really long retention, and multiple smaller
distributed spools: spend additional bandwidth to reduce
the cost (and capacity) of remote spools, or vice versa
-
Use efficient NNTP for transit instead of NFS, retrieving
by Message-ID: instead of by /news/spool/path, eliminating the
need to have the same software at each level of the new
distributed network
-
Spool server becomes conceptually trivial, being simply a
network appliance that knows how to store and retrieve
Message-ID's, which can be done with INN, Typhoon, or other
packages in addition to/instead of Diablo
-
Spool becomes the major big ticket item in this model, and
may be shared among many reader machines (even across WAN
links)
-
Downside: history lookup becomes a potential bottleneck, as
all article operations now involve a history lookup, although
new techniques are substantially less demanding than
traditional history mechanisms
-
Reader: move towards maintaining overview data and handling end-user connections
-
Reader specializes in handling just the overviews, which
simplifies code and reduces the complexity/cost of I/O
subsystem
-
Reader retrieves articles from the spool server by taking the
Message-ID from the overview and iterating through spool
servers looking for it (see the fetch sketch at the end of
this section)
-
Reader optionally caches articles, potentially increasing
performance for second-hit while lowering actual bandwidth
requirements
-
Reader only requires a "header only" feed to populate
overviews, plus access to a spool server; baseline
bandwidth is insignificant, and the worst case adds only
whatever the client would have spent fetching the article
from a remote server under other design models
-
By transferring the article onto the reader once, and caching
it, bandwidth savings also accrue, since many clients may
download the same articles over and over again
-
Cost to implement a reader plummets, allowing for more
readers, and cost to maintain a reader (bandwidth) drops,
allowing them to be placed closer to end-user
-
This also improves the end user's perception of performance,
since the reader server is positioned close to the end user
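A minimal sketch of that fetch path, assuming hypothetical spool host names: the reader takes a Message-ID from its overview data and walks an ordered list of spool servers, issuing the standard NNTP ARTICLE command until one answers 220 (article follows) rather than 430 (no such article). Caching, authentication, and error handling beyond the basics are omitted.

    import socket

    # Hypothetical spool hosts; the real list would come from the reader's configuration.
    SPOOL_SERVERS = ["spool1.example.net", "spool2.example.net"]

    def _read_line(f):
        return f.readline().decode("latin-1").rstrip("\r\n")

    def fetch_from_spool(host, msgid, port=119, timeout=30):
        """Ask one spool server for an article by Message-ID using NNTP ARTICLE."""
        with socket.create_connection((host, port), timeout=timeout) as s:
            f = s.makefile("rb")
            _read_line(f)                               # 200/201 server greeting
            s.sendall(("ARTICLE %s\r\n" % msgid).encode("latin-1"))
            status = _read_line(f)
            if not status.startswith("220"):            # 430 means "no such article here"
                return None
            lines = []
            while True:
                line = _read_line(f)
                if line == ".":                         # lone dot ends the article
                    break
                if line.startswith(".."):               # undo NNTP dot-stuffing
                    line = line[1:]
                lines.append(line)
            return "\n".join(lines)

    def fetch_article(msgid):
        """Walk the spool list until one of the servers has the article."""
        for host in SPOOL_SERVERS:
            try:
                article = fetch_from_spool(host, msgid)
            except OSError:
                continue                                # dead spool, try the next one
            if article is not None:
                return article
        return None                                     # reader would answer 430 to its client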
-
High level overview of the implementation
-
Build transit servers up at strategic network points with large
quantities of bandwidth available
-
Build spool servers at these same strategic points
Design tradeoffs:
-
Build redundancy at the server level, rather than trying
to leverage RAID5 and take a performance hit
-
Multiple spool servers less likely to fail than a single
RAID5
-
Data flows from Usenet to transit servers, from there back to
Usenet and also to spool servers, with the transit servers
separating the content by classification (text vs. binaries) and
feeding to the appropriate spool servers
-
Build a centralized "infeed" system to handle article numbering
and spam filter policy (a numbering sketch follows at the end
of this section)
-
Data flows from spool servers to infeed system (feeds from remote
spool clusters are delayed)
Design tradeoffs:
-
Less data transited around the WAN; however, Xref: data is
not present on the spools
-
Infeed system non-redundant due to requirement that all
articles be numbered in a monotonically increasing fashion
(area for future improvement)
-
Build and distribute individual reader machines
-
Headers flow from infeed system to individual reader machines
-
Outbound posts return to one of two outbound post processing
servers for logging and spam-filtering, and then to transit
servers, and onto Usenet itself
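A toy illustration (not Diablo's actual code) of the numbering job the infeed box performs: keep a monotonically increasing high-water mark per newsgroup and stamp each accepted article with an Xref: header so every reader agrees on article numbers. The in-memory dictionary and host name are placeholders for whatever the real system persists.

    # Hypothetical numbering sketch; a real infeed would persist these counters on disk.
    HIGH_WATER = {}                    # newsgroup -> last article number assigned
    PATHHOST = "infeed.example.net"    # placeholder for the numbering host's name

    def assign_numbers(newsgroups):
        """Hand out the next article number in each group, monotonically increasing."""
        numbers = {}
        for group in newsgroups:
            HIGH_WATER[group] = HIGH_WATER.get(group, 0) + 1
            numbers[group] = HIGH_WATER[group]
        return numbers

    def xref_header(newsgroups):
        """Build the Xref: header that readers use to populate their overviews."""
        numbers = assign_numbers(newsgroups)
        pairs = " ".join("%s:%d" % (g, n) for g, n in numbers.items())
        return "Xref: %s %s" % (PATHHOST, pairs)

    # xref_header(["alt.test", "misc.test"])
    #   -> "Xref: infeed.example.net alt.test:1 misc.test:1"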
-
Lower level details of the implementation
-
Centralized configuration management
-
A way to update config files without logging in on dozens
of machines
-
Setting up per-machine variables for items such as the
closest spool server
-
Load balancing
-
DNS used rather than a protocol-level redirector product
-
Coarse load balancing is possible with minor changes to Diablo
to report current utilization statistics, combined with a
nameserver that algorithmically generates server lists based
on source IP address (see the load-balancing sketch at the
end of this section)
-
Private networking
-
Use private ethernet or, better, ATM for communications
within the server system
-
ATM: large (9K) packet size, allows routing
independent of IP network and avoids loading down
IP routers
-
Lowers the load on the server IP stack: a 1MB article
takes roughly 700 packets at a 1500-byte MTU versus about 115 at 9180 bytes
-
ATM cell tax is sort of a downside, but for an
ISP whose backbone is already ATM, no worse than
the routed IP network scenario
-
Caching
-
Readers capable of caching articles if desired
-
Mid-level caches deployable at strategic spots in lieu of
a full set of spool servers
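The talk only says that a patched Diablo reports utilization and that a nameserver builds server lists from the client's source address; the weighting below is an invented illustration of that idea, with hypothetical host names and load figures.

    import hashlib

    # Hypothetical reader pool and the utilization each (patched) server reports.
    READERS = {
        "reader1.example.net": 0.35,   # fraction of capacity in use
        "reader2.example.net": 0.80,
        "reader3.example.net": 0.20,
    }

    def server_list(client_ip, max_load=0.90):
        """Return reader hosts for this client, least-loaded first.

        Loads are bucketed coarsely; the client's source address breaks ties
        deterministically, so a client keeps landing on the same server while
        loads stay similar."""
        def key(host):
            tie_break = hashlib.md5((client_ip + host).encode()).hexdigest()
            return (round(READERS[host], 1), tie_break)
        eligible = [h for h, load in READERS.items() if load < max_load]
        return sorted(eligible or READERS, key=key)

    # A DNS front end would return the first name or two as A records, e.g.
    # server_list("203.0.113.7") might yield reader3 first, then reader1.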
-
Server design
-
Standardized server platforms using rack-mount PC cases, a small
number of base platform types, and swappable drive modules
-
All hardware of same type, simplifying OS build process,
minimal individual customization of machines
-
Rapid replacement of broken system and/or upgradeability of
a too-slow system via chassis swap
-
OS encapsulated on one drive, data on remainder, allows for
rapid update of OS by module replacement in the field
-
Reader, cache, and infeed machines are mid-level servers with a 9GB
boot drive and two fast 18GB data drives, striped for /news
-
Spool server machines are high-end servers with external disk
shelves (2 shelves x 9 drives x 18GB for text, 4 x 9 x 50GB for
binaries, 1 SCSI bus per shelf)
-
One minor concession to on-machine redundancy: since the
text spool can retain ~180 days of text, losing the history
would be a pain, so the /news partition is mirrored
-
Use Diablo spooldir patch to create multiple spool drives,
so that a loss of one drive does not wreck the entire
spool, but rather only a portion of it
-
Take small portion of each data drive, stripe and mirror,
to create a relatively small but very fast /news partition
optimized for history lookups
-
History lookups are not a serious issue, and are quite fast
compared to the traditional Usenet storage model
-
Lower the number of inodes, and use techniques on the
transit servers to minimize the number of required inodes
-
Faster crash recovery
-
Faster newfs too
-
Problems that came up
-
Diablo refused to serve articles during the weekly history rebuild;
clients would receive "Article not found"
-
Thanks to redundancy, simply stagger the days on which
history is rebuilt
-
Newer solution involves marking a feed object as read-only in dnewsfeeds
-
Network problems between readers and spool servers would tend to
decimate performance
-
Duplex issues on Ethernet
-
Cell loss on Cisco ATM interfaces
-
Routing policy changes between network portions that are
not directly connected, but rather go over the public net
-
Make sure the network works!
-
Use ATM directly where possible to avoid Cisco ATM issues
-
Binaries - rapidly decreasing expire times
-
Must keep readers up-to-date with the available retention,
or clients get "Article not found" for the oldest articles
listed on spool
-
Open server - a fantastic resource for system testing
-
alt.* disappears
-
Software defaults assumed that dreaderd would be running
in tandem with diablo; automated maintenance meant for
dreaderd nuked all of alt.* on my master numbering server
because diablo doesn't tweak LMTS
-
Possible improvements to diablo
-
Redesigned spool storage mechanism - per-group/type storage
-
Redesigned dreaderd caching system - possibly use diablo format
-
Access control improvements - fetching per-user/class options,
additional ACL types such as DB methods
-
Provide dreaderd with hints as to which spool(s) are most likely
to contain an article (see the sketch after this list)
-
Better detection of dead or misbehaving spool servers
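A purely speculative sketch of what such spool hints might look like (this is not existing Diablo behavior): order the spools a reader tries by content class and by whether the article's age still falls inside each spool's retention window. The host names and retention figures are illustrative only.

    # Hypothetical per-spool descriptions: what each carries and how long it keeps it.
    SPOOLS = [
        {"host": "spool-text.example.net", "classes": {"text"},     "retention_days": 180},
        {"host": "spool-bin1.example.net", "classes": {"binaries"}, "retention_days": 14},
        {"host": "spool-bin2.example.net", "classes": {"binaries"}, "retention_days": 7},
    ]

    def spool_order(article_class, article_age_days):
        """Return spool hosts worth asking, most likely holder first.

        Spools whose retention window has already passed for this article are
        pushed to the end rather than dropped, since expire runs are never exact."""
        def score(spool):
            wrong_class = article_class not in spool["classes"]
            likely_expired = article_age_days > spool["retention_days"]
            # Among plausible holders, try the tightest retention first and keep
            # the long-retention spool as the fallback for the hard cases.
            return (wrong_class, likely_expired, spool["retention_days"])
        return [s["host"] for s in sorted(SPOOLS, key=score)]

    # spool_order("binaries", 10) -> bin1 first (still in retention), then bin2, then text.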
-
Other distributed news architectures
-
Matt's model - ISP buys caching dreaderd and gets head feed
-
Resource sharing model - ISP's of similar size share resources
for redundancy
-
Alternative fetch model - ISP uses some other source for the
binaries articles - maybe an outsourcer for older articles
-
Future news directions
-
Thoughts on a caching news proxy
-
Questions
Clever possibilities: get a Supernews account and use Supernews as your
long-term binaries backing store, thanks to the Message-ID retrieval methodology.
Or share with another ISP, or leech off your connectivity provider.