Statistical n-gram language models are traditionally developed using perplexity as a measure of goodness. However, perplexity often demonstrates a poor correlation with recognition improvements, mainly because it fails to account for the acoustic confusability between words and for search errors in the recognizer. In this paper, we study alternatives to perplexity for predicting language model performance, including other global features as well as a new approach that predicts, with a high correlation (0.96), performance differences associated with localized changes in language models given a recognition system. Experiments focus on the problem of augmenting in-domain Switchboard text with out-of-domain text from the Wall Street Journal and Broadcast News that differs in both style and content from the in-domain data.