The pioneering works in Agent-Based Modeling (ABM), notably Schelling (1969) and Epstein and Axtell (1996), introduced the method as a way of testing hypotheses in "complex thought experiments" (Cederman 1997, 55). Although purely theoretical experiments can be important, the empirical orientation of the social sciences demands that the gap between modeled "thought experiments" and empirical data be as narrow as possible. Ideally, the underlying theory of a real-world process would be tested directly against empirical data, according to commonly accepted technical and methodological standards. This paper presents a possible procedure for narrowing the gap between theoretical assumptions and empirical data. It consists of a two-stage process in which a model is first optimized and then critically reviewed, both quantitatively and qualitatively. The procedure systematically improves the model's performance until the inherent limitations of the underlying theory become evident. The reference model simulates air traffic movements in the approach area of JFK International Airport in New York, a phenomenon chosen because it offers a sufficiently complex testbed for evaluating an empirical ABM. The congruence between model and reality is expressed in simple distance measurements and contrasted visually in Google Earth. Context knowledge about the driving forces behind controlled approaches is combined with genetic optimization techniques to optimize the results within the range of the underlying theory. The repeated evaluation of the model's "fitness", defined as its ability to hit a set of empirical data points, serves as a feedback mechanism that corrects its parameter settings. The paper demonstrates the successful application of this approach and suggests that the procedure could be applied to other domains.
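The feedback loop described above, in which fitness is measured as distance to empirical data points and used to correct parameter settings, can be illustrated with a minimal sketch. It is not the implementation used in this paper: the toy straight-line `simulate` function stands in for the air traffic ABM, and all names and settings (`simulate`, `mean_distance`, `fitness`, `evolve`, the population size, the mutation width) are assumptions chosen for illustration only.

```python
import math
import random


def simulate(params, n_steps):
    """Hypothetical stand-in for the ABM: one agent flying a straight line."""
    heading, speed = params
    return [
        (speed * t * math.cos(heading), speed * t * math.sin(heading))
        for t in range(n_steps)
    ]


def mean_distance(track, observed):
    """Average Euclidean distance between simulated and observed points."""
    return sum(math.dist(p, q) for p, q in zip(track, observed)) / len(observed)


def fitness(params, observed):
    """Higher is better: the negated mean distance to the empirical points."""
    return -mean_distance(simulate(params, len(observed)), observed)


def evolve(observed, pop_size=40, generations=60, seed=0):
    """Simple elitist genetic search over (heading, speed)."""
    rng = random.Random(seed)
    population = [
        (rng.uniform(0.0, 2.0 * math.pi), rng.uniform(0.0, 5.0))
        for _ in range(pop_size)
    ]
    history = []  # best fitness at the start of each generation
    for _ in range(generations):
        population.sort(key=lambda p: fitness(p, observed), reverse=True)
        history.append(fitness(population[0], observed))
        elite = population[: pop_size // 4]  # survivors are kept unchanged
        children = []
        while len(elite) + len(children) < pop_size:
            mother, father = rng.sample(elite, 2)
            # Uniform crossover per gene, followed by Gaussian mutation.
            child = tuple(
                rng.choice(genes) + rng.gauss(0.0, 0.05)
                for genes in zip(mother, father)
            )
            children.append(child)
        population = elite + children
    best = max(population, key=lambda p: fitness(p, observed))
    return best, history
```

Calling `evolve(observed)` on a list of observed `(x, y)` points returns the best parameter pair found and the best fitness per generation. Because the elite is carried over unchanged, that fitness never decreases from one generation to the next, which is the corrective feedback the procedure relies on. A real application would replace `simulate` with the agent-based model and `observed` with recorded flight positions.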