Visual cues from a speaker's face influence the perception of speech. One example of this influence is the McGurk effect, in which illusory (cross-modal) sounds are perceived following the presentation of incongruent audio-visual (AV) stimuli. Previous studies report the engagement of specific, spatially distributed cortical modules during cross-modal perception. However, the limits of the underlying representational space and the cortical network mechanisms remain unclear. In this combined psychophysical and electroencephalography (EEG) study, participants reported their perception while listening to a set of synchronous and asynchronous incongruent AV stimuli. We identified the neural representation of subjective cross-modal perception at different organizational levels: at specific locations in sensor space and at the level of the large-scale brain network estimated from between-sensor interactions. Cross-modal perception was associated with an enhanced positivity in the event-related potential peaking around 300 ms after stimulus onset. At the spectral level, cross-modal perception involved an overall decrease in power over frontal and temporal regions across multiple frequency bands and at all AV lags, along with increased power over the occipital scalp region for synchronous AV stimuli. At the level of large-scale neuronal networks, enhanced functional connectivity in the gamma band involving frontal regions served as a marker of AV integration. Thus, in a single study we show that both segregation of information processing at individual brain locations and integration of information across candidate brain networks underlie multisensory speech perception.