Geographic coordinate validation and assignment using an edge-constrained layout

Electric grids with buses that are mapped to geographic latitude and longitude are useful for a growing number of applications, such as data visualization, geomagnetically induced current calculations, and multi-energy coupled infrastructure simulations. This paper presents a methodology for validating the quality of geographic coordinates for a power system model and for assigning coordinates to buses with missing or low-quality coordinates. This method takes advantage of geographic indicators already intrinsic to a grid model, such as branch length as implied by impedance and susceptance parameters. The coordinate assignment process uses an approach inspired by graph drawing that lays out the vertices (buses) and edges (transmission lines), formulated as a nonlinear programming problem with soft edge length constraints. The layout method is computationally fast and scalable to large power system cases. The method is demonstrated in this paper using a 37-bus test case and a 6717-bus test case, both publicly available, along with a large actual grid model. The results show that, for cases with only a few errors in the coordinates, cases with no coordinates known beforehand, and others in between, this method is able to assign reasonable geographic coordinates to best match known data about the grid.


Introduction
Geographic coordinates are not directly necessary to solve power flow, optimal power flow, and transient stability simulations on electric power grids. Hence, for the sake of simplicity and economy of data storage, power flow cases have traditionally not contained information about the latitude and longitude of the physical substations in which the electrical buses are located.
In recent times, however, there is a growing trend toward more cases including geographic coordinates, for several reasons. First, data visualization: geography is a natural starting point for representing a power system in a single-line diagram, or showing other data that varies over a wide area [1][2][3] (though not the only way [4]). Second, geomagnetically induced current (GIC) analysis, such as for geomagnetic disturbances (GMD) and electromagnetic pulse (EMP), requires geography to compute the impact of wide-area electric fields [5][6][7]. Third, geographic embedding of power flow cases opens up opportunities to coordinate the analysis with other colocated information, such as locational weather data (especially cloud coverage and wind speeds for renewable integration) [8,9], communications networks [10], natural gas pipeline networks [11], transportation [12], and water [13]. As a result, some regions in the North American Electric Reliability Corporation (NERC) require submitting geographic coordinates for some applications related to network planning [14,15].
A common challenge in wide-area electric transmission grid analysis is that, in many cases, engineers and researchers do not have a readily available mapping of buses to high-quality geographic coordinates. In some cases, partial coordinates exist for a region of the case, or for the highest voltage network. In other cases, the coordinates are given at a very low level of precision. At a minimum, there are usually some buses in a system for which the coordinates are not given or are incorrect. If not properly considered, flagged, and if possible corrected, errors will propagate into analysis methods such as GIC calculations, leading to wrong conclusions, and data visualization may be misleading or simply look wrong.
In this paper, we present a method to evaluate and improve the quality of a geographic embedding for an electric transmission system dataset, using information already intrinsic to the power flow case. The first part of this work presents validation metrics to assess a set of geographic coordinates, whether estimated from an algorithm or provided in advance. The metrics show ways in which the power flow data (specifically the transmission line and other branch parameters that indicate the length of the line) are or are not consistent with the given geographic coordinates. These metrics allow for assessing not only the quality of the geographic embedding as a whole but also the flagging of specific substation and line data that may contain errors.
The second, related contribution of this paper is that we introduce a graph-layout-based methodology to assign new geographic coordinates to some or all buses that will match the underlying case data and satisfy the validation metrics well. This task is formulated as a nonlinear optimization problem with soft constraints that can be solved with an interior point solver quickly even for large systems. This method can apply whether most buses already have assumed coordinates or no coordinates are known at all, or anywhere in between. The paper demonstrates the effectiveness of these methods on example cases for different scenarios up to 6717 buses.
There are a number of potential benefits of this work. First, it can aid error detection and correction in network planning and the development of power flow cases. This would include correcting substation mapping and geographic coordinates, but in some cases could include updates to the power flow data itself if, for example, a line's shunt susceptance was flagged as too large for its (correct) geographic length. Second, in cases where low-quality or limited geographic information is available (or at least where some buses are not mapped), this method provides a quick way to get coordinates estimated for the rest of the buses, to allow for analysis that requires geography such as GIC simulations. Third, it supports the creation of better one-line diagrams and other data visualization, even for cases that have no pre-assigned geographic coordinates, by creating an initial set of coordinates that approximate the underlying actual geographic coordinates in the sense that they are consistent with the power flow data.
The outline of this paper is as follows. After a background survey of related work ("Background" section), the proposed methodology is presented in two parts: first, the framework for validating coordinates and identifying errors ("Assessing the quality of geographic mappings for electric grid cases" section), and then the optimization-based algorithm for laying out bus coordinates optimally ("Length-constrained graph layout" section). Demonstrations of applying the method to a variety of realistic scenarios and showing the method's effectiveness are given ("Results and discussion" section), wrapping up with some concluding thoughts ("Conclusions" section).

Background
Digital geo-mapping of energy infrastructure data has its origins in the mid-twentieth century with advances in computational power and the advent of geographic information systems (GIS) [16,17]. As has already been mentioned, interest in GIS for power systems includes the ability to correlate with other geo-mapped data [8][9][10][11][12][13], which has implications for planning and operations, particularly with the growth of distributed generation (as in [18]) and increased attention to recovery from natural disasters [19].
Many of the efforts in the power system literature toward validating and improving geographic coordinates have been relatively recent and targeted at the distribution level. In [20], the researchers take advantage of a larger volume of smart meter data and the observation that voltages will be correlated with GIS information. Similar to other efforts to identify network topology and load phase connections, the voltage patterns over time can show errors in GIS data. In [21], graph theory processes are used to detect GIS errors for distribution, and similarly, in [22], the objective is to find errors in GIS data at the distribution level (particularly at the secondary level near the customers), using analysis that considers GIS data, network topology data, and customer physical addresses with a clustering-based procedure. A related effort applies image processing algorithms to improve system mapping [23].
A strong motivation for better transmission-level geographic coordinate mapping is improved visualization. Good diagrams augment system analysis data with a visual context and help engineers and others better diagnose problems and communicate results effectively. Graph drawing, generically, is a challenging problem, because in order to represent a network of edge-vertex relationships on a two-dimensional plane, the vertices must be assigned Cartesian coordinates that match a number of visual metrics that are often directly in tension [24][25][26]. For large-scale graphs, one family of visualization techniques is the force-directed approach, where graphs are modeled as physical systems with spring attraction along edges and electrostatic repulsion between neighboring vertices [27]. Another technique in generic graph drawing at a large scale is modeling the system with a hierarchical structure [28]. With specific reference to electric grids, extensive effort has gone into automating transmission system visualization, with some of the earliest work recognizing that a unique aspect is that there are local substation diagrams that are then connected over a wide area [29]. Automatic network layout algorithms include [30][31][32][33][34] and the author's work in [1], which use a variety of methods but typically involve either a modified force-directed approach, geographic coordinates as a baseline, or both. Work that specifically looks at visual quality without regard to geographic layout includes [35,36]. These layout methods can apply not only to network diagrams but also to visualization of other datasets, as in mosaic tile displays [37,38]. In addition, recent work has advanced automatic network layouts using parallel fast methods [39] and linear programming [40].
The ubiquitous IEEE test cases did not originally contain geographic coordinates [41], although at least one recent effort has assigned coordinates to the RTS-96 case after the fact [42]. In more recent development of electric transmission system test cases, there has been more of a focus on including such coordinates. Although much actual power system data is not available for public release due to its designation by the US Federal Energy Regulatory Commission (FERC) as Critical Energy Infrastructure Information (CEII) [43], some information merely about the location of critical infrastructure is more widely accessible, such as the location of generators greater than 1 MW from the EIA form 860 [44]. New public test cases that include geographic information include the 20-bus GIC test case [45], recent large-scale synthetic grids [46,47], and the California test system targeted at extreme weather and wildfire studies [48].

Assessing the quality of geographic mappings for electric grid cases
How accurate geographic coordinates need to be in an electric grid case, and the consequences of inaccuracies, depend on the application. For wide-area system network visualization, small errors in mapping or location will not be noticeable. In fact, a bit of distortion may be introduced intentionally to show the electric structure more clearly. But any substations not mapped, or those mapped drastically wrong, will be either missing from the diagram or a distraction making it appear as if transmission lines are cutting thousands of miles across a case. For visualization applications in conjunction with other infrastructure and particularly with satellite or mapping datasets, a higher level of precision will be required.
GIC applications have a very similar pattern. Given the amount of uncertainty in the input data to GIC studies (see [49]), high levels of coordinate precision are not required. However, because the GIC levels are highly related to the length of the line, it is important to keep the geographic length of the line relatively consistent with the line's actual length and in the same general region. Uncaught major mapping errors can unintentionally inject large currents into the network and throw off the results.
Before continuing, one note should be made about geographic projections. Coordinates are usually given as degrees of latitude and longitude, which correspond to locations on the spheroid representing Earth. For the purposes of this paper, such coordinates are converted to planar Cartesian coordinates using the Universal Transverse Mercator (UTM) projection [50,51], so that they are given as (x, y), where x is the "easting" in meters and y is the "northing" in meters. This allows direct calculation of distances. Across the scale of even the largest grids, errors in using UTM are significantly smaller than the typical size of a substation, not to mention the other sources of uncertainty in power system data. In the "Length-constrained graph layout" section, coordinates are generated in (x, y) and are then projected back into latitude and longitude by the inverse UTM projection. This is merely a choice of convenience; other projections such as state plane coordinate systems could be used as well. However, using latitude and longitude as if they were Cartesian coordinates is not appropriate because the distance metrics would be invalid.
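To illustrate why raw latitude and longitude should not be treated as Cartesian coordinates, the following minimal sketch compares the ground distance spanned by one degree of longitude at two latitudes. It uses the haversine formula on a spherical Earth with an assumed mean radius of 6371 km, which is a simplification and is not the UTM projection used in this paper; the function and variable names are illustrative only.

```python
import math

EARTH_RADIUS_KM = 6371.0  # assumed mean spherical radius (a simplification)


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


# One degree of longitude covers a different ground distance at each latitude,
# so a "distance" computed directly in degrees is not a consistent metric.
east_west_equator = haversine_km(0.0, -100.0, 0.0, -99.0)
east_west_lat30 = haversine_km(30.0, -100.0, 30.0, -99.0)
north_south_lat30 = haversine_km(30.0, -100.0, 31.0, -100.0)
```

Under these assumptions, one degree of longitude at 30° latitude is roughly 13% shorter on the ground than one degree of latitude, which is the kind of distortion a planar projection in meters avoids.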
The rest of this section outlines the validation analysis for a case with some given geographic coordinates. First, the analytical observations are given; then, they are quantified into metrics that are used to assign quality flags to various power system data. The quality flag integer variable q is defined such that q = 0 indicates zero confidence in the accuracy of the associated data, with higher values of q indicating better data. The maximum value of q is 3 for branches and 5 for buses.

Given geographic coordinates and substation mapping
Ordinarily, geographic coordinates are not assigned directly to buses, but buses are identified with an associated substation, which in turn has a geographic latitude and longitude (converted then to (x, y) for this analysis as described above). The first observation to be made in validating coordinates is that some coordinates can be immediately identified as incorrect. The validation process starts with finding the middle location of the system, defined as the median latitude and median longitude. All valid coordinates should be within a certain radius of that spot (depending on the known size of the system, say 1000 km). In particular, coordinates of (0, 0) are obviously missing. Any coordinates outside the acceptable range are marked with q = 0 from the beginning. In addition to this, another indicator is the apparent number of decimal places of precision in the data. While it is possible that a substation may be exactly located on a whole-number line of latitude and longitude, more probably this is an indication that the quality is low to begin with.
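The preliminary checks above can be sketched as follows. This is a minimal illustration, not the implementation used in this paper: it assumes the coordinates are supplied both in degrees and already projected to meters, it takes the system center as the median of the projected coordinates rather than of latitude and longitude, and the function name, the 1000 km default, and the sample data are hypothetical.

```python
import math
import statistics


def preliminary_flags(latlon, xy, radius_m=1_000_000.0):
    """Return (invalid, low_precision) sets of substation ids.

    latlon : {id: (lat_deg, lon_deg)} as given in the case
    xy     : {id: (x_m, y_m)} the same points after projection
    """
    # (0, 0) entries are treated as missing and get q = 0 immediately.
    invalid = {s for s, (lat, lon) in latlon.items() if lat == 0.0 and lon == 0.0}
    candidates = [s for s in latlon if s not in invalid]

    # Middle of the system, taken here as the median projected coordinate.
    cx = statistics.median(xy[s][0] for s in candidates)
    cy = statistics.median(xy[s][1] for s in candidates)
    for s in candidates:
        if math.hypot(xy[s][0] - cx, xy[s][1] - cy) > radius_m:
            invalid.add(s)

    # Whole-number latitude and longitude suggest low-precision data.
    low_precision = {
        s for s in candidates
        if s not in invalid
        and float(latlon[s][0]).is_integer()
        and float(latlon[s][1]).is_integer()
    }
    return invalid, low_precision


# Hypothetical substations: C is unmapped, E is far outside the system,
# and D is given only to the nearest whole degree.
latlon = {"A": (30.2672, -97.7431), "B": (29.4241, -98.4936), "C": (0.0, 0.0),
          "D": (31.0, -97.0), "E": (45.5017, -73.5673)}
xy = {"A": (620e3, 3349e3), "B": (549e3, 3255e3), "C": (0.0, 0.0),
      "D": (690e3, 3431e3), "E": (2600e3, 5300e3)}
invalid, low_precision = preliminary_flags(latlon, xy)
```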
Once these preliminary indicators have been assessed, the attention of the validation method turns to the network branches and their apparent length, $\ell = \sqrt{(x_1 - x_2)^2 + (y_1 - y_2)^2}$ (where the two buses are located at $(x_1, y_1)$ and $(x_2, y_2)$). Ultimately, the validation of the bus geographic coordinates (absent other data like checking satellite imagery) is dependent upon the bus's relationship to other known coordinates via the branches. Buses not connected to any branches cannot be further validated, but these play little role in the system.

Network branches that are not transmission lines
Most branches in a bus-branch power system model represent either lines or transformers. Transformers are often directly labeled as such, but if not they can be quickly identified as those branches which connect buses that are labeled with different nominal voltage levels. Transformers ought to have a length of essentially zero (in some cases a transformer might include a small line to a neighboring substation and may have a short distance associated with it). There are also a (usually small) number of branches which do not represent transformers but also are not ordinary transmission lines. Although they connect two buses of the same voltage level, they are short connections within a substation and hence should also have a length of essentially zero. These can sometimes be difficult to distinguish from actual transmission lines, but can usually be identified by relatively low reactance (X), in many cases zero resistance (R), and in essentially all cases, zero shunt susceptance (B). If a case has been created using a Ward equivalent or similar approach from a larger case, there may be equivalent lines modeled. In these cases, there should be no expectation of the parameters corresponding to the geographic separation between the buses for equivalent lines. Unusual values like negative series impedances can be obtained through equivalencing. For the purposes of this analysis, the goal is to identify and ignore equivalent lines, as they do not provide any insight into the geographic accuracy of bus coordinates. Often equivalent lines are marked, for example by a circuit identifier of "EQ" or "99". Even if not, one typical feature is a large positive or negative series impedance with zero shunt susceptance (B).
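A rough sketch of how these branch categories might be separated is given below. The text gives no numeric cutoffs for "relatively low" or "large" reactance, so the thresholds `x_small` and `x_large`, along with the function name and argument layout, are purely hypothetical placeholders that would need tuning for a real case.

```python
def classify_branch(kv_from, kv_to, r_pu, x_pu, b_pu, circuit_id="1",
                    x_small=1e-3, x_large=1.0):
    """Return 'equivalent', 'transformer', 'internal', or 'line'.

    x_small and x_large are hypothetical per-unit thresholds, not values
    taken from the paper.
    """
    # Explicitly marked equivalent lines.
    if circuit_id.strip().upper() in {"EQ", "99"}:
        return "equivalent"
    # Branches joining different nominal voltage levels are transformers.
    if kv_from != kv_to:
        return "transformer"
    if b_pu == 0.0:
        # Negative or very large series reactance with no shunt susceptance
        # is a typical signature of equivalencing.
        if x_pu < 0.0 or abs(x_pu) >= x_large:
            return "equivalent"
        # Very low reactance with no shunt susceptance suggests a short
        # connection inside a substation.
        if abs(x_pu) <= x_small:
            return "internal"
    return "line"
```

In this sketch only branches classified as 'line' would carry an expected length derived from their parameters; transformers and internal branches would be given an expected length of zero, and equivalent lines would be ignored.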

Transmission lines
Transmission lines are the remaining branches, specified by per-unit reactance X, resistance R, and shunt susceptance B. Transmission lines will have a physical length to which the parameters X, R, and B all correspond. Prior work [52] has surveyed actual North American electric grids and provided a starting point for the expected per-unit, per-distance parameter values for various categories of voltage levels. The crucial part of the corresponding table is repeated in Table 1. While the range is relatively wide (due in part to variations in line construction), this regularizing data can help to flag obviously invalid coordinates.
But the main way to know a line's length L is via the propagation time τ, if the values for X and B are relatively accurately known.
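This relationship can be written in a form consistent with the definitions that follow. The expression below is a reconstruction under stated assumptions, namely a lossless line and the nominal angular frequency $\omega = 2\pi f$, with total line inductance $L_{\text{line}} = X/\omega$ and total line capacitance $C_{\text{line}} = B/\omega$:

```latex
\tau = \sqrt{L_{\text{line}}\, C_{\text{line}}} = \frac{\sqrt{X B}}{\omega},
\qquad
L = v_{\text{prop}}\, \tau \approx \frac{c\, \sqrt{X B}}{2 \pi f}
```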
where we assume that the line's propagation speed $v_{\text{prop}}$ is very near the speed of light $c = 3.0 \times 10^8$ m/s. (The inductance and capacitance, $L_{\text{line}}$ and $C_{\text{line}}$, are marked with subscripts to distinguish them from the length L.) Note that there is no need to convert from per-unit X and B in this equation, as the base values for impedance and admittance will cancel.
A crucial caveat in calculating transmission line length L and comparing it to the straight-line distance between the two buses is that transmission lines do not in general follow a straight-line path. The actual length L is always at least as long as $\ell = \sqrt{(x_1 - x_2)^2 + (y_1 - y_2)^2}$. So this analysis provides at least a maximum value for the distance between two buses that are connected with a transmission line. Given that transmission lines are often run as straight as reasonably possible, subject to constraints associated with geographic features and right-of-way access, buses that are too close together can be identified as well.
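As a numerical illustration of the length estimate and its comparison with the straight-line distance, the sketch below assumes the lossless-line relation L ≈ c·sqrt(XB)/(2πf) with a nominal frequency of 60 Hz. The frequency, the function names, and the sample parameter values are assumptions for illustration and are not taken from this paper.

```python
import math

C_LIGHT = 3.0e8  # m/s, propagation speed assumed equal to the speed of light


def line_length_m(x_pu, b_pu, freq_hz=60.0):
    """Estimate line length in meters from per-unit series X and shunt B."""
    # Propagation time in seconds; per-unit bases cancel in the product X * B.
    tau = math.sqrt(x_pu * b_pu) / (2.0 * math.pi * freq_hz)
    return C_LIGHT * tau


def straight_line_m(x1, y1, x2, y2):
    """Straight-line separation of two buses given in projected meters."""
    return math.hypot(x1 - x2, y1 - y2)


# Hypothetical line parameters and bus positions.
length = line_length_m(0.05, 0.5)
ell = straight_line_m(0.0, 0.0, 60_000.0, 80_000.0)
ratio = ell / length  # straight-line distance relative to expected length
```

A ratio somewhat below 1.0 is what would be expected for a real line that does not follow a perfectly straight path, whereas a ratio well above 1.0 would indicate that the coordinates or the line parameters are inconsistent.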

Quantifying bus coordinate quality
The following heuristic rules were put into place for branch validation related to ℓ, the distance between the substations, and L, the expected length of the line as determined by parameters. (See prior sections. For example, transformers have L = 0.) Any lines that do not fit into the following categories are given q = 0.
These thresholds are heuristic and come from observations in the quality of real datasets. The 1 km threshold is the threshold below which quantifying the length and distinguishing from internal substation branches becomes more challenging, so these lines merely target the two buses being separated by a small distance ℓ. The threshold 40 km separates shorter length lines, where the line length could easily be double the straight-line distance, from longer ones that will tend to be straighter as a whole. The 3000 km threshold helps to eliminate unrealistically long lines, regardless of whether they match the line parameters. Notice that more room is given for the straight-line to actual length ratio to be less than 1.0 than greater than 1.0.
Next, each bus is assigned a coordinate quality flag based on the branches connected to it (except any buses which are set automatically to q = 0 via the criteria in the "Given geographic coordinates and substation mapping" section). First, buses are grouped into sets that are connected by branches where L ≤ 1 km and ℓ ≤ 2 km. Within a group (either a single substation or a cluster of very close nearby substations), the quality flags for all other connected branches are considered. The quality flag for the buses in that group is then set to the median value of the quality flags in the group, plus 1. The reason for using a median metric is that even one branch with q = 3 indicates that the spacing between a substation and at least one neighbor is within a good range. If q = 3 for every branch connected to the group, the group is set to q = 5.
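The grouping and flag assignment just described can be sketched with a small union-find structure. This is an illustrative sketch only: the tuple layout, the function name, the use of the lower median when a group has an even number of branches, and the choice of q = 0 for a group with no other connected branches are assumptions that the text does not specify.

```python
import statistics


def bus_quality(buses, branches, preset_zero=()):
    """Assign a quality flag to each bus.

    branches: list of (i, j, L_km, ell_km, q_branch) tuples.
    preset_zero: buses already set to q = 0 by the preliminary checks.
    """
    parent = {b: b for b in buses}

    def find(b):
        while parent[b] != b:
            parent[b] = parent[parent[b]]  # path compression
            b = parent[b]
        return b

    def is_short(L_km, ell_km):
        return L_km <= 1.0 and ell_km <= 2.0

    # Group buses joined by very short branches.
    for i, j, L_km, ell_km, _ in branches:
        if is_short(L_km, ell_km):
            parent[find(i)] = find(j)

    # Collect the flags of every other branch touching each group.
    flags = {}
    for i, j, L_km, ell_km, q in branches:
        if not is_short(L_km, ell_km):
            for g in {find(i), find(j)}:
                flags.setdefault(g, []).append(q)

    result = {}
    for b in buses:
        qs = flags.get(find(b), [])
        if b in preset_zero or not qs:
            result[b] = 0  # assumption: nothing to validate against
        elif all(q == 3 for q in qs):
            result[b] = 5
        else:
            result[b] = statistics.median_low(qs) + 1  # assumption: lower median
    return result


# Hypothetical five-bus example; buses 1 and 2 share a substation.
example = bus_quality(
    [1, 2, 3, 4, 5],
    [(1, 2, 0.0, 0.0, 3), (2, 3, 50.0, 45.0, 3), (1, 4, 30.0, 25.0, 3),
     (3, 4, 20.0, 60.0, 0), (3, 5, 10.0, 9.0, 2)],
)
```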

Length-constrained graph layout
The goal of the layout algorithm is to assign geographic coordinates (x, y) to all the buses in a power flow case, given input case data and potentially some or all input geographic coordinate data, taking into account the quality flags described in the previous section (together with any a priori knowledge about which coordinates or other data are more reliable). The method must be generally applicable and computationally feasible even for very large systems because the target application is engineers needing to assign or clean up coordinates before making a visual diagram or running a GIC or infrastructure study.
We structure this problem as a graph drawing problem, where there is an assumed graph topology (branches connecting buses). Broadly, four assumptions underpin the framework of our formulation:

Feasibility
The system this graph represents is a physical system, so there exists some correct set of geographic coordinates $(x_i, y_i)$ for bus $i$ that satisfies all legitimate branch constraints. This solution may not be unique, and it might not be essential to reach it exactly.

Regularization
Some or all buses have a guess for their geographic coordinate $(x_i, y_i)$ with some confidence $c_i$ (for bus $i$). Some buses have unknown locations, for which $c_i = 0$. If starting with no coordinates, pick one anchor bus and give it arbitrary coordinates.

Edge length constraints
Branches (edges) have an expected length $L_{ij}$, which the separation $\ell_{ij}$ between buses $i$ and $j$ should ideally approximate. Lines also have a scalar confidence level $c_{ij}$, which could be zero, for example, for equivalent lines. The length $L_{ij}$ for transmission lines could be set to 80% of the line right-of-way path distance, to better represent the range of values $\ell$ could take.

Spread out
Subject to other constraints, the graph layout is spread out. Adjacent edges emanating from a bus have maximal angle separation. Buses that are far from each other by traversing the graph should also be far from each other spatially.
These assumptions are well-suited to formulation as a nonlinear programming problem with soft constraints. First, define the objective portion $z_i$ for any node $i$ in the set of nodes, $i \in N$.
Next, for any branch $(i, j)$ in the set of actual branches, $(i, j) \in E_1$, define the objective portion $z_{ij}$ based on deviation in distance.
The fourth assumption ("Spread out") is approached in two parts: local spread out and global spread out. Each results in creating a new set of edges. For local spread out we define new branches $(i, j) \in E_2$ that are second neighbors in the graph with edges $E_1$. That is, two buses $i$ and $j$ are connected in $E_2$ if there is some bus $k$ such that $(i, k) \in E_1$ and $(j, k) \in E_1$ but $(i, j) \notin E_1$. The local spread parameter $a_{ij}$ simply uses the length, with a single local spread scaling factor $\alpha$ for the whole system.
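The construction of the second-neighbor edge set can be sketched directly from this definition. The function name and the representation of edges as unordered pairs stored with the smaller label first are illustrative assumptions.

```python
from collections import defaultdict
from itertools import combinations


def second_neighbor_edges(e1):
    """Return E2: pairs (i, j) with i < j that share a neighbor in E1
    but are not themselves connected in E1."""
    adj = defaultdict(set)
    for i, j in e1:
        adj[i].add(j)
        adj[j].add(i)

    e2 = set()
    for k in list(adj):
        # Every pair of neighbors of k is a candidate second-neighbor pair.
        for i, j in combinations(sorted(adj[k]), 2):
            if j not in adj[i]:
                e2.add((i, j))
    return e2


# Hypothetical four-bus example: bus 2 connects to 1, 3, and 4; 3 and 4 are adjacent.
e2_example = second_neighbor_edges([(1, 2), (2, 3), (3, 4), (2, 4)])
```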
Similarly, the global spread-out assumption is handled with a third set of edges $E_3$. These edges are formed by recursive binary partition of the graph associated with $E_1$.
The binary partition works as follows. In each recursive iteration, consider the node set $N_k$. If there are fewer than 5 nodes in the set, return. Otherwise, select the two extreme points of the graph, $i$ and $j$, which are the two nodes separated by the longest path length $l_{ij}$ along the length of the graph (found or approximated using Dijkstra's algorithm). Add $(i, j)$ to $E_3$, then partition the buses in $N_k$ into two subsets, $N_{ki}$ for buses closer to $i$ along the length of the graph (again using Dijkstra) and $N_{kj}$ for buses closer to $j$. Recursively repeat for each subset until completed.
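A compact sketch of this recursive partition is shown below. Several choices are assumptions made for illustration: the extreme pair is approximated with a two-sweep heuristic (farthest node from an arbitrary start, then farthest node from that one), shortest paths are computed within the current subset only, ties in distance go to node i, and the function names and adjacency format are hypothetical.

```python
import heapq


def dijkstra(adj, source, nodes):
    """Shortest path lengths from source within the subgraph induced by nodes.

    adj: {node: [(neighbor, length), ...]}
    """
    dist = {source: 0.0}
    heap = [(0.0, source)]
    while heap:
        d, u = heapq.heappop(heap)
        if d > dist[u]:
            continue  # stale heap entry
        for v, w in adj[u]:
            if v in nodes and d + w < dist.get(v, float("inf")):
                dist[v] = d + w
                heapq.heappush(heap, (d + w, v))
    return dist


def global_spread_edges(adj, nodes, e3=None, min_size=5):
    """Recursively bisect the node set, recording each extreme pair in e3."""
    e3 = {} if e3 is None else e3
    if len(nodes) < min_size:
        return e3
    # Two-sweep approximation of the two most distant nodes.
    d0 = dijkstra(adj, min(nodes), nodes)
    i = max(d0, key=d0.get)
    di = dijkstra(adj, i, nodes)
    j = max(di, key=di.get)
    if i == j:
        return e3  # subset has no internal paths to split along
    dj = dijkstra(adj, j, nodes)
    e3[(min(i, j), max(i, j))] = di[j]  # store the path length l_ij
    inf = float("inf")
    near_i = {u for u in nodes if di.get(u, inf) <= dj.get(u, inf)}
    global_spread_edges(adj, near_i, e3, min_size)
    global_spread_edges(adj, nodes - near_i, e3, min_size)
    return e3


# Hypothetical example: an 11-bus path graph with unit branch lengths.
path_adj = {n: [] for n in range(11)}
for n in range(10):
    path_adj[n].append((n + 1, 1.0))
    path_adj[n + 1].append((n, 1.0))
e3_example = global_spread_edges(path_adj, set(range(11)))
```

In this sketch each recorded pair carries its graph path length, so pairs found higher in the recursion naturally receive larger values.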
With the partition done, create objective components $b_{ij}$ for $(i, j) \in E_3$ proportional to both the length $l_{ij}$ and a system-wide global spreading factor $\beta$.
The reason these global scaling factors are proportional to length is that the partitions higher on the binary tree should spread out further (for example, the first pair will be the two furthest nodes on the graph).
With these pieces in place, the final optimization problem can be formulated with no hard constraints in a "subject to" clause. Hence we reduce the initial constrained coordinate assignment problem to an unconstrained problem, optimizing over the control variables $x_i$ and $y_i$ (the horizontal and vertical positions of all nodes $i \in N$). The objective function minimizes two functions, separation from reference coordinates ($z_i$) and deviation from expected edge length ($z_{ij}$), defined above and weighted by the parameters $c$. Simultaneously, the objective function seeks to maximize both local and global spreading with the $a$ and $b$ functions, also defined above. This unconstrained, continuously differentiable problem is well suited to a standard nonlinear optimizer such as IPOPT, which is used in the implementation described in the "Results and discussion" section.
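One form of the problem consistent with this description is sketched below. The exact expressions for $z_i$, $z_{ij}$, $a_{ij}$, and $b_{ij}$ are assumptions inferred from the surrounding text (quadratic regularization and edge length terms, spread terms linear in the separation, $\alpha$ in length units, $\beta$ unitless), with $(\hat{x}_i, \hat{y}_i)$ introduced here to denote the reference coordinates of bus $i$; it should not be read as a reproduction of the original equations.

```latex
\begin{aligned}
\ell_{ij} &= \sqrt{(x_i - x_j)^2 + (y_i - y_j)^2} \\
z_i &= (x_i - \hat{x}_i)^2 + (y_i - \hat{y}_i)^2, && i \in N \\
z_{ij} &= (\ell_{ij} - L_{ij})^2, && (i, j) \in E_1 \\
a_{ij} &= \alpha\, \ell_{ij}, && (i, j) \in E_2 \\
b_{ij} &= \beta\, l_{ij}\, \ell_{ij}, && (i, j) \in E_3 \\
\min_{x,\, y} \quad & \sum_{i \in N} c_i z_i
  + \sum_{(i,j) \in E_1} c_{ij} z_{ij}
  - \sum_{(i,j) \in E_2} a_{ij}
  - \sum_{(i,j) \in E_3} b_{ij}
\end{aligned}
```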
A few observations can be made about this formulation. First, it is tunable depending on the system and confidence levels in the different data. The parameters $c$ and $\beta$ are unitless, whereas $\alpha$ would have length units (like meters). Higher values of $\alpha$ and $\beta$ will cause the coordinates to spread out more, at the expense of deviating more from the known branch lengths. The second observation is that both the regularization and edge length terms are quadratic and tend to pull the system together, whereas both of the spread terms are linear and tend to push the system apart. Very broadly speaking, the user picks $\alpha$ and $\beta$ to establish a constant "force" that sets the tolerable deviation from a priori coordinates and branch lengths, analogously to force-directed graph layout methods. The third observation is that the system is quite sparse. None of the three edge sets will have a size much greater than the original number of branches. Unlike a typical force-directed graph layout method, there is no need to calculate the distances between every pair of points. This has the effect of keeping the computational complexity low.
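To make the interplay of the quadratic and linear terms concrete, the sketch below evaluates one assumed form of the objective for a candidate layout in plain Python. It is not the Pyomo and IPOPT implementation used in this paper; the functional forms (squared deviations for the pull terms, $\alpha \ell_{ij}$ and $\beta l_{ij} \ell_{ij}$ for the spread terms), the function name, and the data layout are assumptions for illustration.

```python
import math


def layout_objective(xy, ref, c_bus, e1, e2, e3, alpha, beta):
    """Evaluate an assumed soft-constraint objective for a candidate layout.

    xy, ref : {bus: (x, y)} candidate and reference coordinates in meters
    c_bus   : {bus: confidence}; 0 for buses with unknown location
    e1      : {(i, j): (L_ij, c_ij)} actual branches with expected length
    e2      : iterable of (i, j) second-neighbor pairs
    e3      : {(i, j): l_ij} partition pairs with their graph path length
    """
    def sep(i, j):
        return math.hypot(xy[i][0] - xy[j][0], xy[i][1] - xy[j][1])

    total = 0.0
    # Regularization: quadratic pull toward trusted reference coordinates.
    for i, c in c_bus.items():
        if c > 0.0:
            total += c * ((xy[i][0] - ref[i][0]) ** 2 + (xy[i][1] - ref[i][1]) ** 2)
    # Edge length: quadratic penalty on deviation from expected length.
    for (i, j), (length, c) in e1.items():
        total += c * (sep(i, j) - length) ** 2
    # Local and global spread: linear rewards for separation.
    for i, j in e2:
        total -= alpha * sep(i, j)
    for (i, j), path_len in e3.items():
        total -= beta * path_len * sep(i, j)
    return total


# Hypothetical three-bus chain A-B-C with only A anchored.
ref = {"A": (0.0, 0.0)}
c_bus = {"A": 10.0, "B": 0.0, "C": 0.0}
e1 = {("A", "B"): (1000.0, 1.0), ("B", "C"): (1000.0, 1.0)}
e2 = [("A", "C")]
straight = {"A": (0.0, 0.0), "B": (1000.0, 0.0), "C": (2000.0, 0.0)}
folded = {"A": (0.0, 0.0), "B": (1000.0, 0.0), "C": (0.0, 0.0)}
f_straight = layout_objective(straight, ref, c_bus, e1, e2, {}, alpha=1.0, beta=0.0)
f_folded = layout_objective(folded, ref, c_bus, e1, e2, {}, alpha=1.0, beta=0.0)
```

Both layouts satisfy the branch lengths exactly, but the straight layout has the lower objective value because the local spread term rewards separating the second neighbors A and C.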

Results and discussion
In this section, we demonstrate the ability of the geographic quality assessment method (shortened to GQA in this section) to identify errors in bus coordinates, and for the edge length-constrained graph layout (shortened to LCL for this section) to determine new coordinates that are of high quality in terms of consistency with the power flow data.
These methods were implemented and run on a laptop with an 11th Gen Intel i7 processor at 2.5 GHz clock speed and 64 GB RAM. The nonlinear optimization problem was formulated with the Pyomo platform and solved with the Interior Point Optimizer (IPOPT) [53,54].
A variety of test scenarios are used for the results in this section, with three main base grids:
1. Hawaii40. This 37-bus case is synthetic, built with an algorithm according to the methods described in [46] and [55]. It does not correspond to any actual grid or contain CEII, so its data is made available at [47]. It is geographically located on the Hawaiian island of Oahu. Since it is synthetic, it has ground truth coordinates that are consistent with the line parameters.
2. Texas7k. This 6717-bus case is also synthetic [46,55], geo-located on the portion of the U.S. state of Texas served by the Electric Reliability Council of Texas (ERCOT). Like the Hawaii40 case, its data is available at [47], it does not contain CEII, and it has ground truth latitude and longitude. As this case is a realistic size, variations on it are used for the majority of the results in this paper.
3. Grid3. This is an actual model of a portion of the electric grid located in North America, with about 5000 buses. It is used in the last section of results to verify the methodology against real data. Only high-level results can be given because the case contains CEII. It has a priori assumed coordinates, but they are not ground truth coordinates and there are some known issues with the data, which this algorithm is shown to address (see "Results for Grid3" section).
The scenarios for this analysis have been selected to mimic potential applications to real situations and to demonstrate the effectiveness of the GQA and LCL methods.

Missing coordinates in Hawaii40
The first examples are shown in the Hawaii40 case because it is small enough that individual nodes can be distinguished in the figures. The base case has ground truth coordinates which are known because of the way the synthetic grids are designed. In the design of the transmission lines (as described in [46]), the length is assumed to be 1.0 to 1.5 times the straight-line path between the substations, with parameters X and B set correspondingly depending on the tower design. So it is no surprise that the GQA process scores essentially all of the lines with a perfect quality flag of 3, and essentially all of the buses with a perfect quality flag of 5. There are a few exceptions for three lines added later without the correct process, which GQA flagged with q = 1.
The scenarios tested involved assuming that substation coordinate data was missing for a subset of the substations in the case. The GQA was run on the case with these missing coordinates, and then the LCL method was run to attempt to provide estimated coordinates for these substations, based only on the known line data and the relation to the remaining, correct substation coordinates. For the LCL algorithm in these cases, the branch confidence constants $c_{ij}$ were set to 1 for all branches, and the bus confidence constant $c_i$ was set to 10 for buses with q ≥ 1. The spread constraints were included with α = β = 0.001. After the LCL was complete, for each case the GQA was run again to check the improvement in coordinate quality.
Table 2 and Fig. 1 show the results for the base case and three scenarios: one with coordinates assumed to be missing for 4 substations, one with 8 missing, and one with 16 missing. In Table 2, each scenario is shown with two rows, before and after the LCL algorithm assigns new coordinates to the buses. In all cases, the algorithm manages to find coordinates for the buses such that all the lines (except the three with known data challenges) have lengths that reflect their parameters, and as a result, nearly all the buses have perfect quality flags q = 5. Figure 1 shows where these coordinates are set in each case. Of course, there are many possible, valid solutions as the hints from power flow variables do not uniquely specify the coordinates. In most cases, the estimated coordinate is quite near the ground truth coordinate, separated sometimes by just a degree of freedom such as flipping over an axis. For the purposes of visualization or GIC calculations, these estimated coordinates would be better than having no coordinates or very wrong coordinates.

Fixing coordinate mapping errors in Texas7k
For the next set of scenarios, Texas7k is used, which has a size more commensurate with actual electric grid models. Starting from the ground truth base case, varying levels of four different types of errors in the coordinate mapping are considered. First, for some substations, the coordinates are assumed to be unknown. Second, for some buses, the mapping is assumed to be wrong, so that the bus is assigned to a different substation's coordinates, potentially on the other side of the case. Third, for some substations, the coordinates are assumed to be slightly wrong, by adding a random error on the order of 1° latitude and longitude. Fourth, for some substations, the coordinates are assumed to be rounded to the nearest degree. Note that in all cases the algorithm does not know a priori which coordinates are correct or incorrect, but estimates this using GQA.
Table 3 and Fig. 2 show the results for scenarios in this case. Six cases were run, with error levels varying from 5% up to 30%. In each one, the buses selected to have errors and the type of error were chosen at random. In the LCL algorithm, the parameters were the same as for the Hawaii40 cases in the prior subsection, except that c_i for buses with q = 1, q = 2, and q = 3 were changed to 0.01, 0.1, and 1, respectively, to allow more freedom for the algorithm to improve these coordinates. These parameters always represent a trade-off between how much the existing coordinates are trusted and how strongly the power flow data indicate that the coordinates should be changed.
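The trade-off controlled by c_i can be illustrated with a toy objective. This is a sketch under stated assumptions and not the LCL formulation itself. It assumes a quadratic penalty on line-length mismatch plus a quadratic anchoring penalty weighted by c_i, in planar coordinates. The weights for q = 1, 2, and 3 are the ones given above, while the weights for q = 0, 4, and 5 are placeholders chosen only for illustration.

```python
import math

# Anchor weight c_i by quality flag q. Values for q = 1, 2, 3 come from the
# text; values for q = 0, 4, 5 are hypothetical placeholders.
ANCHOR_WEIGHT = {0: 0.0, 1: 0.01, 2: 0.1, 3: 1.0, 4: 10.0, 5: 100.0}

def objective(pos, anchor, q, edges):
    """Soft edge-length penalty plus quality-weighted anchoring penalty.

    pos:    {bus: (x, y)} candidate coordinates
    anchor: {bus: (x, y)} original coordinates (None if unknown)
    q:      {bus: quality flag}
    edges:  list of (bus_i, bus_j, expected_length)
    """
    total = 0.0
    for i, j, length in edges:
        total += (math.dist(pos[i], pos[j]) - length) ** 2
    for i, p in pos.items():
        if anchor.get(i) is not None:
            total += ANCHOR_WEIGHT[q[i]] * math.dist(p, anchor[i]) ** 2
    return total

# Bus B is recorded 5 units from A, but the line parameters imply 2 units
anchor = {"A": (0.0, 0.0), "B": (5.0, 0.0)}
edges = [("A", "B", 2.0)]
stay = dict(anchor)                        # keep the recorded coordinates
move = {"A": (0.0, 0.0), "B": (2.0, 0.0)}  # move B to match the line length

for q_b in (1, 5):
    q = {"A": 5, "B": q_b}
    print(q_b, objective(stay, anchor, q, edges), objective(move, anchor, q, edges))
```

With a low quality flag on B, moving the bus to match the line length lowers the objective, while with a high flag the anchoring term dominates and the recorded coordinate is kept.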
Figure 2 illustrates what the LCL algorithm is doing. The ground truth coordinates (gray) are modified to simulate errors. The erroneous coordinates cannot be shown in Fig. 2 without heavily cluttering the image, since lines would appear to crisscross the whole case and some substations are assumed to have no coordinates. However, the corrected coordinates (red) are shown and tend to match the original coordinates very well.
As shown in Table 3, the errors added at the various levels can be detected by GQA, with the number of buses with q ≤ 1 approximately equal to the predetermined percentage of errors. Then, with the coordinate estimation through LCL, the low-quality bus coordinates are greatly improved. Even with nearly 1/3 of buses incorrectly mapped, the LCL can find a solution with 94% of the buses having q = 4 or q = 5, and none with q ≤ 1.

Building coordinate sets from sparse starting points
Next, six scenarios are considered to emulate the conditions in which very little is known about a case's geographic context. Texas7k is used for these as well. First, we look at the condition in which a single area is missing from the case. Two such cases are considered: one with the South Central Area (Austin and San Antonio region) missing (NoSouthCent), and one with the South Area (Corpus Christi, Laredo, and Lower Rio Grande Valley region) missing (NoSouth). This is done so that both a missing central area and a missing edge area can be tested. Second, we look at the condition in which only one area is known, for the same two areas (OnlySouthCent, OnlySouth). We then look at the condition in which the extra high voltage (EHV) network (345 kV in this case) is known but the rest of the case needs to be inferred (OnlyEHV). Finally, we consider the case where none of the coordinates are known at all (AllUnknown). In solving these cases with the LCL, the spread-out parameters are more important and are set to α = 1 for all cases, with β = 0.01 for the area-missing cases and β = 0.3 for the other cases. These were the slowest cases to run computationally, but none of them took longer than 2 min.
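The six sparse-start scenarios can be expressed as a selection of which substations keep their coordinates. The sketch below is illustrative only. The record fields `area` and `max_kv`, the area labels, and the function names are hypothetical, while the scenario names and the α and β values are those stated above.

```python
AREA_LABEL = {"SouthCent": "South Central", "South": "South"}

def known_substations(subs, scenario):
    """Return the set of substation ids whose coordinates are kept.

    subs: {sub_id: {"area": str, "max_kv": float}} (hypothetical layout)
    """
    if scenario == "AllUnknown":
        return set()
    if scenario == "OnlyEHV":
        # Keep only substations on the 345 kV network
        return {s for s, d in subs.items() if d["max_kv"] >= 345.0}
    if scenario.startswith("Only"):
        area = AREA_LABEL[scenario[len("Only"):]]
        return {s for s, d in subs.items() if d["area"] == area}
    if scenario.startswith("No"):
        area = AREA_LABEL[scenario[len("No"):]]
        return {s for s, d in subs.items() if d["area"] != area}
    raise ValueError(f"unknown scenario: {scenario}")

def spread_parameters(scenario):
    """Return (alpha, beta) as stated in the text for each scenario group."""
    beta = 0.01 if scenario.startswith("No") else 0.3
    return 1.0, beta

# Made-up example records
subs = {
    "S1": {"area": "South Central", "max_kv": 345.0},
    "S2": {"area": "South", "max_kv": 138.0},
    "S3": {"area": "North", "max_kv": 345.0},
}
for name in ("NoSouthCent", "NoSouth", "OnlySouthCent",
             "OnlySouth", "OnlyEHV", "AllUnknown"):
    print(name, sorted(known_substations(subs, name)), spread_parameters(name))
```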
The results are shown in Table 4 and in Fig. 3. The two cases with only one area missing are similar to the prior subsection's cases with 10-15% of substation errors, except that the substations with unknown coordinates are all together in one region. Therefore, there is more deviation as a whole from the ground truth coordinates, as the top panel of Fig. 3 shows. Nevertheless, the overall shape of the missing area tends to match the actual coordinates fairly well, and the GQA results (Table 4) show a very high level of correlation between the estimated coordinates and the line parameters.
The results for the cases with only one area known have more deviation from the ground truth coordinates, since in these cases over 80% of the coordinates are unknown. However, thanks to the spreading mechanism, the different unknown areas still tend to separate from one another and form reasonable structures, as shown in the bottom panel of Fig. 3. Similarly, in the cases with only the 345 kV network known or with none of the case known, the LCL algorithm has to estimate coordinates largely or entirely from scratch, but it does find coordinates that result in relatively high quality flags for most buses. The main application of these scenarios would be quick data visualization of a case with no readily available coordinates.

Results for Grid3
This subsection presents some results from Grid3, an actual electric grid subsystem case located in North America with about 5000 buses. The base case coordinate set is of relatively high quality but is not fully ground truth, as there are some errors, missing coordinates, and some low-resolution coordinates. The GQA results are shown in Table 5 (note that exact values are not given, only percentages), with the q = 0 and q = 1 buses mainly being the ones with major errors or missing data, and the larger group of q = 2 buses (about 1/3 of all the buses) being in the regions for which low-resolution rounded data were all that was available. With LCL applied, the coordinates are greatly improved in their correspondence with the power flow data, with no buses in the q ≤ 1 region and only very few in the q ≤ 3 region.

Table 4 Texas7K results of GQA with sparse starts

For reference, two other scenarios were run on the Grid3 case. The first introduced additional intentional errors, much as in the "Fixing coordinate mapping errors in Texas7k" section, at the 10% level. These can be seen in the additional 10% of buses with q = 0 or q = 1. The second scenario is with all coordinates unknown, starting from scratch, as in the "Building coordinate sets from sparse starting points" section. By comparing Table 5 to Tables 3 and 4, it is clear that the performance of the GQA and LCL on a real case is comparable to the results from the synthetic cases, Hawaii40 and Texas7k.
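Percentage summaries of the kind reported in Table 5 can be computed from a list of per-bus quality flags. The sketch below assumes integer flags from 0 to 5; the function names and the sample flags are hypothetical and are not results from Grid3.

```python
from collections import Counter

def flag_shares(flags):
    """Percentage of buses at each quality flag q = 0..5."""
    counts = Counter(flags)
    n = len(flags)
    return {q: 100.0 * counts.get(q, 0) / n for q in range(6)}

def share_at_or_below(flags, threshold):
    """Percentage of buses with quality flag q <= threshold."""
    return 100.0 * sum(1 for q in flags if q <= threshold) / len(flags)

# Made-up flags for ten buses
flags = [0, 1, 2, 2, 5, 5, 5, 5, 4, 3]
print(flag_shares(flags))
print(share_at_or_below(flags, 1))
```

Reporting shares rather than counts allows cases of different sizes to be compared without disclosing exact bus counts.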

Conclusions
This paper addresses the problem that exists when a power system analysis task needs reasonable geographic coordinates for a grid and either (1) no such coordinates exist, or (2) some coordinates exist but others are missing, or (3) a set of coordinates exists but some are severely incorrect, or (4) a set of coordinates exists and is generally correct but the exact accuracy is not known. Given the importance of data visualization for large-scale power systems, the growing research efforts in preparing for potential GMD events, and the usefulness of applying other geo-mapped data in coordination with the electric grid, an automated process to create reasonable coordinates or fill in gaps can help to support further work in a number of areas. Of course, having actual known coordinates is always better if that is possible. But often such coordinates are not easy to obtain, or not easy to map to the buses in a given snapshot case, without extensive human labor. The work in this paper takes advantage of the fact that some geographic information is contained in the power flow data itself, particularly the branch impedance and susceptance parameters as indicators of branch length. By applying these constraints to a modified graph drawing algorithm, a general optimization-based approach can be used to estimate missing or incorrect coordinates. This algorithm is expected to perform well on many practical cases, since all actual cases are geographically embedded in reality. It is possible that some unusual configurations or highly complex cases might pose challenges to the algorithm, particularly tight downtown urban networks or fictitious test cases without a true geographic embedding. For practical implementation, the method is quite tolerant of data errors and does not require significant computational resources. The algorithm in this paper is scalable and fast, and results in reasonable coordinates that respect the expected line lengths while keeping any known accurate coordinates intact.

Methods
The purpose of the test cases described in the "Results and discussion" section was to evaluate the efficacy of the proposed geographic coordinate validation and length-constrained layout methods on large-scale, realistic power system network models. The study was designed with three test cases: Hawaii40, which has 40 buses, Texas7k, a

Page 9 of 20 Birchfield
Journal of Engineering and Applied Science (2024) 71:112

Fig. 1 Results of coordinate estimation with LCL in Hawaii40. Gray is ground truth, red is estimation. Bolder dots highlight the substations whose coordinates were assumed to be missing. Top: 4 substations with missing coordinates. Middle: 8 substations with missing coordinates. Bottom: 16 substations with missing coordinates

Fig. 2 Results of coordinate estimation with LCL in Texas7k, under the 15% error condition. Gray is ground truth, red is estimation. Bolder dots highlight the substations whose coordinates were assumed to have major errors. Top: full case view. Bottom: zoomed-in view of the far western area

Fig. 3 Results of coordinate estimation with LCL in Texas7k, under the sparse start conditions. Gray is ground truth, red is estimation. Top: NoSouth case, zoomed in for detailed view. Bottom: OnlySouthCent case, with the known area circled in blue

Table 3 Texas7K results of GQA with error correction

Table 5 Grid3 results of GQA