
Strongly-Typed methods within classes

Posted: Fri May 21, 2021 1:55 pm
by OhioJoe
Thank you, Dick. I figured out the same thing when I tried the overloaded method in VO. Yesterday I started this strong-typing-throughout-the-app project and found a surprising number of errors, mostly strings passed as numerics and vice versa. All were in little-used sections, which is probably why I never heard anything from end users. And because of the lack of strong typing, X# didn't notice the errors either.

So Best Practices #1 is: strongly type everything, if for no other reason than that you'll spot errors at compile time rather than at runtime.
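For instance (a minimal sketch, not code from this thread; the class, method and wording of the error are made up): with an untyped parameter the compiler has to accept anything, and the mistake only surfaces at runtime, while the strongly-typed version rejects the bad call at compile time.

    CLASS Customer
        // Untyped: the parameter defaults to USUAL, so a wrong
        // argument type is only discovered at runtime.
        METHOD SetBalanceLoose( nAmount )
            // ...
            RETURN SELF

        // Strongly typed: a wrong argument type is a compiler error.
        METHOD SetBalance( nAmount AS REAL8 ) AS VOID
            // ...
            RETURN
    END CLASS

    FUNCTION Start() AS VOID
        LOCAL oCust AS Customer
        oCust := Customer{}
        oCust:SetBalanceLoose( "100.00" )   // compiles, misbehaves at runtime
        oCust:SetBalance( "100.00" )        // rejected: cannot convert STRING to REAL8
        RETURN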

I'll have another Best Practices question in another thread.

Strongly-Typed methods within classes

Posted: Sat May 22, 2021 9:35 am
by TerryB1
Hi Joe
Firstly, I must apologise for the absolute garbage I put in my last post; I don't know what track my mind was on when I wrote it.
Thanks, too, to Chris and Robert for taking on the roles of Garbage Collectors.

Now on to what I should have written:
When your program runs it generates a thread of execution which follows a path from point to point. Each point is displaced from the previous point in both time and distance, in just the same way as we go about things in real life. This means we can, if we want to, see everything in real-world terms, and this is what I recommend.
When we get to any point in the real world, we have two options: move quickly on to the next point, or do something useful there (taking extra time). We clearly want to structure things so they are done as efficiently as possible: get to the point at which we do things as directly as possible, and, once there, do them as efficiently as possible. Note the two levels of structuring.
Similarly, it is the coding structure of any application, at both levels, that determines that application's performance.
Again, there are two factors here:
One: code runs in computer memory, which is a finite commodity and has to be shared around. This sharing leads to memory fragmentation and an increasing time overhead at runtime.
Two: actions doing the same thing may be coded in different ways, some taking longer than others.
Moving on to .Net.
.Net implements, fully automatically, a garbage collector which reclaims unused objects and compacts the managed heap, essentially de-fragmenting memory. This is a time-consuming, complex process, dependent upon factors including when it is done and the degree of fragmentation at the time. It is a process over which we have very limited control and which, in most cases, we don't need to control.
Suffice to say that all this means it is the garbage collection process implemented by the Garbage Collector that dominates application performance.
How does all this reflect back into "code think"?
Well, in C#, an optimal structure comes about at one level, the most significant, by thinking about namespaces and what goes into them.
At another level it is the way you code things. For example, a "for loop" may be more efficient than many of the other ways you could express the same intent.
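To make that concrete (an illustrative X# sketch, not code from this thread): both functions below sum an array, but the FOR loop iterates directly, while AEval() evaluates a codeblock for every element.

    FUNCTION SumFor( aVals AS ARRAY ) AS REAL8
        LOCAL nSum AS REAL8
        LOCAL i AS DWORD
        nSum := 0
        FOR i := 1 UPTO ALen( aVals )      // direct, typed iteration
            nSum += aVals[i]
        NEXT
        RETURN nSum

    FUNCTION SumAEval( aVals AS ARRAY ) AS REAL8
        LOCAL nSum AS REAL8
        nSum := 0
        AEval( aVals, {|x| nSum += x} )    // one codeblock call per element
        RETURN nSum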
It is all trade-offs: easier ways to think versus time.
Structuring your programs optimally is the key, and I hope that thinking in terms of real-world actions shows an easy way to do it.
Of course, some of your older Clipper programs may not have had this line of thinking behind them, and how it all translates through the VOExporter will be difficult to determine. Nevertheless, I hope the above helps in some way.

Terry

Strongly-Typed methods within classes

Posted: Mon May 24, 2021 1:47 pm
by OhioJoe
Terry, you mentioned the most important word in software development:

THINK

If you're like me, you're presented with a customer need, so you develop the solution, test it, and if it works on the first few tries, post it and move on. My relationship with most customers is that they know that if there's a problem with anything I develop, it gets fixed quickly -- sometimes within the hour, but always within 24 hours. So I don't spend too much time first THINKING and later testing.

Perhaps I should. I have one project that tries to merge two large databases. It first checks name matches and then gets into a labyrinthine sequence of sub-routines to check for things that might establish a match. For example, if the name is "Robert van der Hulst" the app first tries to separate it into first and last name, checking for exceptions, which involves two functions, literally called "Mac" and "Van", to work out last names beginning with "Mc", "Mac", "Van", "vander", "van der" and the like. Then, if I have a "Bob van der Hulst" in the database, another routine is called to resolve the "Bob" as the implicit "Robert". Then we deal with the capitalization. Then the postal address is standardized. Then the county is established on the basis of the postal code. And so on. Dozens of functions and methods can be called for just one record out of half a million.
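To give a flavour of the surname-prefix part (an illustrative sketch only, not the actual "Van" routine; the prefix list is just an example, and real code would repeat the scan and handle "Mc"/"Mac" separately):

    // Find where the last name starts: normally after the last blank,
    // but moved left when the text before it is a known surname prefix.
    FUNCTION LastNameStart( cFull AS STRING ) AS DWORD
        LOCAL aPrefix AS ARRAY
        LOCAL cBefore AS STRING
        LOCAL nPos, i AS DWORD
        aPrefix := { "van", "van der", "de", "von" }   // examples only
        nPos := RAt( " ", cFull ) + 1                  // default: after the last blank
        FOR i := 1 UPTO ALen( aPrefix )
            cBefore := aPrefix[i] + " "
            IF nPos > SLen( cBefore ) .AND. ;
               Lower( SubStr3( cFull, nPos - SLen( cBefore ), SLen( cBefore ) ) ) == cBefore
                nPos -= SLen( cBefore )                // pull the prefix into the last name
            ENDIF
        NEXT
        RETURN nPos   // "Robert van der Hulst" -> 8, i.e. last name "van der Hulst"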

The app was first developed in Clipper, and I've updated it many times in VO, where it often slows down on very large databases. It will zip through the first 100k or so and then, after running overnight, it's still grinding through the 120,000s. So obviously something is poorly conceived. The question is: where?

Oh, and I have another VO app that handles scheduling and payroll for a large trucking company (700 drivers, 4,000 weekly routes). There's at least one workstation in use 24/7 that is rarely rebooted and where my app is never closed. No complaints, so I did something right, at least there. (Actually the hero is Robert, who wrote VO 2.8 SP4b.)

I think this demonstrates that despite .NET's memory-management capabilities, we still need to worry about memory management and program structure as we develop applications. That is why I ask the basic, seemingly simple-minded questions about constants and strong typing. (And there will be more.) These issues matter when a simple routine is called hundreds of thousands of times during one session.

As Terry teaches us, THINKING is a big part of the process.

Strongly-Typed methods within classes

Posted: Mon May 24, 2021 2:21 pm
by FFF
Joe,
couldn't say it better. ;-)
BTW, reading your post I stumbled across "Then the postal address is standardized. Then the county is established on the basis of postal code." Wouldn't it be better to start there, or even at the "country" level? Depending on your data distribution, I would imagine you could shrink the target set massively before you dig into the costly parts...
But that's after only brief thinking and with no idea about the data, so feel free to dismiss it ;-)

Strongly-Typed methods within classes

Posted: Mon May 24, 2021 6:26 pm
by OhioJoe
Karl,

I was speaking of "counties", of which there are about 3,000 in the US, of different sizes and populations: Ohio has 88, Texas 254, California 58, Florida 67 and so on. In New England (Maine, Massachusetts, New Hampshire, Vermont, Connecticut and Rhode Island) they're called "towns" and are mostly quite small. Oh, and there's the problem of postal codes overlapping county lines, and in New Jersey and most of New England the zip codes (postal codes) begin with 0, which Excel almost always renders without the leading zero, i.e. "09508" becomes "9508", which means yet another routine has to be run.
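(The leading-zero repair itself is a one-liner in VO/X#; a sketch, assuming five-digit zips arriving as strings:)

    FUNCTION FixZip( cZip AS STRING ) AS STRING
        // "9508" -> "09508"; zips that arrive as numerics need an
        // NTrim() conversion to STRING before this call.
        RETURN PadL( AllTrim( cZip ), 5, "0" )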

Address standardization is used mostly for duplicate resolution and compatibility with postal-code lookups. For example:

"1234 Main Street, NW"
becomes
"1234 Main St NW"

While we're on this subject, I do have another question. I mentioned the roughly 3,000 counties in the US, so I have a 3,000-record lookup table (DBF) that resolves a postal code to its county. Here's the basic record (array element):

{ cZipCode, cCity, cState, cCounty }
{ "43756", "McConnelsville", "OH", "58" }
So my choices are to
dBServer:Seek("43756")
or create an array at the top of the loop and
AScan( aZips, { |x| x[1] == "43756"} )

Which is faster? Which works best on large runs (100k or more loops)?

Thank you in advance for your comments.

Strongly-Typed methods within classes

Posted: Tue May 25, 2021 4:17 am
by wriedmann
Hi Joe,
an array lookup will be faster, because memory is cheap these days and a serial read of the database (SetOrder(0)) is really fast.
It would be better not to build a two-dimensional array, but an array of objects.
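Something like this (an untested sketch; the class, the field names and the server usage are just examples):

    CLASS ZipInfo
        EXPORT cZip    AS STRING
        EXPORT cCity   AS STRING
        EXPORT cState  AS STRING
        EXPORT cCounty AS STRING
    END CLASS

    FUNCTION LoadZips( oDB AS DbServer ) AS ARRAY
        LOCAL aZips AS ARRAY
        LOCAL oZip  AS ZipInfo
        aZips := {}
        oDB:SetOrder( 0 )                  // serial read, no index overhead
        oDB:GoTop()
        DO WHILE ! oDB:EoF
            oZip := ZipInfo{}
            oZip:cZip    := oDB:FIELDGET( #ZIP )
            oZip:cCity   := oDB:FIELDGET( #CITY )
            oZip:cState  := oDB:FIELDGET( #STATE )
            oZip:cCounty := oDB:FIELDGET( #COUNTY )
            AAdd( aZips, oZip )
            oDB:Skip( 1 )
        ENDDO
        RETURN aZips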
Wolfgang

Strongly-Typed methods within classes

Posted: Tue May 25, 2021 9:17 am
by TerryB1
Joe
You may like to consider the following:
Everything that is coded up translates into exercising control over the hardware. That "coding up" may be done directly by yourself, or by someone else via the computer language and the constructs it uses.
When searching an array, the time taken depends on how many steps are taken (by the hardware) through that array.
The search process will be most efficient if those steps are contiguous, and the number of steps will be minimised if the search stops as soon as the searched-for string is found.
I could guess, but certainly don't know, that dBServer:Seek("…") does just that, i.e. assumes that what you are seeking is unique, and stops when found. The time taken is thus totally dependent on where the searched-for string sits in the array.
But the question is: how many times are you doing this in your program?
Would it be worth the overhead of sorting your array once (a highly efficient process in C#), and then using that sorted array in all subsequent lookups?

That way you can ensure you don’t search sections of your array where a match is impossible.
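Concretely, that could be a binary search over the sorted array (an untested sketch, assuming the { zip, city, state, county } layout from Joe's post, sorted ascending on the zip element):

    FUNCTION FindZip( aZips AS ARRAY, cZip AS STRING ) AS ARRAY
        LOCAL nLo, nHi, nMid AS DWORD
        nLo := 1
        nHi := ALen( aZips )
        DO WHILE nLo <= nHi
            nMid := ( nLo + nHi ) >> 1      // integer midpoint
            IF aZips[nMid, 1] == cZip
                RETURN aZips[nMid]          // found: return the whole row
            ELSEIF aZips[nMid, 1] < cZip
                nLo := nMid + 1             // match can only be to the right
            ELSE
                nHi := nMid - 1             // match can only be to the left
            ENDIF
        ENDDO
        RETURN NULL_ARRAY                   // not found

Each probe halves the remaining range, so even 100k lookups against a 3,000-row table stay cheap.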
Terry

Strongly-Typed methods within classes

Posted: Sun Jul 04, 2021 7:47 am
by mainhatten
OhioJoe wrote: THINK
....
I have one project that tries to merge two large databases. ... Dozens of functions and methods can be called for just one record out of half a million. It will zip through the first 100k or so and then, after running overnight, it's still grinding through the 120,000s. So obviously something is poorly conceived. The question is: where?
....
THINKING is a big part of the process.
Joe,
a couple of years ago I was tasked with identifying duplicates, as well as probable family or household members, within the databases of a not-too-small insurance company here - of course sporting most of the things you describe, plus some real spelling errors (in names, addresses and so on).
You may have heard of the "fault-tolerant" database offerings, but there is no need to switch to them.
If "Levenshtein" or "Jaro-Winkler" does not already ring a bell, google them or perhaps read up on the topic here:

https://www.tek-tips.com/viewthread.cfm?qid=1805858

I opted for Levenshtein years before the above discussion (and only saw it when trying to find my own stuff),
so if you are OK with the VFP slant of xBase dialect, hop over to

http://fox.wikis.com/wc.dll?Wiki~LevenshteinAlgorithm

About 80% down the page you'll see the VFP source code I came up with after the previous code was taking too much time; it should be a piece of cake to translate to VO / X#. I think you will not find anything faster for the VFP runtime - I don't know enough about the VO runtime to claim the speed-king position there ;-)
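For readers who don't want to click through: the classic two-row Levenshtein looks roughly like this in X# (an untested sketch for clarity, not the optimised version from the link; it uses SubStr3(), which is exactly the slow spot mentioned below, and assumes the default one-based array indexing):

    FUNCTION Levenshtein( c1 AS STRING, c2 AS STRING ) AS DWORD
        LOCAL nLen1, nLen2, i, j, nCost AS DWORD
        LOCAL aPrev, aCurr, aSwap AS DWORD[]
        nLen1 := SLen( c1 )
        nLen2 := SLen( c2 )
        aPrev := DWORD[]{ nLen2 + 1 }       // row for the previous character of c1
        aCurr := DWORD[]{ nLen2 + 1 }       // row currently being filled
        FOR j := 1 UPTO nLen2 + 1
            aPrev[j] := j - 1               // distance from the empty string
        NEXT
        FOR i := 1 UPTO nLen1
            aCurr[1] := i
            FOR j := 1 UPTO nLen2
                nCost := IIF( SubStr3( c1, i, 1 ) == SubStr3( c2, j, 1 ), 0, 1 )
                // minimum of insert, delete and substitute
                aCurr[j+1] := Min( Min( aCurr[j] + 1, aPrev[j+1] + 1 ), aPrev[j] + nCost )
            NEXT
            aSwap := aPrev                  // reuse the two rows
            aPrev := aCurr
            aCurr := aSwap
        NEXT
        RETURN aPrev[nLen2 + 1]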

Also a bit of warning: my results were so successful that I was given more than 10x the data from other insurance branches as well, and then VFP speed was not enough. For that task I translated to C, as the basic SubStr() in xBase is a major hurdle; even with a hackish "direct memory access" similar to C/C# string[n] added to boost the VFP code, moving to C gave a sizable boost. Strongly-typed X# is recommended, at the least, for large datasets.

Of course I had recurring errors - a divorced couple living at the same address (a 10-flat building) whose ownership they split when divorcing... but Levenshtein took me miles further in.

And yes, if you set up a bounty to flog those who send out Excel sheets with address columns not predefined as text, gobbling up the leading zeroes, I'll chip in!

regards
thomas

Strongly-Typed methods within classes

Posted: Sun Jul 04, 2021 1:16 pm
by TerryB1
Joe / Thomas
It is a fact that everything you may code has a real-world analogy. So my way is to think real-world before ever attempting to "go code".

Everything to do with computing (programming) logic is in fact a probability. It is impossible to “compute with absolute certainty”, since this would require computing numerical series of infinite length taking infinite time.

This has always been the case, but it is (for me at least) only since the introduction of .Net that its implications have been brought sharply into focus.

What I find interesting, and to some extent illustrative, here is that the "Levenshtein" code is based on VFP, which allows "Go To" (I think), i.e. pointers; the Windows O/S is implemented in C, based on pointers; and the output from the .Net assemblies is again generated around pointers.
This is not a coincidence, nor should it be a surprise: iterating through a list is one of the very few things the electronics can do directly.
Thus, instead of trying to understand reams of code (O/S, program and so on), generated in different ways, you can assume that the result is optimal control of the electronics.
Pointers are the most fundamental, flexible and efficient means of such control, the basic operation being a simple increment.
Then all you need do is "THINK" how you, personally, could effect that same level of control over the electronics.
I think, Thomas, this ties in with the way you did things.

Joe's original question: "The app was first developed in Clipper, and I've updated it many times in VO, where it often slows down on very large databases. It will zip through the first 100k or so and then, after running overnight, it's still grinding through the 120,000s. So obviously something is poorly conceived. The question is: where?"

This would very much tie in with searching the database in its entirety and not exiting once the first match is found, or something along those lines. Non-contiguous data, perhaps: a potential killer in .Net?

Terry