Why can't code be uncompiled?

Squizzy@lemmy.world · 1 year ago

Why can't code be uncompiled?

Feyr@lemmy.world · 1 year ago

4+4 is 8. But so is 6+2. And 7+1.

You can’t guess which two numbers I started with knowing just the answer

Code is the same, just with much bigger numbers and more of them
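To make the arithmetic analogy concrete, here is a small Python sketch. The helper name `pairs_summing_to` is made up for illustration, and it only considers non-negative whole numbers, which is an assumption the comment above does not state.

```python
def pairs_summing_to(total):
    """Return every ordered pair of non-negative integers (a, b) with a + b == total."""
    return [(a, total - a) for a in range(total + 1)]


candidates = pairs_summing_to(8)

# Knowing only the answer (8) leaves all of these starting points possible.
print(len(candidates), "possible starting pairs:")
print(candidates)
```

Every pair in that list is a valid "source" for the same result, which is the many-to-one problem in miniature.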

Treczoks@lemmy.world · 1 year ago

Very nice explanation of a complex problem.

MTK@lemmy.world · edit-2 1 year ago

I would say that it’s more like 4+4=8 but the original could have been (1+1+1+1)+(3+1) or (2+2)+(1+2+1) etc.

Basically it’s the same thing but if you really want to understand the code and modify it in any meaningful way you have to know how it was intended and not just the results.

My point being that decompiling does give you something similar to the original. It’s not just a guess that gives you random code with the correct result, but it could be very different from the source code.

The reason is that the compiler does a lot of things to make it more efficient but that just means that while 1+1+1+1 can be efficiently written as 4, there still is a good reason for 1+1+1+1 from a logical sense. For example, if you’re counting something, it would make sense to say 1+1+1+1. But if you’re looking at a specific value, maybe it makes more sense to just say 4.
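A rough way to see this for yourself is a sketch using Python's own bytecode compiler as a stand-in for a native compiler (the thread is mostly about C++, so treat this as an analogy). It assumes a CPython version that folds constant expressions at compile time; `opnames` is a helper name invented for this example.

```python
import dis


def opnames(source):
    """Compile a single expression and list the names of its bytecode instructions."""
    code = compile(source, "<example>", "eval")
    return [ins.opname for ins in dis.get_instructions(code)]


counting = "1 + 1 + 1 + 1"   # reads like counting four things
literal = "4"                # reads like one specific value

for src in (counting, literal):
    print(src, "->", opnames(src))

# If the compiler folded the expression, no addition instruction is left.
has_addition = any(name.startswith("BINARY") for name in opnames(counting))
print("addition survives compilation:", has_addition)
```

If the folding happens, both expressions end up as "load a constant and return it", and nothing in the compiled form says which one the author typed.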

Dark Arc@social.packetloss.gg · 1 year ago

I actually work on a C++ compiler… I think I should weigh in. The general consensus here that things are lossy is correct but perhaps non-obvious if you’re not familiar with the domain.

When you compile a program you’re taking the source, turning it into a graph that represents every aspect of the program, and then generating some kind of IR (intermediate representation) that then gets turned into machine code.

You lose things like code comments because the machine doesn’t care about the comments right off the bat.

Then you lose local variable and function parameter names because the machine doesn’t care about those things.

Then you lose your class structure … because the machine really just cares about the total size of the thing it’s passing around. You can recover some of this information by looking at the functions, but it’s not always going to be straightforward because not every constructor initializes everything, things like unions add further complexity … and not every memory allocation uses a constructor. You won’t get any names of any data members/fields though because … again the machine doesn’t care.

So what you’re left with is basically the mangled names of functions and what you can derive from how instructions access memory.

The mangled names normally tell you a lot: the namespace, the class (if any), and the argument count and types. Of course that’s not guaranteed either; it’s just how we come up with unique, stable names for the various things in your program. It could function with a bunch of UUIDs if you set up a table on the compiler’s side to associate everything.
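To show how much a mangled name can carry, here is a deliberately tiny Python sketch that decodes a small subset of the Itanium C++ ABI scheme (the one GCC and Clang use on most non-Windows platforms). It only handles plain nested names followed by a few builtin parameter types; the function name `demangle` and the sample symbols are made up for illustration, and a real tool such as `c++filt` covers far more.

```python
# One-letter codes for a few builtin types in the Itanium C++ ABI.
BUILTIN_TYPES = {
    "v": "void", "b": "bool", "c": "char", "i": "int",
    "j": "unsigned int", "l": "long", "m": "unsigned long",
    "f": "float", "d": "double",
}


def demangle(symbol):
    """Decode symbols shaped like _ZN<len><name>...E<param codes> (toy subset only)."""
    if not symbol.startswith("_ZN"):
        raise ValueError("this sketch only handles simple nested names")
    pos = 3
    parts = []
    while symbol[pos] != "E":
        start = pos
        while symbol[pos].isdigit():
            pos += 1
        if pos == start:
            raise ValueError("expected a length prefix at offset %d" % start)
        length = int(symbol[start:pos])
        parts.append(symbol[pos:pos + length])   # namespace, class, or function name
        pos += length
    pos += 1                                     # skip the closing "E"
    params = [BUILTIN_TYPES[code] for code in symbol[pos:]]
    if params == ["void"]:
        params = []                              # a lone "v" means "no parameters"
    return "::".join(parts) + "(" + ", ".join(params) + ")"


print(demangle("_ZN3foo3barEi"))
print(demangle("_ZN4game6Player4moveEdd"))
```

Note what is present (scopes, function name, parameter types) and what is not: parameter names never made it into the symbol at all.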

But wait! There’s more! The optimizer can do some really wild things in the name of speed… Including combining functions. Those constructors? Gone, now they’re just some more operations in the function bodies. That function you wrote to help improve readability of your code? Gone. That function you wrote to deduplicate code? Gone. That eloquent recursive logic you wrote? Gone, now it’s the moral equivalent of a giant mess of goto statements. That template code that makes use of dozens of instantiated functions? Those functions are gone now too; instead it’s all the instantiated logic puked out into one giant function. That piece of logic computing a value? Well the compiler figured out it’s always 27, so the logic to compute it? Gone.
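As a hand-written illustration of the recursion point (not the output of any real compiler, and Python itself does not perform this transformation), compare a readable recursive function with the flattened shape an optimizer might leave behind. Both function names are invented for this sketch.

```python
def total_recursive(values, index=0):
    """Sum a list the way a person might write it: one clear recursive rule."""
    if index == len(values):
        return 0
    return values[index] + total_recursive(values, index + 1)


def total_flattened(values):
    """Same result, but shaped like a label plus a conditional jump."""
    acc = 0
    i = 0
    while True:                  # stands in for "goto top"
        if i == len(values):     # stands in for "if done, jump to exit"
            return acc
        acc += values[i]
        i += 1


sample = [3, 9, 15]
print(total_recursive(sample), total_flattened(sample))
```

Someone handed only the second version has no way to tell whether the author originally wrote a loop, a recursion, or a call to a library routine.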

Now, all of that doesn’t happen every time; not all of those optimizations are always possible, or always a good idea … But you can see how incredibly difficult it is to reconstruct a program once it’s been compiled and gone through optimization. If you do reconstruct it, there’s a very low chance that it will look anything like what you started with.

Treczoks@lemmy.world · 1 year ago

Just wait until you see the crazy optimizers for embedded systems. They take the complete code of a system into consideration and, in a number of compile passes, reuse code snippets from the app, libraries, and OS layer to create one big tangled mess that is hard to follow even if you have the source code…

noli@programming.dev · 1 year ago

Isn’t that still the same exact process as a normal compiler except in the case of embedded systems your OS is like a couple kilobytes large and just compiled along with the rest of your code?

As in, are those “crazy optimizations” not just standard compiler techniques, except applied to the entire OS+applications?

morhp@lemmynsfw.com · 1 year ago

The main difference is that when you compile a program for Windows, Linux etc., you have an operating system and kernel with their exposed functions/interfaces so even in a compiled program it’s pretty easy to find the function calls for opening a file, moving a window, etc. (as long as the developer doesn’t add specific steps hiding these calls). But in an embedded system, it’s one large mess without any interfaces apart from those directly on the hardware level.

Treczoks@lemmy.world · 1 year ago

In a way, yes. But it really creates a mess when the linker starts sharing code between your code, for which you have sources, and system code, for which you don’t, and then jumps into the middle of that system code. It’s a pain in the whatever to debug.

noli@programming.dev · 1 year ago

Don’t you have the code in most cases? Like with e.g. FreeRTOS? That’s fully open source.

Treczoks@lemmy.world · 1 year ago

For a number of reasons people use commercial OSes in this world, too.

noli@programming.dev · 1 year ago

Does commercial mean closed source in this context though? It seems like a waste of resources not to provide the source code for an RTOS.

Considering how small in size they tend to be + with their power/computational constraints I can’t imagine they have very effective DRM in place so it shouldn’t take that much to reverse engineer.

May as well just provide the source under some very restrictive license.

Treczoks@lemmy.world · 1 year ago

Yes, it is closed source, but you can buy a “source license”. Which is painfully expensive.

fidodo@lemmy.world · 1 year ago

You can. It’s called decompiling. The problem is that you lose all the human-friendly metadata that was in the original source code: comments, variable names, and certain code structures are lost forever because they were deleted in the compilation process. There are tools to help you reintroduce that stuff by going through the variables and trying to make sense of what they were for, but it’s super tedious. New AI tech can certainly improve on that by guessing what they were for, but you’ll never get the original metadata back.

Hjalmar@feddit.nu · 1 year ago

Also, if the code was run through an optimizer (which all modern games should be), it is even harder to make sense of, as it doesn’t necessarily have the same structure or the same variables as the original code.

BlackPenguins@lemmy.world · 1 year ago

This is also very similar to what happens if they run the source code through an obfuscation tool. Some people do this with Chrome extensions. Since they need to give you the source code for it to work on your machine, they just change the variables to a, b, c, d and route things through unneeded functions so you don’t know why anything is happening.

Kirk I. M.@universeodon.com · edit-2 1 year ago

@Squizzy
Lots of other people have addressed this, so I won’t repeat the whole thing. You can absolutely do disassembly work, it’s just a pain in the rear.
But it’s actually been done for Mario, since you brought it up:
https://github.com/IsoFrieze/SMWDisX
And also Pokemon.

Rikudou_Sage@lemmings.world · 1 year ago

As I’ve read somewhere once: it’s easy to make a burger out of a cow. Making a cow out of a burger is slightly harder.

That means that compiling code is a lossy process - the original code is lost in the process and can never be recovered because it doesn’t exist anywhere anymore.

Donebrach@lemmy.world · 1 year ago

This is the fundamental notion of nearly 95% of cyberpunk stories re: the human soul and yet everyone always is like “but I want my cool robot hand!”

Rikudou_Sage@lemmings.world · 1 year ago

Fuck soul, I want my cool robot hand!

howrar@lemmy.ca · 1 year ago

The best and simplest explanation I’ve seen: The machine code tells the computer what to do while the source code tells the human why it’s doing it.

Your computer doesn’t need all the “why” information to run the game, so the compilation process gets rid of it. What you’re left with are instructions on exactly what computations to do, and that’s all the computer needs.

For example, you can see in the machine code that two numbers are being added together. What do those numbers mean and why are we adding them? The source code can tell you that this is code that controls movement, one of the numbers is a velocity, the other is the player’s current position.

amio@kbin.social · 1 year ago

The general difference is that you lose out on metadata: the names, comments and organization that help the source code, in whatever programming language, make sense, but which are not needed to actually execute the desired behavior on your CPU. Usually that means stuff like sensible names for bits of your code: functions/reusable logic, storage locations for “health” or “armor” or “current powerup”, movement states, types of objects etc.

However, most of these are just another kind of number to the computer itself, so a lot of compilation processes strip a lot of this information. You could still reverse engineer it, but you’re missing context (like all those names) from the original code and that makes the work potentially pretty difficult. Bear in mind that reading actual original source code is sometimes cryptic enough, then compare “if player is dead, show game over screen” to if (sdfdfgsdfg == jgdfg) { lkghku(); } because the “decompiler” has to invent some kind of name for everything that’s missing. Now you have to deal with thousands of jfdsghklgs, and figure out what it all means.
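Here is a rough Python sketch of that effect. It does not decompile anything; it just takes readable source and replaces every identifier with a generated one, which is roughly the state a decompiler starts from. The class `StripNames`, the sample function, and the `v0`, `v1`, … naming are all invented for illustration, and the renaming is only safe because the sample uses no builtins or attributes. It needs Python 3.9+ for `ast.unparse`.

```python
import ast

SOURCE = '''
def update_position(position, velocity):
    # Move the player one step along its current heading.
    new_position = position + velocity
    return new_position
'''


class StripNames(ast.NodeTransformer):
    """Replace every identifier with a meaningless generated one."""

    def __init__(self):
        self.mapping = {}

    def _fresh(self, name):
        # Reuse the same placeholder each time the same original name appears.
        if name not in self.mapping:
            self.mapping[name] = "v%d" % len(self.mapping)
        return self.mapping[name]

    def visit_FunctionDef(self, node):
        node.name = self._fresh(node.name)
        self.generic_visit(node)
        return node

    def visit_arg(self, node):
        node.arg = self._fresh(node.arg)
        return node

    def visit_Name(self, node):
        node.id = self._fresh(node.id)
        return node


stripped = ast.unparse(StripNames().visit(ast.parse(SOURCE)))
print(stripped)
```

The printed function still adds two numbers and returns the result, but nothing in it says that one is a position and the other a velocity, and the comment is gone as well.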

🇰 🌀 🇱 🇦 🇳 🇦 🇰 ℹ️@yiffit.net · 1 year ago

You can certainly disassemble things back down to machine code, but there could be gaps and things lost in translation between the programming language used to create the program, and the machine code that results when you take it apart again.

When you program, like actually write the code, you’re using one language. When you compile it, you’re passing it off to a translator that converts it into another language. There could be even more layers of this depending on what you’re doing.

Now think about what happens when you open a translator, enter some words, translate it to one language, and then another, and back to the original. It comes out all wrong; the same thing happens with code. There’s nuance and flavor imparted by the language itself that isn’t kept through the interpretation of that language to the language that actually is used by the computer to do its tasks.

Moondance@sh.itjust.works · 1 year ago

The compilation process discards information, which makes it many-to-one: many different source programs can compile to the same output. A good decompiler lets you retrieve a program that is functionally equivalent to the source code, but not the exact source code.

fenynro@lemmy.world · edit-2 1 year ago

The long answer involves a lot of technical jargon, but the short answer is that the compilation process turns high level source code into something that the machine can read, and that process usually drops a lot of unneeded data and does some low-level optimization to make things more efficient during actual processing.

One can use a decompiler to take that machine code and attempt to turn it back into something human readable, but the result will usually be missing data on variable names, function calls, comments, etc. and will include compiler-added optimizations, which make it nearly impossible to reconstruct the original code.

It’s sort of the code equivalent of putting a sentence into Google translate and then immediately translating it back to the original. You often end up with differences in word choice that give you a good general idea of intent, but it’s impossible to know exactly which words were in the original sentence.
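The round-trip idea can be tried on a small scale in Python. This sketch parses source into a syntax tree and turns the tree straight back into source; it is an analogy for compile-then-decompile, not an actual decompiler, and the sample function is made up. It needs Python 3.9+ for `ast.unparse`.

```python
import ast

ORIGINAL = '''
def is_game_over(health):
    # A player with no health left is dead.
    return (health <= 0)   # parentheses kept for readability
'''

# Source -> syntax tree -> source again.
round_tripped = ast.unparse(ast.parse(ORIGINAL))

print(round_tripped)
print("identical to original:", round_tripped.strip() == ORIGINAL.strip())
```

The result behaves the same, but the comments and the author’s formatting choices are gone, and nothing in the tree could bring them back. A real compiler throws away far more than this.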

ryathal@sh.itjust.works · 1 year ago

Code can be decompiled, but generally the end result isn’t human readable. Just having the decompiled version isn’t that valuable. Having the source code as written is more helpful because you get the context of what things were named and how it was organized.

Decompiled code is a bit like reading a book with all the nouns being random letters and verbs being random numbers.

Valmond@lemmy.mindoki.com · 1 year ago

It can, but it would just be the assembly instructions.

Usually we use high-level programming languages when developing software: you’d make the cat class and the dog class, both inheriting from the animal class, etc., to make our job easier.

When you compile the code, all the cute stuff gets removed, and the resulting code gets optimized as much as possible, which means you can’t get back to the nice cat and dog code anymore.

A bit like a painter uses thousands of brush strokes to make a painting, it would be very hard to figure out which ones he made to make that specific painting, even if you have access to the painting.

schnurrito@discuss.tchncs.de · 1 year ago

Others have explained that decompiling is a thing.

I mainly work in Java where (due to the way Java bytecode works) decompiled code is actually very close to the original source code.
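Python bytecode is not Java bytecode, but it shows the same general idea: a bytecode format that keeps names around as metadata gives a decompiler much more to work with. A minimal sketch, with a made-up sample function:

```python
def update_position(position, velocity):
    new_position = position + velocity
    return new_position


code = update_position.__code__

# The compiled code object still carries the names from the source.
print(code.co_name)       # the function name
print(code.co_varnames)   # parameter and local variable names
```

Java class files similarly keep class, method, and field names, which is a large part of why decompiled Java tends to look familiar.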

Most games are written in low-level languages like C++ where that is not the case: variable and function names are lost during compilation.

the_q@lemmy.world · 1 year ago

The same reason you can’t unbake a cake I’d imagine.

NoIWontPickaName@kbin.social · 1 year ago

Has the cake been enclosed in an airtight container since it was done baking?

the_q@lemmy.world · 1 year ago

I dunno.

NoIWontPickaName@kbin.social · 1 year ago

Potentially, yes, if the answer is yes