How does a computer determine the data type of a byte?


31

For example, if the computer has 10111100 stored in one particular byte of memory, how does it know whether to interpret this as an integer, an ASCII character, or something else? Is the type information stored in an adjacent byte? (I don't think this would be the case, as it would mean using twice the space for one byte.)

I suspect that perhaps the computer doesn't know the type of the data at all, and that only the program using it knows. My guess is that because RAM is Random Access Memory and therefore isn't read sequentially, a particular program just tells the CPU to fetch the info from a specific address and the program defines how to treat it. This would seem to fit with programming things such as the need for typecasting.

Am I on the right track?


4
As a side note: if you are talking about types, you need to do so in the context of a language. It is left to the compiler to deal with that (symbols, type checking, operations, casts, memory addresses, etc.). The CPU and RAM know only about bytes.
jean

4
The data type of a byte is byte. Beyond that, the computer doesn't know anything. A program may interpret a byte or a group of bytes as a specific data type and try to operate on it, but there is no constraint: the same group of bytes can be interpreted as more than one data type (e.g. casting a pointer to a value type, C-like unions, etc.). The fact that RAM isn't read sequentially isn't really relevant; it's more that RAM is general purpose. For example, registers aren't read sequentially either, but they are typed.
BrainSlugs83

5
Shameless plug for myself, but this question was asked in more general terms on Programmers SE about a month ago. Here's my answer to it. It takes a while to get to the point, but it attacks the question from several angles.
Shaz

2
One useful consequence of the hardware being data-type agnostic is that a single byte (or word, etc.) can be interpreted in several ways by a program. In particular, temporarily interpreting a floating-point number as an integer is used in the fast inverse square root calculation.
Aoeuid
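
To make the comment above concrete, here is a minimal sketch of the fast inverse square root trick in Python. It assumes 32-bit IEEE 754 floats and uses the widely cited magic constant 0x5F3759DF; the helper names `float_bits` and `bits_float` are made up for this illustration.

```python
import struct

def float_bits(x: float) -> int:
    """Return the 32-bit IEEE 754 pattern of x as an unsigned integer."""
    return struct.unpack("<I", struct.pack("<f", x))[0]

def bits_float(i: int) -> float:
    """Reinterpret a 32-bit unsigned integer as an IEEE 754 float."""
    return struct.unpack("<f", struct.pack("<I", i))[0]

def fast_inv_sqrt(x: float) -> float:
    """Approximate 1/sqrt(x) for positive x."""
    i = float_bits(x)                    # same bits, now treated as an integer
    i = 0x5F3759DF - (i >> 1)            # integer arithmetic on the float's bits
    y = bits_float(i)                    # same bits, treated as a float again
    return y * (1.5 - 0.5 * x * y * y)   # one Newton-Raphson refinement step

print(fast_inv_sqrt(4.0))  # close to 0.5
```

Nothing in the 4 bytes changes when they are "converted" between float and integer here; only the operations applied to them change.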

@BrainSlugs83 Could you consider converting this into an answer?
DW

Answers:


38

Your suspicion is correct. The CPU doesn't care about the semantics of your data. Sometimes it does make a difference, though. For example, some arithmetic operations produce different results depending on whether the arguments are semantically signed or unsigned. In that case you need to tell the CPU which interpretation you intend, which you do by choosing the appropriate instruction.

It is up to the programmer to make sense of her data. The CPU only follows orders, blissfully unaware of their meaning or goals.
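
As an illustration, here is a small Python sketch that takes the byte from the question, 10111100, and reads it three different ways. The variable names are only for this example; note that the value is above 127, so it is not a valid 7-bit ASCII code and Latin-1 is used for the text reading instead.

```python
raw = bytes([0b10111100])  # the single byte from the question

# The same 8 bits, read under three different conventions.
as_unsigned = int.from_bytes(raw, "big", signed=False)  # unsigned integer
as_signed = int.from_bytes(raw, "big", signed=True)     # two's complement integer
as_latin1 = raw.decode("latin-1")                       # Latin-1 character

print(as_unsigned)  # 188
print(as_signed)    # -68
print(as_latin1)    # ¼
```

The byte itself carries no hint about which of these readings is "right"; the program picks one by the operation it applies.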


1
Regarding "when the arguments are semantically signed or unsigned", how would the CPU know? The CPU operations just see parameter bytes and lack that sort of data type context awareness. You imply the data type by choosing the appropriate CPU operation (or your compiler does).
Shiv

4
@Shiv In such cases, the CPU is actually issued a different instruction to process signed numbers versus unsigned numbers. As in the OP's suspicions, the program is obliged to provide those details, because the CPU is unaware.
Cort Ammon - Reinstate Monica

2
I've been working with computers for as long as I can remember, and even though I know that the CPU doesn't care about the constructs we use in high-level programming, this separation of concepts still freaks me out from time to time.
Loupax

1
@Loupax Well, working with a really low-level assembly helps quite a bit - even mov al, 42 is kind of high-level - it's obvious there's only one possible instruction this could call, but it's still somewhat abstracted away. However, using mov.8 al, 42 explicitly makes this painfully obvious :)
Luaan

1
@Shiv: I'd like to note that there are machines where the data in memory are typed. These are called tagged memory architectures (or simply tagged architectures) but they've not been as successful commercially as regular architectures partly because we now program mostly in compiled languages instead of assembly and the compiler takes care of typing. See: en.wikipedia.org/wiki/Tagged_architecture
slebetman

14

As others have already answered, today's common CPUs do not know what a given memory position contains; the software decides.

However, there are other possibilities. Lisp Machines for example used a tagged architecture which stored the type of each memory position; that way the hardware itself could do some of the work of high-level languages.

And even now, I guess you could consider the NX bit in Intel, AMD, ARM and other architectures to follow the same principle: distinguish at the hardware level whether a given memory zone contains data or instructions.

Also, just for completeness, in Harvard architectures (like some microcontrollers) data and instructions are physically separated, so the CPU does have some idea of what it is reading.

In this Quora question there's some commentary on how the tagged memory worked, its performance implications and demise, and more.
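
To give a feel for the difference, here is a toy sketch in Python contrasting an untagged add with a tagged one. It is only a model of the idea: the `Word` class, the tag names and the 8-bit width are invented for this example and do not describe how Lisp Machines actually encoded their tags.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Word:
    """A hypothetical tagged memory word: a type tag stored next to the bits."""
    tag: str   # invented tag names: "int" or "char"
    bits: int

def untagged_add(a: int, b: int) -> int:
    """Conventional hardware: adds whatever bits it is given."""
    return (a + b) & 0xFF

def tagged_add(a: Word, b: Word) -> Word:
    """Tagged hardware: checks the tags before doing the arithmetic."""
    if a.tag != "int" or b.tag != "int":
        raise TypeError(f"cannot add {a.tag} and {b.tag}")
    return Word("int", (a.bits + b.bits) & 0xFF)

print(untagged_add(65, 1))                       # 66, even if 65 was meant as 'A'
print(tagged_add(Word("int", 65), Word("int", 1)))
try:
    tagged_add(Word("char", 65), Word("int", 1))
except TypeError as exc:
    print("trap:", exc)
```

The price of the tagged version is visible even in the toy: every word carries extra storage for the tag, and every operation pays for the check.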


Tagged architecture is an interesting note. Would it be significantly faster?
Bassinator


3

There are no type annotations.
RAM stores raw data, and the program defines what to do with it.

With CPU registers it is a bit more complicated: if a register has a given type (like the FPU registers), the instruction tells you what is inside.
Operations on floating-point registers explicitly use typed data. You or your compiler decide what is put there and when, so you do not have the same freedom.
The computer makes no assumptions about the underlying data in RAM or in registers, with one exception: typed registers in the CPU hold a known type and are optimised to deal with it. This only shows that there are places where data is expected to be of a certain type, but nothing stops you from casting strings to floats and multiplying them.

In programming languages you specify the type, or, in higher-level languages, data is generic and the compiler / interpreter / VM records what is inside, at the cost of some overhead.
For example, in C the pointer type tells the compiler what to do with the data and how to access it.

Of course, you can read a string (characters), treat its bytes as floating-point values or integers, and mix them.
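
A short Python sketch of that last point, assuming little-endian byte order and 32-bit IEEE 754 floats: the four bytes of the string "ABCD" are read as text, as an unsigned integer, and as a float.

```python
import struct

raw = b"ABCD"  # four bytes: 0x41 0x42 0x43 0x44

as_text = raw.decode("ascii")             # read as characters
(as_uint32,) = struct.unpack("<I", raw)   # read as a little-endian 32-bit integer
(as_float32,) = struct.unpack("<f", raw)  # read as a little-endian 32-bit float

print(as_text)         # ABCD
print(hex(as_uint32))  # 0x44434241
print(as_float32)      # roughly 781.035
```

Here `struct.unpack` plays the role that a pointer cast plays in C: it does not change the bytes, it only chooses how they are read.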


Even bits in an FPU register don't always represent floating-point values. In the old days (maybe not so much anymore?), a common optimization was to use floating-point registers (64 bits or larger) to copy data faster than general-purpose/integer registers (32 bits) could; being twice as big, they were generally able to copy data twice as fast.
Seth

1
I totally agree with you; that is why I wrote that somebody might push strings there. And in those same days people did floating-point operations on integers, because it was faster. That is the point!
Evil

@HCBPshenanigans there are instructions that manipulate floating-point values. If FADD is used it only makes sense that the (4,8,or 10)-byte groups of memory held floating-point numbers. That's true for several kinds of instruction: multiply two integers only makes sense if they are integers, jump only makes sense if it's an address.
JDługosz

@seth and evilJS: that's not assumed to be the case for the legacy stack-based 8087 floating-point instructions, but it is the case for the newer SIMD registers, which may be used just for loading/saving with no interpretation (though the data must be aligned), with the caveat that if the SIMD registers were never used then they don't need to be saved in a context switch. If you (only) move 8 bytes via an XMM register it's a net loss, as the whole set needs to be saved.
JDługosz

3

The CPU doesn't care; it executes machine code, which merely moves data around, shifts it, adds it, or multiplies it...

Data types are a higher-level language concept: in C or C++ you need to specify types for every single piece of data you manipulate; the C/C++ compiler takes care of transforming these pieces of data into the right commands for the CPU to process (compilers write assembly code).

In some even higher-level languages, types may be inferred: in Python or JavaScript, for example, one does not have to specify data types, yet data still has a type. In Python you can't add a string to an integer, but you can add a float to an integer. The 'compiler' in the case of JavaScript is a JIT (just-in-time) compiler: JavaScript is often called an 'interpreted' language because historically browsers interpreted JavaScript code, but nowadays JavaScript engines are compilers.

Code always ends up being compiled to machine code, but obviously the machine code format depends on the machine you're targeting (x86 64-bit code won't work on an x86 32-bit machine or an ARM processor, for example).

So there is actually a lot of layers involved in running interpreted code.

Java and C# are other interesting ones: Java or C# code is technically 'compiled' to an intermediate binary (bytecode), but that code is then itself interpreted by a runtime, which is specific to the underlying hardware (for Java, one needs to install the JRE targeting the right machine to run Java binaries (JARs)).
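
A quick Python sketch of the point that values keep a type at run time even when no types are written in the source. The variable names are only for this example.

```python
# No declared types anywhere, yet every value carries one at run time.
total = 1 + 2.5  # int + float is allowed; the int is converted to float
print(type(total).__name__, total)  # float 3.5

try:
    "1" + 1  # str + int is rejected by the interpreter
    error = None
except TypeError as exc:
    error = type(exc).__name__
    print("rejected:", exc)
```

This check is done by the language runtime, which stores type information alongside each value; the CPU underneath still only sees bytes.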


A compiler compiles, be it JIT or not; and an interpreter interprets without compiling (because if not it would be a compiler!). They are very different things. And regarding "Java being funny" because of bytecode interpretation, consider that even x86 machine code will actually be interpreted (or even compiled?) by the very microprocessor into microcode.
hmijail

Thanks for the clarification... Agreed: a compiler compiles, and an interpreter interprets. In the case of JavaScript, though, the story is a bit complicated, since some older browsers interpret the code while more modern browsers actually compile just-in-time, which is probably why it is still referred to as an 'interpreted' language even though it technically isn't anymore.
MrE

But AFAIK, JS starts interpreted, and then might get compiled as needed. And JITs can switch from interpreted to compiled to interpreted again, depending on lots of things. For example, a piece of code might get compiled for a variable having a given type; but then the code is run again with that variable having a different type, so the existing compiled code can't be used so the interpreter jumps in - until the code gets compiled again for the new type...
hmijail

You're citing me on something I didn't say, please remove it because it's totally wrong. Microcode has NOTHING to do with the OS; it's something internal to the microprocessor. 32 bit or 64 bit also has nothing to do with it.
hmijail

3

Datatypes are not a hardware feature. The CPU knows a couple (well, a lot) of different commands. Those are called the instruction set of a CPU.

One of the best known ones is the x86 instruction set. If you search for "multiply" on this page, you get 50 results. MULPD and MULSD for the multiplication of doubles, FIMUL for integer multiplication, ...

Those commands work on registers. Registers are memory slots which can contain a fixed number of bits (often 32 or 64, depending on which architecture your CPU uses), no matter what these bits represent. Hence the CPU instruction interprets the values of the registers in a different way, but the values themselves don't have types.
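
The following Python sketch imitates that idea with two made-up "instructions" operating on the same 64-bit register contents: `int_add` treats the bits as unsigned integers, `float_add` treats them as IEEE 754 doubles. The function names are hypothetical and only stand in for the different opcodes a real CPU would offer.

```python
import struct

MASK64 = (1 << 64) - 1

def int_add(a_bits: int, b_bits: int) -> int:
    """Add two 64-bit patterns as unsigned integers (wrapping on overflow)."""
    return (a_bits + b_bits) & MASK64

def float_add(a_bits: int, b_bits: int) -> int:
    """Add two 64-bit patterns as IEEE 754 doubles; return the result's bits."""
    a = struct.unpack("<d", struct.pack("<Q", a_bits))[0]
    b = struct.unpack("<d", struct.pack("<Q", b_bits))[0]
    return struct.unpack("<Q", struct.pack("<d", a + b))[0]

reg = 0x3FF0000000000000  # the bit pattern of the double 1.0

print(hex(int_add(reg, reg)))    # 0x7fe0000000000000
print(hex(float_add(reg, reg)))  # 0x4000000000000000, the bits of 2.0
```

The inputs are identical in both calls; the "type" lives entirely in which operation was chosen.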

An example was given at PyCon 2017 by Stuart Williams:

[image: slide from the talk, not reproduced here]


1
Note that this isn't strictly true: there are special-purpose registers that can't contain arbitrary values (for example, pointer registers that aren't just any address and don't allow arbitrary additions, or floating point registers where you can't store non-normalized values). But your answer is correct for general-purpose registers on most architectures.
Gilles 'SO- stop being evil'

2

...that a particular program just tells the CPU to fetch the info from a specific address and the program defines how to treat it.

Exactly. But RAM is not read "sequentially", and it stands for Random Access Memory which is exactly the opposite.

Besides not knowing what a byte means, you don't even know if it's a byte on its own or a fragment of a larger item like a floating-point number.

I'd like to add to other answers by giving some specific examples.

Consider 01000001. The program might copy it from one place to another as part of a large parcel of data without any regard to its meaning. But copying that to the address used by the text-mode video buffer will cause the letter A to show in some position on the screen. The exact same action when the card is in a CGA graphics mode will display a red pixel and a blue pixel.

In a register, it could be the number 65 as an integer. Doing arithmetic to set the 32's bit could mean anything without context, but might specifically be changing a letter to lower case.

The 8086 CPU (still) has a special instruction called DAA that is used when the register holds 2 decimal digits, so if you just used that instruction you are interpreting the byte as the two digits 41.
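
The readings of 01000001 mentioned so far can be checked with a few lines of Python. This is only a sketch of the arithmetic; the packed BCD decoding is written out by hand here rather than using any CPU instruction.

```python
b = 0b01000001

as_int = b                                # plain integer
as_char = chr(b)                          # ASCII character
lowered = chr(b | 0x20)                   # set the 32's bit: upper case to lower case
as_bcd = (b >> 4) * 10 + (b & 0x0F)       # packed BCD: one decimal digit per nibble

print(as_int)   # 65
print(as_char)  # A
print(lowered)  # a
print(as_bcd)   # 41
```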

Programs crash because a memory word is read as if it were a pointer when something else was stored there.

Using a debugger, inspecting memory, a map is used to guide the interpretation for display. Without this symbol information, a low-level debugger lets you specify: show this address as 16-bit words, show this address as long floating point, as strings... whatever. Looking at a network packet dump or unknown file format, puzzling it out is a challenge.

That's a major source of power and flexibility in modern computer architecture: a memory cell can mean anything, data or instruction, implicit only in what it "means" to the program by what it does with the value and how that affects subsequent operations. Meaning is deeper than integer width: are these characters ... characters in ASCII or EBCDIC? Forming words in English or SKU product codes? The address to send to or the return address it came from? The lowest-level interpretation (logical bits; integer-like, signed or unsigned; float; BCD; pointer) is contextual at the instruction-set level, but you see that it's all context at some level: the "to" address is what it is because of the location where it's printed on the envelope. It is contextual to the rules of the postman, not the CPU. The context is one big continuum, with bits on one end of it.


※ Footnote: The DAA instruction is encoded as the byte 00100111. So that byte is the aforenamed instruction if read in the instruction stream, and the digits 27 if interpreted as BCD digits, and 0x27 = 39 as an integer, which is the apostrophe (') in ASCII, and part of the interrupt table (half of INT 13 2-byte address, used for BIOS service routines).


1

The only way the computer knows that a memory location is an instruction is that a special-purpose register called the instruction pointer points to it at one point or another. If the instruction pointer points to a memory word, it is loaded as an instruction. Other than that, the computer has no way of knowing the difference between programs and other types of data.
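
A toy machine in Python makes this visible. The three opcodes below are invented for this sketch and belong to no real CPU; the point is only that the same memory gives different results depending on where the instruction pointer starts.

```python
# Hypothetical opcodes for a toy accumulator machine.
HALT = 0x00      # stop and return the accumulator
ADD_IMM = 0x01   # add the next byte to the accumulator (2-byte instruction)
DOUBLE = 0x02    # double the accumulator (1-byte instruction)

def run(memory: bytes, ip: int) -> int:
    """Execute from address ip; whatever ip points at is treated as an opcode."""
    acc = 0
    while True:
        opcode = memory[ip]
        if opcode == HALT:
            return acc
        elif opcode == ADD_IMM:
            acc += memory[ip + 1]   # the following byte is read as data
            ip += 2
        elif opcode == DOUBLE:
            acc *= 2
            ip += 1
        else:
            raise ValueError(f"unknown opcode {opcode:#04x} at address {ip}")

memory = bytes([0x01, 0x02, 0x01, 0x02, 0x00])

print(run(memory, 0))  # 4: the byte at address 1 is the operand of ADD_IMM
print(run(memory, 1))  # 2: the same byte is now executed as DOUBLE
```

The byte 0x02 at address 1 is data in the first run and an instruction in the second, and nothing in memory distinguishes the two cases.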

Licensed under cc by-sa 3.0 with attribution required.