
How To Detect Source Code In A Text?

Is it possible to detect programming language source code (primarily Java and C#) in a text? For example, I want to know whether a given text contains any source code. ...

Solution 1:

There are some syntax highlighters around (pygments, google-code-prettify) that have already tackled code detection and classification. Studying their source code should give you an idea of how it is done.

(Now that I've looked at pygments again, I'm not sure whether it can autodetect the programming language. But google-code-prettify definitely can.)


Solution 2:

You would need a database of keywords, each tagged with its role (definition, control structure, etc.), as well as a list of the operators and special characters used throughout the languages' structure (e.g. (, }, *, ||), and a list of regex patterns.
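As a rough illustration of such a database, here is a minimal Python sketch. The keyword sets, operator list, and regex patterns are made-up samples, not a complete or tuned vocabulary, and count_hits is a hypothetical helper name.

```python
import re

# Hypothetical, deliberately tiny "database"; a real one would be far larger.
KEYWORDS = {
    "definition": {"class", "interface", "void", "int", "public", "private", "static", "namespace"},
    "control": {"if", "else", "for", "while", "switch", "return", "try", "catch"},
}
OPERATORS = ["||", "&&", "++", "==", "{", "}", ";", "(", ")", "*"]
PATTERNS = [
    re.compile(r"\b\w+\s*\([^)]*\)\s*\{"),    # something shaped like `name(args) {`
    re.compile(r"\b\w+\s*=\s*[^=].*;\s*$"),   # an assignment ending in `;`
]

def count_hits(line):
    """Count, per category, how many code-like features one line contains."""
    words = re.findall(r"[A-Za-z_]\w*", line)
    hits = {kind: sum(w in vocab for w in words) for kind, vocab in KEYWORDS.items()}
    hits["operator"] = sum(line.count(op) for op in OPERATORS)
    hits["pattern"] = sum(bool(p.search(line)) for p in PATTERNS)
    return hits
```

Ordinary prose can still hit the keyword sets (the word "for" is common in English), so these counts are only raw evidence that a later scoring step has to weigh.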

The best bet, to reduce iterations, is to search for the keywords, operators, and special characters first. Use a spatial/frequency formula to score each stretch of text, and only start parsing where the score suggests the text may be code. Then it's on to identifying which language it is and where the code ends.
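One possible reading of the frequency idea, sketched in Python: score each line by the share of code-like tokens and group consecutive high-scoring lines into spans. The token list, the 0.15 threshold, and the function names are all assumptions for illustration and would need tuning on real text.

```python
import re

# Hypothetical token list and threshold; both are guesses, not tuned values.
CODE_TOKENS = re.compile(
    r"[{}();=<>]|\+\+|\|\||&&|\b(?:for|if|while|class|void|int|return|public)\b"
)
THRESHOLD = 0.15  # assumed cut-off: code-like tokens per non-space character

def code_density(line):
    """Ratio of code-like tokens to non-whitespace characters in one line."""
    chars = len(re.sub(r"\s", "", line))
    if chars == 0:
        return 0.0
    return len(CODE_TOKENS.findall(line)) / chars

def find_code_spans(lines):
    """Return (start, end) index pairs, end exclusive, of consecutive code-like lines."""
    spans, start = [], None
    for i, line in enumerate(lines):
        if code_density(line) >= THRESHOLD:
            if start is None:
                start = i          # a code-like run begins here
        elif start is not None:
            spans.append((start, i))  # the run ended on the previous line
            start = None
    if start is not None:
        spans.append((start, len(lines)))
    return spans
```

This naive version splits a span at blank lines and at comment lines, which score low; a more careful formula would smooth the score over neighbouring lines before thresholding.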

Because many languages have similar code, this might be hard. Which language is the following?

for(i=0;i<10;i++){
   // for loop
} 

Without the comment, it could be any of many languages. With the comment, you can at least rule out Perl, since it uses # as the comment character, but it could still be JavaScript, C/C++, etc.
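The elimination step described above could look roughly like this in Python. The LINE_COMMENT table is a hypothetical, incomplete sample, and candidates_by_comment is an illustrative name.

```python
# Hypothetical, incomplete table of line-comment markers per language.
LINE_COMMENT = {
    "Java": "//", "C#": "//", "C/C++": "//", "JavaScript": "//",
    "Perl": "#", "Python": "#",
}

def candidates_by_comment(snippet):
    """Keep only the languages whose line-comment marker appears in the snippet."""
    markers = set(LINE_COMMENT.values())
    seen = {
        m
        for line in snippet.splitlines()
        for m in markers
        if line.strip().startswith(m)
    }
    if not seen:
        return sorted(LINE_COMMENT)  # no comment found: nothing can be ruled out
    return sorted(lang for lang, m in LINE_COMMENT.items() if m in seen)
```

Only lines that start with a marker are inspected, so trailing comments are missed, and lines such as #include in C or #region in C# would be misread as # comments. A real implementation would need a proper tokenizer per language.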

Basically, you will need to do a lot of recursive lookups to identify proper code, which means that if you want something quick, you'll need a beast of a computer or a cluster of computers. Additionally, the search formula and the identification formula will need to be well refined for each language.

Without library calls or includes to go on, identifying the exact language of a snippet may be impossible. The best you can do then is list every language it could belong to, and for that you'll need a syntax library for each candidate.
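To make that concrete, here is a small Python sketch that looks for language-specific includes and library calls and falls back to listing every candidate when none are found. The SIGNATURES patterns are a hypothetical, far from exhaustive sample, and possible_languages is an illustrative name.

```python
import re

# Hypothetical signature patterns; far from exhaustive.
SIGNATURES = {
    "Java": [r"^\s*import\s+java\.", r"\bSystem\.out\.println\b", r"^\s*package\s+[\w.]+;"],
    "C#": [r"^\s*using\s+System\b", r"\bConsole\.WriteLine\b", r"^\s*namespace\s+\w+"],
    "C/C++": [r"^\s*#include\s*[<\"]", r"\bprintf\s*\(", r"\bstd::"],
}

def possible_languages(snippet):
    """Return the languages whose signatures match best, or all if none match."""
    scores = {
        lang: sum(bool(re.search(p, snippet, re.MULTILINE)) for p in patterns)
        for lang, patterns in SIGNATURES.items()
    }
    best = max(scores.values())
    if best == 0:
        return sorted(scores)  # nothing distinctive: every language stays possible
    return sorted(lang for lang, n in scores.items() if n == best)
```

For a bare loop with no includes or library calls, this returns every language in the table, which matches the point above: the snippet alone does not identify the language.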

