MonoDevelop addin: Plumbing and autocompletion

by **nerdzero** » Sat Nov 17, 2012 7:00 pm

The purpose of this post is to detail my findings so far in the quest for implementing code-completion in the Cobra addin for MonoDevelop. The hope is that this will make it easier for other Cobra Commandos to contribute some code to the addin that gets us closer to Ctrl+Space goodness. Of course, contributions other than code such as testing and submitting bug reports or just general feedback are also welcome.

So, I'll discuss briefly how the addin makes itself available to MonoDevelop; how MonoDevelop calls it; then a little bit about compilers, parsing, abstract syntax trees; and then get into the actual code-completion portion and its current state. I ask that even if you already know all about compilers that you don't skip that section so you can keep me honest. Compilers are not my area of expertise.

===PART 1===
A lot of MonoDevelop's developer documentation is out of date, especially with the transition from version 2.8 to 3.0 as a few interfaces changed. But, the article on how you create an addin is still accurate: http://monodevelop.com/Developers/Artic ... ple_Add-in

You don't need to read that, at least not in detail, but do glance at it. See those XML snippets? Those are parts of an XML file called a manifest. The manifest describes the addin to MonoDevelop and tells it which parts of MonoDevelop it will extend. The Cobra addin also has a manifest. Open this up in a separate tab or window: https://github.com/ramon-rocha/MonoDeve ... .addin.xml See all those <Extension path= lines? These are the various "hooks" that are available to a MonoDevelop addin. There are more available than what we have so far in the Cobra addin but I'll just point out a few important ones. The addin classifies itself in the "Language bindings" category (line 11). This means that the addin will implement the language binding interface MonoDevelop.Projects.IDotNetLanguageBinding which inherits from the MonoDevelop.Projects.ILanguageBinding interface. You can see the structure of these interfaces here:

https://github.com/mono/monodevelop/blo ... Binding.cs

https://github.com/mono/monodevelop/blo ... Binding.cs

Take a look at line 55 of the Cobra addin manifest. This defines which class in the addin will implement those interfaces, MonoDevelop.CobraBinding.CobraLanguageBinding, the source code for which is here: https://github.com/ramon-rocha/MonoDeve ... ding.cobra

The 'compile' method starting from line 73 is the most interesting thing here as this is what is called when you execute the 'Compile' command from within MonoDevelop. It's well commented so it should be easy to see what's going on. It gets the project references, determines what kind of assembly to create (an exe or dll), which source files should be included, and which compiler configuration options have been specified, and then launches the Cobra compiler in a separate process. When that finishes, it parses the output, adds any errors/warnings to the BuildResult and returns that. Make sense?

So, at this point, you might be going back to the addin manifest to see what other extension points are defined. There are a few and you should be able to dig around now and figure out the names of the classes and see which interfaces they implement. In some cases, like for project templates, file templates, and syntax highlighting, there are no implementing classes but yet more XML files. If anyone has any questions on this part, feel free to ask.

Now, the next extension point I want to talk about is on line 73 of the addin manifest: MonoDevelop.CobraBinding.TypeSystem.Parser. But before we get to that, what is a parser anyways and what's the purpose of this class? Well, when you use a compiler to turn source code into something that is actually a runnable program, there's a lot of different things that happen.

First the compiler reads in all the source files you have specified and breaks them down into the discrete units of the language. Each unit represents something like a keyword, a type name, a variable name, a number, etc. and is usually represented by one or more non-whitespace characters. Each of these units is referred to as a "token". In Cobra, like Python, leading whitespace is significant as are 'newline' characters so these are considered tokens as well. So, once you have a list of all the tokens in a file, you can start analyzing these tokens to see if you have any syntax errors. If there are no errors, you can create the data structures in memory that represent the elements in the given source code file(s). This entire process is called "parsing". After parsing, you end up with a tree-like data structure that describes the source code. It's called, wait for it, a parse tree! It'll be easier to understand it with an example.

class Sample
    def main
        a = 1
        print a + 2

For this code, the token list looks something like this:

class (Keyword)
Sample (Identifier)
\n (End-Of-Line)
\t (Indent)
def (Keyword)
main (Identifier)
\n (End-Of-Line)
\t (Indent)
a (Identifier)
= (Assignment-Operator)
1 (Integer)
\n (End-Of-Line)
print (Keyword)
a (Identifier)
+ (Plus-Operator)
2 (Integer)
\n (End-Of-Line)
(Dedent)
(Dedent)
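
To make the tokenizing step concrete, here is a minimal sketch in Python (not Cobra, and not how Cobra's real tokenizer works). The `tokenize` function and the token kind names are made up for this illustration, and it assumes one tab per indent level and only knows a tiny subset of keywords.

```python
import re

KEYWORDS = {"class", "def", "print"}  # tiny illustrative subset, not Cobra's full list
TOKEN_RE = re.compile(r"\s*(?:(\d+)|([A-Za-z_]\w*)|(.))")

def tokenize(source):
    """Return a list of (kind, text) pairs, emitting INDENT/DEDENT for tab depth changes."""
    tokens = []
    depth = 0  # current indent depth, counted in tabs
    for line in source.splitlines():
        if not line.strip():
            continue  # blank lines produce no tokens in this sketch
        stripped = line.lstrip("\t")
        new_depth = len(line) - len(stripped)
        while depth < new_depth:
            tokens.append(("INDENT", ""))
            depth += 1
        while depth > new_depth:
            tokens.append(("DEDENT", ""))
            depth -= 1
        for number, word, other in TOKEN_RE.findall(stripped):
            if number:
                tokens.append(("INTEGER", number))
            elif word:
                kind = "KEYWORD" if word in KEYWORDS else "IDENTIFIER"
                tokens.append((kind, word))
            elif other.strip():
                tokens.append(("OPERATOR", other))
        tokens.append(("EOL", ""))
    # Close any blocks that are still open at the end of the file.
    tokens.extend([("DEDENT", "")] * depth)
    return tokens
```

Fed the Sample class (indented with tabs), this produces the same sequence of token kinds as the list above, ending with two dedents.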

And the parse tree looks something like this:

                 [Class Declaration (Sample)]
                              |
                          [Methods]
                              |
                       [Method (main)]
                              |
                        [Statements]
                        /          \
         [Assign Statement]      [Print statement]
             /        \                   |
[Identifier (a)]  [Integer (1)]  [Binary Math Expression]
                                   /       |        \
                   [Identifier (a)]  [Operator (+)]  [Integer (2)]

The parse tree above does not reflect what Cobra really generates, but I hope the idea is clear. After successful parsing, you will end up with a bunch of Nodes that are connected in a tree structure. This also can explain why sometimes when you try and compile some code, an error message comes back stating something about "line 1, column 7: Expecting end-of-line but got 'Foo' identifier" or "line 2, column 2: Unexpected indent". This is because the parser was going through the list of tokens and encountered something that it was not expecting! Like if it just consumed a 'class' keyword token, it expects the next token to be the name of that class (an identifier).
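
The "expecting X but got Y" behaviour can be sketched in a few lines of Python. This is only an illustration of the idea, not Cobra's parser: `expect`, `parse_class_header`, `ParseError` and the (kind, text) token pairs are all hypothetical names, and the error reports a token index rather than a line and column.

```python
class ParseError(Exception):
    """Raised when the parser meets a token it was not expecting."""

def expect(tokens, pos, kind):
    """Consume tokens[pos] if it has the wanted kind, otherwise raise a ParseError."""
    if pos >= len(tokens):
        raise ParseError(f"token {pos}: Expecting {kind} but reached end of input")
    actual_kind, text = tokens[pos]
    if actual_kind != kind:
        raise ParseError(f"token {pos}: Expecting {kind} but got {actual_kind} '{text}'")
    return text, pos + 1

def parse_class_header(tokens):
    """Parse 'class <name> <end-of-line>' and return a small dict as the tree node."""
    keyword, pos = expect(tokens, 0, "KEYWORD")
    if keyword != "class":
        raise ParseError(f"token 0: Expecting 'class' but got '{keyword}'")
    name, pos = expect(tokens, pos, "IDENTIFIER")
    _, pos = expect(tokens, pos, "EOL")
    return {"node": "ClassDecl", "name": name}
```

A token list such as class, Sample, Foo fails at the third token because an end-of-line was expected there, which is the same kind of message the compiler reports.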

Now, a parse tree is not enough information to determine if the source code is a valid program. It's enough to tell you if you have any syntax errors (unexpected tokens) but it doesn't have any information about the types for the various identifiers in the tree. Why? Well, remember that these trees are being generated one file at a time, or rather, one token stream at a time. If you refer to a class that is defined in another file, or perhaps to a class that is defined in a referenced assembly, you don't know whether or not that class is valid until you have the information for ALL the files that will be compiled plus all the references. Once you have all this information, the next phase in the compilation can start: Semantic analysis.

Semantics are all about how words and symbols are used. For the purposes of a compiler, it's about analyzing the parse tree and determining the types of all the various identifiers, determining whether any referenced types are not declared anywhere in the tree, generating errors for mismatched types such as 'cannot assign a Foo to a Bar', and possibly generating warnings such as 'variable 'b' in method 'foo' is never used'. You populate this information into the parse tree and possibly remove or add nodes or move them around. The end result is an 'abstract syntax tree'. It's called this because it is an abstract representation of the original source code. Now, finally, after all this work, if there are no errors, the compiler can now translate this tree into an intermediate language, which can optionally then be further optimized, and then ultimately generate a binary file for your program.
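
Here is a small Python sketch of why this phase needs every file and reference before it can complain about an unknown type. The data layout (a list of dicts with "file" and "classes" keys) and both function names are invented for this example and have nothing to do with Cobra's real node classes.

```python
def declared_types(modules):
    """Collect every class name declared in any parsed file."""
    return {cls["name"] for module in modules for cls in module["classes"]}

def undeclared_references(modules, known_external=()):
    """Return (file, type name) pairs for base types that are declared nowhere."""
    # Types from referenced assemblies are passed in as known_external.
    known = declared_types(modules) | set(known_external)
    problems = []
    for module in modules:
        for cls in module["classes"]:
            base = cls.get("base")
            if base is not None and base not in known:
                problems.append((module["file"], base))
    return problems
```

If class Foo in one file inherits from Bar in another, checking the first file alone reports Bar as undeclared, while checking both files together (or listing Bar as coming from a reference) reports nothing.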

A lot happens in that last sentence and we didn't talk about pre-processing or other things a compiler can do, but for the purposes of this post, we don't care about that stuff. Especially since eventually, Cobra uses C# as its intermediate code (or Java) and then hands that off to yet another compiler and the whole thing starts all over again.

And never mind the fact that the 'final' binary is actually yet another intermediate language to be compiled by .NET/Mono.

You haven't fallen asleep, have you? Okay, good. So, let's get back to the addin. The class 'MonoDevelop.CobraBinding.TypeSystem.Parser' invokes 'Cobra.Compiler.CobraParser.parseSource' to generate the parse tree. See line 62 here: https://github.com/ramon-rocha/MonoDeve ... rser.cobra

Why is this useful? Well, now that we have this tree (or rather if we don't get this tree), we will know if there are any syntax errors such as unexpected tokens in the source code and we can have MonoDevelop underline them with red squiggly lines (line 70). But what if we didn't get any errors? Well, then we got back something called a CobraModule which contains our actual parse tree. Hooray! Now we can use this information for code-completion, right? Well, not exactly. Remember, our parse tree doesn't yet have information about the types of the various variables, so if we typed "foo" and then a "." we wouldn't know which methods or member variables to list for completion proposals. However, if we typed just "." or "_" and then wanted some autocompleted members, this is actually possible at this point as we have all the required information (except for inherited members as pointed out by Charles below).

===PART 2===
Let's jump over to the completion extension now to see how we can do this. The code for it is over here: https://github.com/ramon-rocha/MonoDeve ... sion.cobra

It's a little messy so let me explain it. This extension inherits from the MonoDevelop.Ide.Gui.Content.CompletionTextEditorExtension class which in turn inherits from the TextEditorExtension class in the same namespace. You can check these out here:

https://github.com/mono/monodevelop/blo ... tension.cs

https://github.com/mono/monodevelop/blo ... tension.cs

An instance of CobraCompletionTextEditorExtension will be created for each Cobra file in the project. The inherited base class provides a few methods and properties that are useful to us including:

.document - This is an instance of MonoDevelop.Ide.Gui.Document. It acts as the bridge between a project file, the containing project (if any), the text editor and any of its extensions, and the output of the parsing service (i.e. the trees we were discussing earlier). From .document we can use .document.editor to get the current instance of Mono.TextEditor.TextEditorData which gives us some methods for setting or retrieving information such as the text at a certain line, the text in a certain region, where the cursor currently is, and it also has its own .document which is an instance of Mono.TextEditor.Document.TextDocument. Most of the methods in .editor really just pass through to the TextDocument instance it contains.
You can check these out here: https://github.com/mono/monodevelop/blo ... torData.cs

https://github.com/mono/monodevelop/blo ... ocument.cs

.editor - This is just a pass through property that returns .document.editor

.keyPress - This method determines whether the extension should handle the key that was just pressed. The base completion class we inherit from seems to handle things for us so we don't really need to do anything with it as far as I can tell. You'll notice in the cobra addin that I'm just calling the base method. I had some trace statements in there so I could see what it was doing.

.handleCodeCompletion - This method is an override for the method in the base class which really doesn't do anything. It is called when the base class determines that an autocompletion request has been triggered. It's up to us to return the list of possible completions for the current context in our overridden method. We are provided with an instance of CodeCompletionContext which tells us the line and column number that triggered the completion; it also gives us the character that was last pressed and the length of the word that triggered the completion. The triggerWordLength is an inout parameter and it's still not quite clear to me how it works. What I've been able to determine is that if the completion character was not a space (e.g. a letter) then you need to add 1 to this value, otherwise the letter that was just typed will not be considered part of the word to complete.

Okay, let's back up because it's about to get interesting (finally, right?!). We have created a class called CobraCompletionTextEditorExtension which provides us with a MonoDevelop.Ide.Gui.Document via .document which contains the results of the type system service's call to our parser. You can get to that information from the completion extension via .document.parsedDocument.getAst<of CobraModule>. Our completion extension class also overrides a method in the base class called 'handleCodeCompletion' which gives us the line and column numbers and the character that triggered it. So, if 'handleCodeCompletion' is called and the triggering character was a dot or underscore, we can now use the line number from the provided CodeCompletionContext and feed that into our .document.editor methods to get the region the cursor is currently in, then look up the members for the class corresponding to that region in the syntax tree of our .document.parsedDocument, add them to a CompletionDataList and then FINALLY return that from 'handleCodeCompletion' and magically, a list of matching words shows up in MonoDevelop! Wow! But wait a minute, I left out one detail and now we are reaching the limits of my work so far...

Even with a line number and a tree describing all our classes and their members, how do we know which line number corresponds to which class's region? Well, we have to define those regions ourselves. Only then can we make use of the methods in .document.editor to determine which region, or regions, correspond to a given line number. There are many ways to accomplish this and it's not necessarily obvious which way is best. I'll talk about what the current code does and what I've done in the past in experiments.

Look at line 304 of https://github.com/ramon-rocha/MonoDeve ... rser.cobra . It's the body of a method called _addFolds in our parser. However you feel about the usefulness of folding code, it is useful in one respect: you have to define the regions that will fold/unfold, and these will typically correspond to the regions for classes and methods. There's room for improvement in this method but it seems to handle most cases correctly, including block comments and doc strings, which you also need to know the regions for so you don't provide completion results when inside these. The function builds up a list of instances of the MonoDevelop.Ide.TypeSystem.FoldingRegion class, each of which contains an instance of the ICSharpCode.NRefactory.TypeSystem.DomRegion class. We'll talk about the NRefactory library later but for now just know that this includes properties for the starting line and column, and the ending line and column for a region. You can see the code for the DomRegion class here: https://github.com/icsharpcode/NRefacto ... mRegion.cs

The list of folds is added to the ParsedDocument along with the AST so it's also available from our completion extension via .document.parsedDocument.foldings. Hopefully, you are getting a picture in your head of one way this could all work. You have the current line number, you find which regions contain that line by iterating through the folds, you determine which AST nodes correspond to those containing regions (maybe a method, a class, and a namespace) and then you return the relevant completion entries.
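
The "find which regions contain that line" step might look roughly like this. It is a Python sketch with a made-up `Region` class standing in for a fold's DomRegion (only line numbers, no columns), not the addin's actual code.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Region:
    """Hypothetical stand-in for a fold: a name plus inclusive 1-based line bounds."""
    name: str
    begin_line: int
    end_line: int

    def contains(self, line):
        return self.begin_line <= line <= self.end_line

def regions_containing(regions, line):
    """Return the regions that contain `line`, outermost first."""
    hits = [r for r in regions if r.contains(line)]
    # Earlier start first; for equal starts, the longer (outer) region first.
    return sorted(hits, key=lambda r: (r.begin_line, -r.end_line))
```

For a cursor inside a method, this yields the class region followed by the method region, which is the order you would want for walking from the outer node to the inner one.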

There's a big problem with this approach. You need to have some way to resolve a region to its containing node. You could take the starting line for the region, examine the text at that line via .document.editor.getLineText(lineNumber).trimStart and check to see if it starts with "namespace ", "class " or "def ". Yuck! I know, right? This is what the current code was starting to do and it's gross and I hate it.
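
For what it's worth, the keyword check does not have to be a chain of if/else-if branches. Below is a Python sketch of a table-driven version; `declaration_kind` and the keyword tuple are illustrative only, and the list is deliberately short rather than Cobra's complete set of declaring keywords.

```python
# Illustrative subset of keywords that can start a declaration line.
DECLARING_KEYWORDS = ("namespace", "class", "interface", "struct", "mixin", "def", "var")

def declaration_kind(line_text):
    """Return (keyword, name) if the line starts a declaration, otherwise None."""
    parts = line_text.strip().split()
    if len(parts) >= 2 and parts[0] in DECLARING_KEYWORDS:
        # Drop an argument list glued to the name, as in 'def foo(x as int)'.
        name = parts[1].split("(")[0]
        return parts[0], name
    return None
```

This only looks at the first two whitespace-separated words on the line, so it relies on the declaring keyword and the identifier always coming first.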

Another way to generate regions is using the results of the parser. In the addin's Parser, you have a tree of nodes after successful parsing, some of which are declarative and each of which has a token associated with it. That token has line information. Traverse the syntax tree. When you hit a node for a namespace, class, method, etc. you have a starting line so you generate a region for it by reading through the source code and looking for the ending line by checking the indent level. I know this works because I tried it a while ago: https://github.com/ramon-rocha/MonoDeve ... mParser.cs
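
The "look for the ending line by checking the indent level" step can be sketched like this in Python. It is a simplified illustration, not the code from the linked experiment: `region_end` and `indent_of` are hypothetical helpers, and comments, doc strings and line continuations are ignored.

```python
def indent_of(line):
    """Number of leading whitespace characters (tabs or spaces)."""
    return len(line) - len(line.lstrip())

def region_end(lines, start):
    """Return the 1-based last line of the block declared on 1-based line `start`."""
    base = indent_of(lines[start - 1])
    end = start
    for number in range(start + 1, len(lines) + 1):
        text = lines[number - 1]
        if not text.strip():
            continue  # a blank line neither ends the block nor extends it
        if indent_of(text) <= base:
            break  # back at (or above) the declaring line's indent level
        end = number
    return end
```

Because blank lines never extend the region, trailing blank lines after a method are left out of it.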

You now need to decide where you want to stick this Node-to-region relationship such that the completion extension class has access to it, because by default it only has access to the AST and Foldings. This can be done either by extending the DefaultParsedDocument class to include a dictionary mapping regions to Nodes, extending Cobra's SyntaxNode class to include a DomRegion, sticking it in some shared dictionaries somewhere (tried that, btw; it's hard to test), or probably something way simpler I haven't thought of. I'm thinking now it actually makes sense to create regions both via folds and visiting the AST so it will be easy to equate the two regions and tie them together that way somehow.

Anyways, I got a bit of analysis paralysis at this point so this is where the completion portion of the addin is right now. It is my 4th attempt in case you are keeping track. A few more important pieces which I do not have a strong understanding of remain. This includes making use of the NRefactory library for resolving types and/or running the binding phases of the Cobra Compiler to resolve the types. Remember, we've been talking about just "." and "_" completion so far.

===PART 3===
TODO: Talk about semantic analysis with NRefactory

edit: grammar

by **Charles** » Sun Nov 18, 2012 12:39 am

You said:

However, if we typed just "." or "_" and then wanted some autocompleted members, this is actually possible at this point as we have all the required information.

Even for inherited members? Imagine your class has a base class defined in a library. I think after completing Cobra's parsing, you won't have the information for those. You'll need further phases like binding "use", inheritance and interface.

Btw I have found some minimal utility in editors like TextMate and UltraEdit that can do code completion by simply parsing the words out of a file. There's nothing "intelligent" about it, but it saves some typing, and in editors like "gedit" which don't support it (out of the box), I miss even this small convenience. So whatever completion help can be provided as a stop-gap measure on the way to fully code-intelligent autocompletion will certainly be appreciated, at least by me.

by **kobi7** » Sun Nov 18, 2012 2:02 am

not to dismiss this huge work that you're doing,
but maybe ctags(exuberant) can provide some of this, and be easier than parsing all the types. (it's what vim uses)
(I'm not sure if it does types, though.)
then the MD addin can use it (external), even if it's not the conventional way in monodevelop.

see here:
http://ctags.sourceforge.net/EXTENDING.html

full support then comes later (assuming you're not discouraged),
but in the meantime it will surely reduce dev time and help make cobra more viable.

The work you tackle is admirable to say the least

by **nerdzero** » Sun Nov 18, 2012 4:53 am

Crap, you're right, Charles. Inherited members wouldn't be visible without running some binding phases.

As for word completion without intelligence, this wouldn't be too hard I think. Just visit the tree nodes, collect all the various identifiers, and add them to the completion list. I haven't gotten that far in the write up yet to show how to do this. You can then use reflection to collect all the types from your assembly references. Get the list of Cobra keywords and slap these into the completion list as well. I just spent a little time playing with the autocomplete options in MonoDevelop and this could be usable as long as you uncheck "Commit completion list selection with space or punctuation" under Edit > Preferences > Text Editor > General. Otherwise it gets really annoying trying to declare new stuff as you keep having to undo the last autocomplete. I could pursue this but let me first finish describing what I have so far and the challenges I see for "intelligent" completion. My dream is that I finish this post describing what I am doing and someone says, "Ramon, you're making this way harder than it needs to be. Just do X Y and Z and we'll have autocompletion" and then I'll be all like "Oh, sweet!".

Thanks for the suggestion about ctags, kobi, but I'm not really familiar with it. I worry it would take me quite some time to get up to speed unless someone could give me the quick rundown on where to look and what I would have to do. When I asked for help on the MonoDevelop forums about implementing autocompletion I was told it was "Not that much to learn". Well, 3 months later here I am. I think they assumed I knew what I was doing.

by **Charles** » Sun Nov 18, 2012 12:27 pm

If you have an editor that supports ctags, then see our wiki page with the appropriate Cobra config:
ExuberantCtags

However, I doubt that this is the correct approach for MonoDevelop.

Ramon, re: simple autocomplete, you wouldn't even need to read the referenced libraries to match what TextMate and UltraEdit are doing, which is just looking at the current file. Also, you might be able to get away with just using the tokenizer, although if the parser gives you the ability to highlight syntax errors, that's an added bonus over those editors.
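
A word-based completer of this kind is only a few lines. Here is a Python sketch that uses a regular expression in place of a real tokenizer; `word_completions` is a made-up name and the keyword list is whatever the caller passes in.

```python
import re

WORD_RE = re.compile(r"[A-Za-z_]\w*")

def word_completions(text, prefix, keywords=()):
    """Return sorted unique words from `text` and `keywords` that start with `prefix`."""
    words = set(WORD_RE.findall(text)) | set(keywords)
    # Skip the prefix itself: offering exactly what was already typed is not useful.
    return sorted(w for w in words if w.startswith(prefix) and w != prefix)
```

It knows nothing about types or scope, so it will happily offer a local variable from another method, but it matches what the editors mentioned above do with the current file.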

by **hopscc** » Mon Nov 19, 2012 1:36 am

Ctags is basically a text based key-to-position-in-file mapping driven by a prebuilt table.

All the ctags implementations walk a set of source files extracting 'interesting' position points (class, function, method, property declarations)
and generate a matching key for them (classname, functionName, ...), producing a 'database' (usually a flat file) containing the tag (text string) and some representation of the file and the position in it where the corresponding item is declared (explicit file and line# or, better, an RE search string).

tags-aware editors have a keystroke that scans the token the cursor is on, looks up that token in the ctags database and uses the corresponding entry to move editor display to the tagged position ( the declaration point of the tagged item).
(additional keys search for next duplicate and return to origin)
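
As a rough picture of that key-to-position table, here is a Python sketch. It is not the real ctags tool or its file format: `build_tags` is a hypothetical function that only recognizes lines starting with class or def and keeps the table in memory.

```python
import re

DECL_RE = re.compile(r"^\s*(class|def)\s+([A-Za-z_]\w*)")

def build_tags(files):
    """Map each declared name to a list of (file name, 1-based line) positions."""
    tags = {}
    for file_name, text in files.items():
        for number, line in enumerate(text.splitlines(), start=1):
            match = DECL_RE.match(line)
            if match:
                tags.setdefault(match.group(2), []).append((file_name, number))
    return tags
```

Jumping to a declaration is then a dictionary lookup on the word under the cursor; a name with several entries corresponds to the "next duplicate" key.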

It's a similar idea to a structure editor pane in an IDE but it's point-to-point rather than visual structure-tree-to-point.

wrt completion, ctags is the 'other end' of the problem in that it assumes/needs an existing key/tag to find the declaration.
Completion needs the Type of the current item/token (under the cursor) to present a list of possibilities to complete with.
(plus the ctags database isn't hierarchical Type + contents...)

The idea is applicable though
- build ( at some load stage ) a database of Types and their 'interesting' contents ( classname, method, var, property),
at completion point ( '.' press or space) get the token before the cursor - lookup its Type, reference the database and present the 'interesting' contents list as part of that type, prune the presented list as more partial text content is entered.
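
In Python, that database-plus-lookup idea could be sketched as below. Both functions and the `variable_types` mapping are assumptions made for the example; in a real addin the types would have to come from the compiler's binding phases or a resolver.

```python
def build_type_database(declarations):
    """Map each type name to the sorted list of its member names."""
    database = {}
    for type_name, member in declarations:
        database.setdefault(type_name, set()).add(member)
    return {name: sorted(members) for name, members in database.items()}

def complete_member(database, variable_types, token, typed_so_far=""):
    """Look up the type of `token` and list its members matching what was typed so far."""
    type_name = variable_types.get(token)  # None when the token's type is unknown
    members = database.get(type_name, [])
    return [m for m in members if m.startswith(typed_so_far)]
```

Calling `complete_member` again with a longer `typed_so_far` string prunes the list as more text is entered.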

I'd imagine that if the IDE already has a structure editor display ( generated from walking the AST or parse tree and extracting interesting pieces ) that could act as the database to lookup for completion possibilities.

Typing the token should be just a walk back through the AST to find the declaration for the type of the token before the cursor (if there isn't one, it's a partial entry: look for the previous (identifier) token before a separator).

Both of these could be done by walking the entire syntax tree (or requesting Assembly/Class parse and walking that tree) but a prebuilt database (of selected pieces of the full app parse tree) allows pre-caching and is useful for other things (like a structure editor).

Does monodevelop have a structure editor pane ?
(usually it's displayed as a tree - first level nodes are things like classes, leaves to that are top level items in the class (methods, vars, events, props, etc.); pressing any one of these moves the display to the declaration point of the item pressed)

A text based lookup is just a text search through available files (or just the current file) for a Type declaration corresponding to the Type of the token being autocompleted. When found, slurp (read the text of the class decl) and filter (pull out the high level declared pieces: method, prop, var, etc. declarations), optionally filter that by the text entered so far (if not just a Typed token), and place the resulting text list in the autocomplete list...

by **nerdzero** » Mon Nov 19, 2012 1:42 pm

I've updated the original post with more information. If you already read it before you can start from ===PART 2===. Hops, I'll read through your post and respond soon. Thanks.

by **Charles** » Mon Nov 19, 2012 2:54 pm

Re: Part II,

-- I didn't have any "yuck" reaction to checking the start of a line for the declaring keyword (namespace, class, interface, struct, mixin, var, etc.). It's a key characteristic of Cobra that such declarations always start with a keyword that tells you what kind of thing is being declared, followed by the identifier. Nothing like doc strings, attributes, etc. ever precede those two tokens or ever will.

This design was deliberate and you can take advantage of it if that helps you.

-- In the Cobra compiler, interface INode has this:

interface INode
    get addOnValues as Dictionary<of String, dynamic>

Meaning that you can attach whatever data/objects you want to any node at any time. Like so:

someNode.addOnValues['domRegion'] = DomRegion(...)
...
trace someNode.addOnValues['domRegion']

-- Cobra's syntax nodes have a .token and can therefore tell you the starting line number. Would it be helpful if various nodes implemented an .endingToken method? For example, the implementation for Container<of T>, which is the granddaddy of many a node, would return the .endingToken of its last declaration. And so on for various classes.
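
The .endingToken idea, combined with the add-on dictionary above, can be pictured with this Python sketch. `Token`, `Node`, `ending_token` and `add_on_values` are simplified stand-ins invented for the example (a token here carries only a line number); they are not the Cobra compiler's classes.

```python
from dataclasses import dataclass, field

@dataclass
class Token:
    line: int

@dataclass
class Node:
    token: Token                                        # first token of the node
    declarations: list = field(default_factory=list)    # child nodes, if any
    add_on_values: dict = field(default_factory=dict)   # free-form extra data

    def ending_token(self):
        """A leaf ends at its own token; a container ends where its last declaration ends."""
        if not self.declarations:
            return self.token
        return self.declarations[-1].ending_token()
```

With both a starting and an ending token available, a region can be computed from the tree alone and stored on the node, for example under a 'domRegion' key.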

-- Re: your analysis paralysis, I'm not sure how to help further because I'm not sure what else you are stuck on. Further questions or relaying of issues will help.

-- I'm really jazzed about having autocompletion support in Cobra!

by **nerdzero** » Mon Nov 19, 2012 4:53 pm

Charles wrote:-- I didn't have any "yuck" reaction to checking the start of a line for the declaring keyword (namespace, class, interface, struct, mixin, var, etc.).
...

Anytime I start writing 'if' ... 'else if'...'else if'...'else if'... I have to reevaluate the life decisions that led me to that point. I kid.

But, seriously, I haven't really dug into NRefactory yet in my explanation. Both MonoDevelop and #develop use it for type resolution, autocompletion, refactoring, etc.

https://github.com/icsharpcode/NRefactory

http://www.codeproject.com/Articles/408 ... sharp-code

I'm envisioning creating a new AST by visiting the parse tree, decorating those nodes with classes that implement the various TypeSystem interfaces from NRefactory, and implementing a "Resolver" that runs the binding phases of the compiler. This would let me leverage some of the existing infrastructure for resolving types from lines of text. It's quite the dragon for me to slay though.

It's funny you bring up INode.addOnValues because I stumbled on it a few days ago and thought "Hmm, this could be useful". Sounds like the perfect place for the DomRegion.

Charles wrote:-- Cobra's syntax nodes have a .token and can therefore tell you the starting line number. Would it be helpful if various nodes implemented an .endingToken method? For example, the implementation for Container<of T>, which is the granddaddy of many a node, would return the .endingToken of its last declaration. And so on for various classes.

This would be incredibly useful. As would having the position information for doc string and comment tags but I suppose those don't really belong as nodes in an AST. I am already getting this information the hard way right now but I definitely wouldn't mind removing some of that code from the addin.

Charles wrote:-- Re: your analysis paralysis, I'm not sure how to help further because I'm not sure what else you are stuck on. Further questions or relaying of issues will help.

Mostly corner cases that I find as I go along. Any ideas for this? Let's say I have the following code:

class SomeClass
    def method1
        a = 1

    def method2
        pass

Is that blank line between method1 and method2 part of the region of method1? My gut says no, but if not then it has to be an exception to determining which region you are in because when the cursor is on that blank line, you are probably adding code to method1. If you say it depends on the indent level, then what if there were multiple blank lines? And if it is part of the region for that method I still don't think it should be part of the FoldingRegion because they look really ugly when you include trailing blank lines.

Also, explicit and implicit line continuations have been giving me some trouble for defining the regions. Same with namespace regions that are not indented. I think all these would be solved with having an .endingToken available though. Is that right?

hopscc wrote:Does monodevelop have a structure editor pane ?
(usually it's displayed as a tree - first level nodes are things like classes, leaves to that are top level items in the class (methods, vars, events, props, etc.); pressing any one of these moves the display to the declaration point of the item pressed)

Yes, it does. I believe this is implemented via an "Ambience" class but I may be way off on that. I will start researching this more. Thanks.

by **Charles** » Mon Nov 19, 2012 5:34 pm

Re: comments and doc strings, I'm not sure how to record and convey that information from the parser. Certainly these couldn't be put in the node tree. Maybe the info about them could be collected in a list which is produced as a side effect of parsing. Would that be helpful?

Re: the blank line between two methods, could you treat it as not being part of the region for folding purposes, but then upon entering code, consider it part of the prior region? Just an idea.
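
One way to read that suggestion, sketched in Python: keep the folding regions tight, and only when the cursor sits on a blank line fall back to the nearest non-blank line above it. The function names and the (name, begin, end) tuples are assumptions for the example, and the sketch ignores the cursor's own indent level, which may matter in practice.

```python
def innermost_region(regions, line_number):
    """Return the smallest (name, begin, end) region containing the line, or None."""
    hits = [r for r in regions if r[1] <= line_number <= r[2]]
    return min(hits, key=lambda r: r[2] - r[1]) if hits else None

def region_for_completion(regions, lines, line_number):
    """On a blank line, use the region of the nearest non-blank line above it."""
    probe = line_number
    while probe > 1 and not lines[probe - 1].strip():
        probe -= 1
    return innermost_region(regions, probe)
```

With the SomeClass example, the blank line between the two methods stays outside the method1 fold, but a completion request on that line is still answered as if the cursor were in method1. Multiple blank lines behave the same way, since the loop keeps walking upward.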

You are correct that .endingToken would solve your problems re: line continuations.
